Background

A complete quantitative stable isotope probing (qSIP) workflow using the qSIP2 package starts with three input files and ends with calculated excess atom fraction (EAF) values, along with a ton of intermediate data. This vignette is a high-level walkthrough of the major steps, with links to more specific vignettes where more detail is appropriate.

The Input Files

Preparing and formatting the input files is often the most tedious part of any analysis. Our hope is that the rigid (and opinionated) requirements imposed by qSIP2 will streamline the creation of these files, and that the automated validation checks will remove many common sources of error or confusion.

Source Data

The source data is the highest level of metadata with a row corresponding to each original experimental or source material object. An example source dataframe is included in the qSIP2 package called example_source_df.

The first few rows of example_source_df
source total_copies_per_g total_dna Isotope Moisture isotopolog
S149 34838665 74.46539 12C Normal glucose
S150 53528072 109.01522 12C Normal glucose
S151 95774992 182.16852 12C Normal glucose
S152 9126192 23.68963 12C Normal glucose
S161 41744046 67.62552 12C Drought glucose
S162 49402713 94.21217 12C Drought glucose

There are a few required columns for valid source data, including a unique ID (source_mat_id) and the isotope and isotopolog designations for the labeled substrate. Additional columns can be added as necessary (e.g. Moisture) for grouping and filtering later in the process.

For growth calculations there are three additional requirements - timepoint, total_abundance and volume. These are not necessary for the standard EAF workflow and will instead be addressed in a forthcoming growth vignette.

Once the dataframe is ready, the next step is to convert it to a qsip_source_data object. This is one of the main qSIP2 objects to hold and validate the data. Each of the required columns of metadata is assigned to a column in your dataframe. For example, below the “Isotope” column in the dataframe is assigned to the isotope parameter in the qsip_source_data object.

source_object <- qsip_source_data(example_source_df,
  isotope = "Isotope",
  isotopolog = "isotopolog",
  source_mat_id = "source"
)

class(source_object)
#> [1] "qsip_source_data" "S7_object"

Creating this object also renames the columns to standard names, based on the assignments supplied in the function call above.

Original Headers Modified Headers
source source_mat_id
total_copies_per_g total_copies_per_g
total_dna total_dna
Isotope isotope
Moisture Moisture
isotopolog isotopolog

If the column names in the dataframe already match the expected standard names, then you can skip assigning them and they should be identified correctly.
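
For example, a minimal sketch (using a hypothetical renamed_source_df and source_object2, and assuming the defaults pick up matching column names as described above):

# minimal sketch: rename the columns to the standard names first, then the
# explicit assignments can be omitted ("isotopolog" already matches)
renamed_source_df <- example_source_df |>
  dplyr::rename(
    source_mat_id = source,
    isotope = Isotope
  )

source_object2 <- qsip_source_data(renamed_source_df)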

A dataframe with the original headers can be recovered using the get_dataframe() method with the original_headers = TRUE option.

get_dataframe(source_object, original_headers = TRUE)
Isotope isotopolog source total_copies_per_g total_dna Moisture
12C glucose S149 34838665 74.46539 Normal
12C glucose S150 53528072 109.01522 Normal
12C glucose S151 95774992 182.16852 Normal
12C glucose S152 9126192 23.68963 Normal
12C glucose S161 41744046 67.62552 Drought
12C glucose S162 49402713 94.21217 Drought

See vignette("source_data") for more details on working with source data.

Sample Data

The sample metadata is the next level of detail, with one row for each fraction (i.e. one row for each set of fastq files that were sequenced).

The first few rows of example_sample_df
sample source Fraction density_g_ml dna_conc avg_16S_g_soil
149_F1 S149 1 1.778855 0.0000000 4473.7081
149_F2 S149 2 1.773391 0.0000000 986.6581
149_F3 S149 3 1.765742 0.0000000 4002.7026
149_F4 S149 4 1.759185 0.0000000 3959.7283
149_F5 S149 5 1.752629 0.0012413 5725.7319
149_F6 S149 6 1.746072 0.0128156 7566.2722

Again, there are several necessary columns for valid sample data, including a unique sample ID (sample_id), the source they came from (source_mat_id), the fraction ID (gradient_position), the fraction density (gradient_pos_density) and a measure of abundance (total DNA or qPCR copy number) in that fraction (gradient_pos_amt).

An additional column that can be derived is the proportion of each source's total abundance found in each fraction. The add_gradient_pos_rel_amt() function calculates this by dividing each fraction's abundance by the total abundance for that source, storing the result in a gradient_pos_rel_amt column.

But there is no need to do this if you already have the relative amounts calculated in your dataframe.

sample_df <- example_sample_df |>
  add_gradient_pos_rel_amt(source_mat_id = "source", amt = "avg_16S_g_soil")
#> Additional gradient_pos_rel_amt column added
sample source Fraction density_g_ml dna_conc avg_16S_g_soil gradient_pos_rel_amt
149_F1 S149 1 1.778855 0.0000000 4473.7081 0.0001284
149_F2 S149 2 1.773391 0.0000000 986.6581 0.0000283
149_F3 S149 3 1.765742 0.0000000 4002.7026 0.0001149
149_F4 S149 4 1.759185 0.0000000 3959.7283 0.0001137
149_F5 S149 5 1.752629 0.0012413 5725.7319 0.0001643
149_F6 S149 6 1.746072 0.0128156 7566.2722 0.0002172
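
For reference, a minimal sketch of the equivalent manual calculation described above (dividing each fraction's abundance by the per-source total) using dplyr:

# sketch of the equivalent manual calculation: divide each fraction's abundance
# by the total abundance of its source
library(dplyr)

manual_rel_amt <- example_sample_df |>
  group_by(source) |>
  mutate(gradient_pos_rel_amt = avg_16S_g_soil / sum(avg_16S_g_soil)) |>
  ungroup()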

Again, we make a qSIP2 object for this data, this time as a qsip_sample_data object. The columns in the dataframe are assigned to the appropriate parameters, and any column names exactly matching the parameter name will automatically be identified. Similar to the source_data above, the names in the sample_data object will be modified from the original names to the standardized names.

sample_object <- qsip_sample_data(sample_df,
  sample_id = "sample",
  source_mat_id = "source",
  gradient_position = "Fraction",
  gradient_pos_density = "density_g_ml",
  gradient_pos_amt = "avg_16S_g_soil",
  gradient_pos_rel_amt = "gradient_pos_rel_amt"
)

class(sample_object)
#> [1] "qsip_sample_data" "S7_object"

See vignette("sample_data") for more information on working with sample data including the built-in validations.

Feature Data

Finally, the last of the three necessary input files is a feature abundance table, aka “OTU table” or “ASV table”. The format of this dataframe has the unique feature IDs in the first column, and an additional column for each sample. Each row then contains the whole number (non-normalized) counts of each feature in each sample.

For now, the validation step defaults to requiring all values to be counts (positive integers), but other type options include coverage (for working with MAGs or metagenomes), relative (if you already have relative abundances) and normalized (if you have spike-ins or another method that determines the correct abundance in each sample); a hedged sketch follows the example code below.

First bit of example_feature_df
ASV 149_F1 149_F2 149_F3 149_F4 149_F5
ASV_1 1245 376 582 1258 692
ASV_2 1471 569 830 1373 737
ASV_3 342 152 211 389 218
ASV_4 288 119 161 294 157
ASV_5 317 108 95 292 164
ASV_6 201 73 130 250 112

feature_object <- qsip_feature_data(example_feature_df,
  feature_id = "ASV"
)

class(feature_object)
#> [1] "qsip_feature_data" "S7_object"

See vignette("feature_data") for more details.

The qsip_data Object

The qsip_data class is the main workhorse object in the qSIP2 package. It is built from validated versions of the three previous objects, and is meant to be a self-contained object with all of the necessary information for analysis.

qsip_object <- qsip_data(
  source_data = source_object,
  sample_data = sample_object,
  feature_data = feature_object
)
#> There are 15 source_mat_ids, and they are all shared between the source and sample objects
#> There are 284 sample_ids, and they are all shared between the sample and feature objects

class(qsip_object)
#> [1] "qsip_data" "S7_object"

This function will report if all source_mat_ids are shared between the source and sample data, and if all sample_ids are shared between the sample and feature data. If it reports there are some unshared ids, you can access them with show_unshared_ids(qsip_object), but note that it is just a warning and does not stop the creation of the qSIP object.

Visualizations

Behind the scenes, creating this object also runs some other calculations, in particular the weighted-average density (WAD) of each feature in each source and the tube relative abundance of each feature. With these, several visualizations can be made using built-in functions.
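
Conceptually, the WAD of a feature in a source is its abundance-weighted mean density across that source's fractions. A minimal sketch of the idea on toy vectors (not the package internals):

# sketch of the weighted-average density (WAD) idea:
# WAD = sum(density_k * abundance_k) / sum(abundance_k) over fractions k of one source
densities  <- c(1.778, 1.773, 1.765, 1.759, 1.752)  # gradient_pos_density values
abundances <- c(4473, 986, 4002, 3959, 5725)        # one feature's abundance per fraction
wad <- sum(densities * abundances) / sum(abundances)
# equivalently: weighted.mean(densities, abundances)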

plot_source_wads(qsip_object, 
                 group = "Moisture",
                 title = "WAD of each source, grouped by Moisture")

plot_sample_curves(qsip_object,
                   colors = c("12C" = "orange", "13C" = "magenta"),
                   title = "Normalized density curves for each sample")

Another sanity check is making sure the reported density values fall on a reasonably straight line against gradient position. This can be plotted, and a Cook's distance can then be calculated to highlight any outliers. Note that the ends of the gradient are often flagged as outliers, although they are not necessarily problematic.
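
A minimal sketch of that check for a single source using base R (qSIP2 has its own plotting helpers; this only illustrates the idea):

# sketch: fit density against gradient position for one source and flag outliers
one_source <- subset(example_sample_df, source == "S149")

fit <- lm(density_g_ml ~ Fraction, data = one_source)
plot(density_g_ml ~ Fraction, data = one_source)
abline(fit)

# Cook's distance highlights influential points (a common rule of thumb is 4/n)
which(cooks.distance(fit) > 4 / nrow(one_source))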

The qsip_data object is designed to contain “slots” for each new analysis step. Although you could create a new object for each step of the workflow, you can also assign the output of each step back to the original object in order to keep everything together.

Main Workflow

Now that we have a validated qsip_data object, we can start the main workflow consisting of comparison grouping, filtering, resampling and finally calculating EAF values.

Comparison Grouping

Your qsip_data object likely contains all of your data, but you may only want to run comparisons on certain subsets. The show_comparison_groups() function attempts to identify and suggest the sources you may want to compare.

show_comparison_groups(qsip_object,
  group = "Moisture",
  isotope = "isotope",
  source_mat_id = "source_mat_id"
)
Output of show_comparison_groups()
Moisture 12C 13C
Normal S149, S150, S151, S152 S178, S179, S180
Drought S161, S162, S163, S164 S200, S201, S202, S203

The group argument is the most important here, as it defines the rows that constitute a comparison. The isotope argument defines the labeled and unlabeled values for the comparisons. This can be more complex, particularly if you have more than one isotopolog, and details will be provided in a future vignette.

The first row shows the sources we likely want to use as unlabeled for the “Normal” moisture comparison (S149, S150, S151 and S152) and the labeled sources to compare them to (S178, S179 and S180). Sometimes you may instead want to compare the labeled samples in a group to all of the unlabeled sources. The qSIP2 package has a convenient way to get those using the get_all_by_isotope() function.

get_all_by_isotope(qsip_object, "12C")
#> [1] "S149" "S150" "S151" "S152" "S161" "S162" "S163" "S164"

The show_comparison_groups() function is entirely informational. There is a more automated way to run the qSIP workflow using the run_comparison_groups() function, and it will be detailed in another vignette.

Filter Features

The filter features step does two things. First, it is where the sets of labeled and unlabeled sources are defined for a specific comparison. Second, it is where you explicitly say how prevalent a feature must be to be considered “present” in a source. In other words, you define how many replicate sources, and how many fractions within each source, a feature must be found in so that an accurate WAD value can be calculated for it.

The run_feature_filter() function takes a qsip_data object along with these parameters, allowing you to tailor the filtering precisely. The stricter the filtering, the fewer features will pass.

qsip_normal <- run_feature_filter(qsip_object,
  unlabeled_source_mat_ids = get_all_by_isotope(qsip_object, "12C"),
  labeled_source_mat_ids = c("S178", "S179", "S180"),
  min_unlabeled_sources = 6,
  min_labeled_sources = 3,
  min_unlabeled_fractions = 6,
  min_labeled_fractions = 6
)
#> There are initially 2030 unique feature_ids
#> 1705 of these have abundance in at least one fraction of one source_mat_id
#> =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
#> Filtering feature_ids by fraction...
#> 1519 unlabeled and 1417 labeled feature_ids were found in zero fractions in at least one source_mat_id
#> 1440 unlabeled and 830 labeled feature_ids were found in too few fractions in at least one source_mat_id
#> 299 unlabeled and 209 labeled feature_ids passed the fraction filter
#> In total, 346 unique feature_ids passed the fraction filtering requirements...
#> =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
#> Filtering feature_ids by source...
#> 47 unlabeled and 137 labeled feature_ids failed the source filter because they were found in zero sources
#> 196 unlabeled and 127 labeled feature_ids failed the source filter because they were found in too few sources
#> 103 unlabeled and 82 labeled feature_ids passed the source filter
#> =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
#> In total, 64 unique feature_ids passed all fraction and source filtering requirements

Note, although I said earlier you can overwrite your qsip_data objects as you go, here it makes sense to create two versions, one for each moisture treatment. We’ll take the original qsip_object and save the filtered Normal dataset to qsip_normal, and the Drought dataset to qsip_drought.

Of the 1,705 features found in the “Normal” data, we can see our rather strict filtering removed all but 64 features from the dataset. We can visualize these results on a per-source basis with the plot_filter_gradient_position() function.

Although a large number of features were removed, we can tell that the 64 that remained actually still make up a large proportion of the total abundance in each sample. In A above, the retained features (in blue) make up ~75-85% of the total data, while the removed data (orange) is the remaining ~15-25%.

In B, we can see that a surprisingly large number of features are found 0 times in many sources (gray) and will therefore never count as present there, regardless of our filtering choices. And although there are ~100-200 features that passed the filtering requirements (blue) in any given source, our requirement that min_unlabeled_sources = 6 and min_labeled_sources = 3 means that only the features present in many of the blue slices will be retained, leaving only 64 total.

Let’s do the same comparison with the drought samples.

qsip_drought <- run_feature_filter(qsip_object,
  unlabeled_source_mat_ids = get_all_by_isotope(qsip_object, "12C"),
  labeled_source_mat_ids = c("S200", "S201", "S202", "S203"),
  min_unlabeled_sources = 6,
  min_labeled_sources = 3,
  min_unlabeled_fractions = 6,
  min_labeled_fractions = 6
)
#> There are initially 2030 unique feature_ids
#> 1877 of these have abundance in at least one fraction of one source_mat_id
#> =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
#> Filtering feature_ids by fraction...
#> 1691 unlabeled and 1558 labeled feature_ids were found in zero fractions in at least one source_mat_id
#> 1440 unlabeled and 1212 labeled feature_ids were found in too few fractions in at least one source_mat_id
#> 299 unlabeled and 285 labeled feature_ids passed the fraction filter
#> In total, 367 unique feature_ids passed the fraction filtering requirements...
#> =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
#> Filtering feature_ids by source...
#> 68 unlabeled and 82 labeled feature_ids failed the source filter because they were found in zero sources
#> 196 unlabeled and 171 labeled feature_ids failed the source filter because they were found in too few sources
#> 103 unlabeled and 114 labeled feature_ids passed the source filter
#> =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
#> In total, 89 unique feature_ids passed all fraction and source filtering requirements

And only 89 features were retained in the Drought dataset.

Strictly speaking, there is no requirement to filter strictly, and it is possible to run run_feature_filter() with all parameters set to 1. Filtering was originally recommended because it dramatically sped up the compute time, but running all features vs. a subset in qSIP2 has minimal impact. Further, setting all values to 1 and using the allow_failures flag in the resampling step provides an alternative way of deciding which features to keep, rather than relying solely on their prevalence among sources and samples. More information is provided in the Resampling section below and in vignette("resampling").
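
For example, a minimally filtered run might look like this (a sketch; qsip_normal_minimal is just an illustrative name):

# sketch: keep any feature found at least once, deferring the decision of which
# features are usable to the resampling step (with allow_failures = TRUE)
qsip_normal_minimal <- run_feature_filter(qsip_object,
  unlabeled_source_mat_ids = get_all_by_isotope(qsip_object, "12C"),
  labeled_source_mat_ids = c("S178", "S179", "S180"),
  min_unlabeled_sources = 1,
  min_labeled_sources = 1,
  min_unlabeled_fractions = 1,
  min_labeled_fractions = 1
)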

Importantly, since the relative abundances have already been calculated for each feature, the subsequent steps in the qSIP pipeline keep each feature’s computations independent, and the results for a specific feature will be identical regardless of whether other features are included or filtered out.

Resampling

The weighted average density (WAD) values were automatically calculated during the creation of the qsip_data object earlier. In order to calculate the confidence interval for the EAF values, we first need to run a resampling/bootstrapping procedure on the WAD values. For example, a feature found in 6 unlabeled sources will have 6 WAD values, and these 6 WAD values are resampled many times to obtain bootstrapped mean WAD values.

qsip_normal <- run_resampling(qsip_normal,
  resamples = 1000,
  with_seed = 17,
  progress = FALSE
)

As this step involves random sampling, it is good practice to set a seed. Rather than setting it outside of the function, you can pass the seed with the with_seed argument; if left blank, a random seed will be generated. Using the same seed produces identical results each time the resampling is run.

qsip_normal_17_again <- run_resampling(qsip_normal,
  resamples = 1000,
  with_seed = 17,
  progress = FALSE
)

# two runs are identical
identical(qsip_normal, qsip_normal_17_again)
#> [1] TRUE
identical(qsip_normal@resamples$l[[334]], qsip_normal_17_again@resamples$l[[334]])
#> [1] TRUE

# but individual resamplings within are different
identical(qsip_normal@resamples$l[[1]], qsip_normal@resamples$l[[2]])
#> [1] FALSE

Resampling of the drought dataset works the same way. Notice that at this step we are overwriting the original qsip_drought with the results of run_resampling(), rather than creating a new object.

qsip_drought <- run_resampling(qsip_drought,
  resamples = 1000,
  with_seed = 17,
  progress = FALSE
)

It is possible to get a resampling error if your filtering is too strict. If so, consult vignette("resampling") and consider running with allow_failures = TRUE.
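
A sketch of that call, assuming allow_failures is passed directly to run_resampling():

# sketch: tolerate features that fail to resample rather than stopping with an error
qsip_normal <- run_resampling(qsip_normal,
  resamples = 1000,
  allow_failures = TRUE,
  with_seed = 17,
  progress = FALSE
)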

EAF Calculations

And we are finally at the last main step: calculating and summarizing the excess atom fraction (EAF) values. There are two functions to run: the first, run_EAF_calculations(), calculates EAF values for the observed data and all resamplings, and the second, summarize_EAF_values(), summarizes that data at a chosen confidence level. These are split into two functions because run_EAF_calculations() can take a little longer, allowing different parameters to be tried in summarize_EAF_values() without having to recalculate everything.

We’ll also use mutate() (from dplyr) to add the original Moisture condition to each dataframe before we combine them.

qsip_normal <- run_EAF_calculations(qsip_normal)
qsip_drought <- run_EAF_calculations(qsip_drought)

normal <- summarize_EAF_values(qsip_normal, confidence = 0.9) |>
  mutate(Moisture = "Normal")
#> Confidence level = 0.9

drought <- summarize_EAF_values(qsip_drought, confidence = 0.9) |>
  mutate(Moisture = "Drought")
#> Confidence level = 0.9

eaf <- rbind(normal, drought)
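
With both summaries combined, quick comparisons across conditions are straightforward. A sketch of a per-condition mean (assuming the mean_resampled_EAF column shown in the preview table near the end of this vignette):

# sketch: a quick per-condition summary of the combined EAF values
library(dplyr)

eaf |>
  group_by(Moisture) |>
  summarize(mean_EAF = mean(mean_resampled_EAF, na.rm = TRUE))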

We can plot the top 50 features for each moisture condition.

plot_EAF_values(qsip_normal, 
                top = 50, 
                confidence = 0.9, 
                error = "ribbon",
                title = "Normal moisture")
#> Confidence level = 0.9

plot_EAF_values(qsip_drought, 
                top = 50, 
                confidence = 0.9, 
                error = "ribbon",
                title = "Drought moisture")
#> Confidence level = 0.9

The plot_feature_curves() function allows us to plot the tube relative abundances for specific feature IDs. Let’s look at two with high EAF values, and two with low values.

plot_feature_curves(qsip_normal,
  feature_ids = c("ASV_55", "ASV_84", "ASV_41", "ASV_44"),
  title = "Density curves of selected features"
)

Working with multiple qSIP objects (preview)

It is possible to work with multiple qsip_data objects if they are in a list. This will be detailed in a forthcoming vignette, but here is a sneak peek where we can use the existing summarize_EAF_values() or plot_EAF_values() functions.

qsip_list <- list("Normal" = qsip_normal,
                 "Drought" = qsip_drought)
#> Confidence level = 0.9
group feature_id observed_EAF mean_resampled_EAF lower upper labeled_resamples unlabeled_resamples labeled_sources unlabeled_sources
Normal ASV_1 -0.0153107 -0.0157044 -0.0516543 0.0236518 1000 1000 3 8
Drought ASV_1 -0.0333856 -0.0330890 -0.0808509 0.0161212 1000 1000 4 8
Normal ASV_10 0.1126260 0.1121874 0.0848992 0.1400368 1000 1000 3 8
Drought ASV_10 0.0543136 0.0541364 0.0303215 0.0776436 1000 1000 4 8
Drought ASV_100 -0.0892684 -0.0895131 -0.1488307 -0.0349891 1000 1000 3 7
Drought ASV_102 -0.0407091 -0.0407609 -0.0907747 0.0088904 1000 1000 4 8
Normal ASV_11 0.3749260 0.3743849 0.3392976 0.4094196 1000 1000 3 8
Drought ASV_11 0.2629470 0.2622983 0.2041474 0.3099201 1000 1000 4 8
Normal ASV_114 0.1926455 0.1918100 0.1234247 0.2683575 1000 1000 3 7
Drought ASV_114 0.2160639 0.2167657 0.1304984 0.2998898 1000 1000 3 7
plot_EAF_values(qsip_list, 
                top = 50,
                confidence = 0.9, 
                error = "bar")
#> Confidence level = 0.9

Piped Workflow

Although the workflow is often easier to understand and troubleshoot when broken into individual steps, it is possible to run the entire workflow in a single pipe.

For example, the normal moisture data could be filtered, resampled and have EAF values calculated in a single pipe.

# example code, not executed. Default values are not shown.

qsip_normal <- run_feature_filter(qsip_object,
  unlabeled_source_mat_ids = get_all_by_isotope(qsip_object, "12C"),
  labeled_source_mat_ids = c("S178", "S179", "S180"),
  min_unlabeled_sources = 6,
  min_labeled_sources = 3
) |>
  run_resampling() |>
  run_EAF_calculations()