Sample Data

library(dplyr)
library(qSIP2)
packageVersion("qSIP2")
#> [1] '0.17.8'

Samples and Metadata

In qSIP2, “sample data” refers to any metadata associated with the individual fractions or their sequencing data. This is the second level of metadata: more granular than the “source data”, but at a higher level than the “feature data”.

The word sample typically refers to the biological or environmental entity the DNA was isolated from (also known as the source material). In qSIP, however, because there are multiple sequencing runs per source, the term sample has historically been reserved for the sequencing of each fraction. In practice, this means you will have many samples for each single source. To keep with the proposed MISIP¹ standards, the source material is coded with the source_mat_id, and each sequencing run/fraction is coded with the sample_id.

Of the three qSIP2 input types, the sample data has the most requirements to pass validation. To standardize these requirements, the column names should adhere to MISIP standards as closely as possible. The absolute requirements differ slightly from the official MISIP standards; for the qSIP2 package they are the source_mat_id, sample_id, gradient_position, gradient_pos_density and gradient_pos_rel_amt fields.

`source_mat_id` and `sample_id` columns

One column of the metadata should contain a unique sample_id that is associated with a source_mat_id matching one row in the qsip_source_data dataframe. Although the sample_id should be unique for each row, the source_mat_id is expected to be duplicated, creating a many-to-one relationship. The format of the sample_id doesn’t matter, as long as it is unique. A popular choice might be the source_mat_id with the gradient position appended.
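The naming convention described above can be sketched in base R. The toy dataframe and the `_F` separator are assumptions made for illustration only; any scheme works as long as the resulting sample_id values are unique.

```r
# Toy data for illustration; column names follow the MISIP-style names in the text.
samples <- data.frame(
  source_mat_id = c("S149", "S149", "S149", "S150", "S150"),
  gradient_position = c(1L, 2L, 3L, 1L, 2L)
)

# Build sample_id as "<source_mat_id>_F<gradient_position>" (the "_F" is arbitrary).
samples$sample_id <- paste0(samples$source_mat_id, "_F", samples$gradient_position)

# sample_id must be unique, while source_mat_id is expected to repeat.
ids_unique <- anyDuplicated(samples$sample_id) == 0
sources_repeat <- any(duplicated(samples$source_mat_id))

print(samples)
```

Checking `ids_unique` before building the qSIP2 object is a cheap way to catch duplicated fractions early.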

`gradient_position` and `gradient_pos_density` columns

These are the required columns to describe the position and density values for each sample_id. gradient_position is typically numbered by decreasing density, so the heaviest will be 1, second heaviest will be 2, etc. To pass qSIP2 validation the gradient_position must be a positive integer, but -1 will also be accepted to designate a “bulk” or other “unfractionated” sample (e.g. when doing growth analysis). gradient_position is not used directly in qSIP2 calculations but can be useful for quality control. gradient_pos_density is reported as g/ml with as much decimal place precision as your method allows.
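As a rough illustration of the gradient_position rule (positive integers, or -1 for bulk/unfractionated samples), the check could be written as below. The helper `is_valid_gradient_position()` is hypothetical and is not the validator qSIP2 uses internally; it only mirrors the rule as stated in the text.

```r
# Hypothetical helper, NOT part of qSIP2: TRUE when every value is a
# positive whole number or the special value -1 (bulk / unfractionated).
is_valid_gradient_position <- function(x) {
  is.numeric(x) && !anyNA(x) && all(x == round(x)) && all(x >= 1 | x == -1)
}

is_valid_gradient_position(c(1, 2, 3, -1)) # TRUE
is_valid_gradient_position(c(0, 1, 2))     # FALSE, 0 is not positive
is_valid_gradient_position(c(1.5, 2))      # FALSE, 1.5 is not a whole number
```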

`gradient_pos_amt` and `gradient_pos_rel_amt` columns

A requirement for qSIP is the gradient_pos_rel_amt field, which gives the proportion of the whole that a fraction represents (a value between 0 and 1). The preferred input is qPCR copy numbers per volume of source material, but DNA totals/concentrations can be used as well.

The gradient_pos_rel_amt column is required for MISIP standards, but for qSIP2 we also require an additional gradient_pos_amt column with the “raw” abundance values for each sample. This requirement exists because a qSIP workflow will typically produce the gradient_pos_amt values, and the gradient_pos_rel_amt can be derived from them using the add_gradient_pos_rel_amt() function. This function simply totals the gradient_pos_amt across all samples in a source, and then divides each sample's amount by that total. The math is the same whether using qPCR amplicon copies or total DNA isolated.

For example, if there are 100,000 total 16S copies in a source as determined by qPCR, and 15,000 copies in fraction 7, then the gradient_pos_rel_amt value for fraction 7 would be 0.15 (15,000 / 100,000). Similarly, if you had 25 ng total DNA used for density separation, and fraction 7 had 3.75 ng DNA recovered, then gradient_pos_rel_amt would also be 0.15 (3.75 / 25).
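The same arithmetic can be reproduced in base R. This is a sketch of the calculation described above, not the package's implementation; the toy dataframe collapses each source into "fraction 7" plus "everything else" so the totals match the two examples (100,000 copies and 25 ng).

```r
# Toy data: source "qpcr" is in 16S copies, source "dna" is in ng of DNA.
samples <- data.frame(
  source_mat_id = c("qpcr", "qpcr", "dna", "dna"),
  fraction      = c("F7", "other", "F7", "other"),
  amt           = c(15000, 85000, 3.75, 21.25)
)

# Total amount per source, repeated on every row of that source.
source_total <- ave(samples$amt, samples$source_mat_id, FUN = sum)

# Relative amount = each fraction's amount divided by its source total.
samples$gradient_pos_rel_amt <- samples$amt / source_total

print(samples)
```

Both F7 rows come out at 0.15, which shows that the units cancel and only the within-source proportions matter.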

See below for cases where it might make sense for the totals to not add up to 100%, and how to work with those cases in qSIP2.

An example sample dataframe is included in the qSIP2 package to demonstrate the possible columns.

example_sample_df

First few rows of formatted sample data
sample	source	Fraction	density_g_ml	dna_conc	avg_16S_g_soil
149_F1	S149	1	1.778855	0.0000000	4473.7081
149_F2	S149	2	1.773391	0.0000000	986.6581
149_F3	S149	3	1.765742	0.0000000	4002.7026
149_F4	S149	4	1.759185	0.0000000	3959.7283
149_F5	S149	5	1.752629	0.0012413	5725.7319
149_F6	S149	6	1.746072	0.0128156	7566.2722

qSIP2 Sample Data Object

The qsip_sample_data() constructor builds a sample data object from the dataframe. Like the source data object, the sample data object is built by providing column names to the appropriate parameters.

sample_object <- qsip_sample_data(example_sample_df,
  sample_id = "sample",
  source_mat_id = "source",
  gradient_position = "Fraction",
  gradient_pos_density = "density_g_ml",
  gradient_pos_amt = "avg_16S_g_soil"
)
#> <gradient_pos_rel_amt> not specified. Calculating using avg_16S_g_soil column

Notice our dataframe did not contain the required gradient_pos_rel_amt column, but the function was able to calculate it using the data in the gradient_pos_amt column. See below for reasons why you might want to pre-calculate this value rather than having the function calculate it for you.

Structure of `qsip_sample_data`

Like other qSIP2 objects, the qsip_sample_data object has a few key components, but it is not meant to be inspected directly.

glimpse(sample_object)
#> <qsip_sample_data>
#>  @ data                : tibble [284 × 7] (S3: tbl_df/tbl/data.frame)
#>  $ sample_id           : chr [1:284] "149_F1" "149_F2" "149_F3" "149_F4" ...
#>  $ source_mat_id       : chr [1:284] "S149" "S149" "S149" "S149" ...
#>  $ gradient_position   : int [1:284] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ gradient_pos_density: num [1:284] 1.78 1.77 1.77 1.76 1.75 ...
#>  $ gradient_pos_amt    : num [1:284] 4474 987 4003 3960 5726 ...
#>  $ gradient_pos_rel_amt: num [1:284] 1.28e-04 2.83e-05 1.15e-04 1.14e-04 1.64e-04 ...
#>  $ dna_conc            : num [1:284] 0 0 0 0 0.00124 ...
#>  @ sample_id           : chr "sample"
#>  @ source_mat_id       : chr "source"
#>  @ gradient_position   : chr "Fraction"
#>  @ gradient_pos_density: chr "density_g_ml"
#>  @ gradient_pos_amt    : chr "avg_16S_g_soil"
#>  @ gradient_pos_rel_amt: chr "gradient_pos_rel_amt"

The dataframe can be returned from the object using the get_dataframe() function.

get_dataframe(sample_object)

First few rows of formatted sample data
sample_id	source_mat_id	gradient_position	gradient_pos_density	gradient_pos_amt	gradient_pos_rel_amt	dna_conc
149_F1	S149	1	1.778855	4473.7081	0.0001284	0.0000000
149_F2	S149	2	1.773391	986.6581	0.0000283	0.0000000
149_F3	S149	3	1.765742	4002.7026	0.0001149	0.0000000
149_F4	S149	4	1.759185	3959.7283	0.0001137	0.0000000
149_F5	S149	5	1.752629	5725.7319	0.0001643	0.0012413
149_F6	S149	6	1.746072	7566.2722	0.0002172	0.0128156

Validation of `qsip_sample_data`

A valid qsip_sample_data object will have the required columns, sample_id will contain unique values, and gradient_position will be positive integers or -1.

Additionally, the gradient_pos_density values (the density_g_ml column in this example) should be in a reasonable range (between 1.55 and 1.8).

# modifying density_g_ml to be too low will give an error
example_sample_df |>
  mutate(density_g_ml = density_g_ml / 2) |>
  qsip_sample_data(
    sample_id = "sample",
    source_mat_id = "source",
    gradient_position = "Fraction",
    gradient_pos_density = "density_g_ml",
    gradient_pos_amt = "avg_16S_g_soil"
  )
#> <gradient_pos_rel_amt> not specified. Calculating using avg_16S_g_soil column
#> Error in qSIP2::validate_gradient_pos_density(dplyr::select(self@data, : some gradient_pos_density values are lower than 1.55
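To find the offending rows before building the object, a manual range check along these lines may help. The bounds 1.55 and 1.8 come from the text above; whether qSIP2 treats them as inclusive is an assumption here, and the density values are toy numbers.

```r
# Toy densities in g/ml; the last value mimics an accidentally halved density.
density_g_ml <- c(1.778855, 1.746072, 1.70, 0.889)

# Bounds taken from the text; inclusive comparison is an assumption.
lower <- 1.55
upper <- 1.8

out_of_range <- density_g_ml < lower | density_g_ml > upper

# Row indices that would need attention before calling qsip_sample_data().
which(out_of_range)
```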

When to calculate the `gradient_pos_rel_amt` values?

Typically, the gradient_pos_rel_amt values for a given source_mat_id should add up to 1 (i.e. 100%), but there are situations where the total might be less than 1 (although never greater than 1). For example, you might have removed some fractions because they didn’t sequence well or for some other reason. If those removed fractions were 5% of the total data in that source_mat_id, then you would expect the total of the remaining fractions to be 0.95 rather than 1. Another situation is that although you added 25 ng to a centrifugation, you only recovered 20 ng and the rest was lost².

This matters because if you run the add_gradient_pos_rel_amt() function on your full dataframe, the totals will add up to 1. You can then remove the rows that you don’t want to include in the analysis, and the sum of the remaining gradient_pos_rel_amt values will equal the appropriately adjusted amount. But if you remove the bad rows from the dataframe first, and then let the qsip_sample_data() function create the gradient_pos_rel_amt values, they will be artificially higher than they should be.
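The effect of the ordering can be seen with a small base R sketch. The four amounts are toy numbers for a single source, chosen so that the dropped fraction holds 5% of the total.

```r
# Toy amounts for one source; the last fraction (5% of the total) will be dropped.
amt  <- c(50, 30, 15, 5)
keep <- c(TRUE, TRUE, TRUE, FALSE)

# Option 1: calculate relative amounts on the full data, then drop the bad row.
rel_then_filter <- (amt / sum(amt))[keep]

# Option 2: drop the bad row first, then calculate relative amounts.
filter_then_rel <- amt[keep] / sum(amt[keep])

sum(rel_then_filter) # 0.95, reflects the removed 5%
sum(filter_then_rel) # 1, every remaining fraction is inflated
```

With the second ordering each remaining fraction is scaled up by 1 / 0.95, which is the artificial inflation described above.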

You can use a total abundance (qPCR or DNA concentrations) to calculate the gradient_pos_rel_amt column.

add_gradient_pos_rel_amt(example_sample_df,
  source_mat_id = "source",
  amt = "avg_16S_g_soil"
)

sample data with `gradient_pos_rel_amt` added
sample	source	Fraction	density_g_ml	dna_conc	avg_16S_g_soil	gradient_pos_rel_amt
149_F1	S149	1	1.778855	0.0000000	4473.7081	0.0001284
149_F2	S149	2	1.773391	0.0000000	986.6581	0.0000283
149_F3	S149	3	1.765742	0.0000000	4002.7026	0.0001149
149_F4	S149	4	1.759185	0.0000000	3959.7283	0.0001137
149_F5	S149	5	1.752629	0.0012413	5725.7319	0.0001643
149_F6	S149	6	1.746072	0.0128156	7566.2722	0.0002172

Trying to run this function on a dataframe with an existing gradient_pos_rel_amt column will give an error, but it can be overridden with the overwrite = TRUE argument.

add_gradient_pos_rel_amt(example_sample_df,
  source_mat_id = "source",
  amt = "avg_16S_g_soil"
) |>
  add_gradient_pos_rel_amt(
    source_mat_id = "source",
    amt = "avg_16S_g_soil"
  )
#> Error: gradient_pos_rel_amt already exists! Set overwrite = TRUE if you want to overwrite

# set overwrite = TRUE to override the error, although this is a silly example here
add_gradient_pos_rel_amt(example_sample_df,
  source_mat_id = "source",
  amt = "avg_16S_g_soil"
) |>
  add_gradient_pos_rel_amt(
    source_mat_id = "source",
    amt = "avg_16S_g_soil",
    overwrite = TRUE
  )
