library(dplyr)
library(tibble)
library(tidyr)
library(qSIP2)
packageVersion("qSIP2")
#> [1] '0.17.8'
Feature Counts and Metadata
A feature table is a required file for the qSIP2
pipeline. It is a typical ASV/OTU table where individual taxa are in
rows, and sample names are in columns. The table is populated with raw
sequencing counts from an amplicon workflow, or some other proxy for
abundance (like mean/median depth of coverage) if working with MAGs,
contigs or other data types.
A “feature” refers to the names of your individual sequenced units
(amplicons, taxa, vOTUs, MAGs, etc.) The feature data should be in a
dataframe with a column designated as the feature_id
. If
you have a dataframe with rownames
you can convert that to
a column using the tibble::rownames_to_column()
function.
df_with_rownames <- data.frame(
row.names = c("feature1", "feature2", "feature3"),
sample1 = c(1, 2, 3),
sample2 = c(4, 5, 6)
)
# data has rownames
rownames(df_with_rownames)
#> [1] "feature1" "feature2" "feature3"
df_with_rownames
#> sample1 sample2
#> feature1 1 4
#> feature2 2 5
#> feature3 3 6
# convert rownames to their own column
df_with_rownames |>
tibble::rownames_to_column(var = "feature_id")
#> feature_id sample1 sample2
#> 1 feature1 1 4
#> 2 feature2 2 5
#> 3 feature3 3 6
Each row corresponds to a feature_id
, and the abundance
of that feature in a certain sample lives in a column with that
sample_id
as the column header. The abundance values
themselves can be one of several types, and the values are subject to
different validation requirements based on the type
of the
data. Currently, the accepted types are counts
,
coverage
, normalized
, and
relative
.
-
counts
is the default and what should be used in most cases where you are giving raw sequencing counts. Here, the data is expected to be integers equal to or greater than 0. -
coverage
is designed for use with MAGs or other data types where you are using a proxy for abundance like mean/median depth of coverage. Here, the data is expected to be numeric values equal to or greater than 0. -
relative
is for situations where you might have lost the original integer count data and only have relative abundance. If using this option it expects fractional abundances (rather than percentages) so each column must sum to 1 or less. -
normalized
is a special case where you have already pre-normalized your counts using an internal spike-in or similar.
qSIP2 Feature Data Object
The qsip_feature_data()
function creates the
qsip_feature_data
object. A qSIP_feature_data
object holds validated abundance data for your features. It can be made
by giving an already made dataframe, or by modifying a dataframe and
piping directly into the function. There is an example dataframe in the
qSIP2
package called example_feature_data
.
feature_data = qsip_feature_data(example_feature_df,
feature_id = "ASV",
type = "counts")
Structure of qsip_feature_data
Like the other qSIP2
objects, the
qsip_feature_data
object contains a @data
slot
to hold the feature table, but it isn’t intended to be worked with
directly. The @type
slot holds the type of data, and the
@feature_id
slot holds the name of the column with the
feature ids. There is an additional slot for the taxonomy data, if you
have it (see below).
You can return the original dataframe with the
get_dataframe()
method.
# not run
get_dataframe(feature_data)
Validation of qsip_feature_data
Most of the validation checks depend on the chosen type
.
If you try to pass values that don’t match the type you specified, you
will get an error. For example, fractional values are not allowed when
the type is the default counts
.
tibble(
feature_id = c("feature1", "feature2", "feature3"),
sample1 = c(0.1, 0.2, 0.3),
sample2 = c(0.4, 0.5, 0.6)
) |>
qsip_feature_data()
#> Error: Some data are not integers
But it is allowed with type = "coverage"
.
tibble(
feature_id = c("feature1", "feature2", "feature3"),
sample1 = c(0.1, 0.2, 0.3),
sample2 = c(0.4, 0.5, 0.6)
) |>
qsip_feature_data(type = "coverage")
#> <qSIP2::qsip_feature_data>
#> feature_id count: 3
#> sample_id count: 2
#> data type: coverage
In theory, the slots can be overwritten, but this is not recommended. If you do, they will undergo the same validations and may fail.
feature_data@type <- "relative"
#> Error: Some columns have a total relative abundance sum greater than 1
Special Considerations
NA Values
If you have NA
values in your abundance table, you will
get an error when trying to make the feature object. In most cases a
value of NA
means an abundance of 0, and so the best
practice would be to convert prior to creating the object with a
mutate
call and the across
function from the
tidyr
package.
tibble(
feature_id = c("feature1", "feature2", "feature3"),
sample1 = c(1, 2, NA),
sample2 = c(4, 5, 6)
) |>
mutate(across(everything(), ~ replace_na(.x, 0))) |>
qsip_feature_data()
#> <qSIP2::qsip_feature_data>
#> feature_id count: 3
#> sample_id count: 2
#> data type: counts
Taxonomy or other metadata
If you have further metadata for your features, such as a taxonomy
table, you can add it with the add_taxonomy()
function and
it will live in the @taxonomy
slot.
taxonomy <- tibble(
feature_id = c("feature1", "feature2", "feature3"),
genus = c("Marinobacter", "Devosia", "Pseudomonas"),
species = c("adhaerens", "insulae", "syringae")
)
tibble(
feature_id = c("feature1", "feature2", "feature3"),
sample1 = c(1, 2, 3),
sample2 = c(4, 5, 6)
) |>
qsip_feature_data() |>
add_taxonomy(taxonomy, feature_id = "feature_id")
#> <qSIP2::qsip_feature_data>
#> feature_id count: 3
#> sample_id count: 2
#> data type: counts