Skip to contents

Background

In qSIP2, you can run the standard EAF workflow with multiple qsip_data objects at the same time. This vignette will show you different ways to create and use multiple qsip_data objects in the same workflow.

Multiple Object Structure

The format for storing multiple qsip_data objects is a simple named list, where the names are a descriptive but succinct name for that comparison, and the value is the qsip_data object. The list can be created in a few different ways, including “by hand”, by formatting the show_comparison_groups() output, or providing a dataframe to the dedicated run_comparison_groups() function. Each of these are detailed below.

Building the multi qsip_data object “by hand”

In the standard vignette("qSIP_workflow") vignette we filtered the example object to create “normal” and “drought” objects, and then proceeded through the workflow with these two objects independently. However, these two objects can instead be combined using list() to work with both simultaneously. We can run a validation function that will return TRUE if the list is correct.

qsip_normal <- run_feature_filter(example_qsip_object,
  unlabeled_source_mat_ids = get_all_by_isotope(example_qsip_object, "12C"),
  labeled_source_mat_ids = c("S178", "S179", "S180"),
  min_unlabeled_sources = 6, min_labeled_sources = 3,
  min_unlabeled_fractions = 5, min_labeled_fractions = 5,
  quiet = TRUE
) |>
  run_resampling(with_seed = 49, quiet = TRUE, progress = FALSE) |>
  run_EAF_calculations()

qsip_drought <- run_feature_filter(example_qsip_object,
  unlabeled_source_mat_ids = get_all_by_isotope(example_qsip_object, "12C"),
  labeled_source_mat_ids = c("S200", "S201", "S202", "S203"),
  min_unlabeled_sources = 6, min_labeled_sources = 3,
  min_unlabeled_fractions = 5, min_labeled_fractions = 5,
  quiet = TRUE
) |>
  run_resampling(with_seed = 50, quiet = TRUE, progress = FALSE) |>
  run_EAF_calculations()

qsip_list_1 = list("Normal" = qsip_normal, 
                          "Drought" = qsip_drought)

is_qsip_data_list(qsip_list_1)
#> [1] TRUE

The list objects now holds both qsip_data objects, but each can be accessed by name (e.g. qsip_list_1$Normal) if needed. When building by hand, the main benefit of combining multiple objects together is that you get shared reporting with summary functions like summarize_EAF_values() and plot_EAF_values().

#> Confidence level = 0.9
First few rows from summarize_EAF_values()
group feature_id observed_EAF mean_resampled_EAF lower upper labeled_resamples unlabeled_resamples labeled_sources unlabeled_sources
Normal ASV_1 -0.0153107 -0.0162090 -0.0524399 0.0239385 1000 1000 3 8
Drought ASV_1 -0.0333856 -0.0322146 -0.0792106 0.0219181 1000 1000 4 8
Normal ASV_10 0.1126260 0.1116218 0.0823812 0.1405383 1000 1000 3 8
Drought ASV_10 0.0543136 0.0542462 0.0305884 0.0788885 1000 1000 4 8
Drought ASV_100 -0.0892684 -0.0885048 -0.1485629 -0.0283228 1000 1000 3 7
Normal ASV_102 -0.0080214 -0.0090271 -0.0746514 0.0546485 1000 1000 3 8

In the table above, you can see that the feature_id column is repeated for each group.

plot_EAF_values(qsip_list_1,
                top = 50,
                error = "ribbon")
#> Confidence level = 0.9

Other helper functions can be used on the list as well.

n_resamples(qsip_list_1)
#> # A tibble: 2 × 2
#>   group   n_resamples
#>   <chr>         <dbl>
#> 1 Normal         1000
#> 2 Drought        1000
resample_seed(qsip_list_1)
#> # A tibble: 2 × 2
#>   group    seed
#>   <chr>   <dbl>
#> 1 Normal     49
#> 2 Drought    50

Dataframe-based multi qsip_data object

Although building the list by hand can be useful in organizing the output, the functionality of the list structure really shines when you start with a dataframe of comparisons you want to make. This workflow will do all of the steps from the initial, unfiltered qsip_data objects without the need to manually build each comparison object.

Dataframe Structure

The dataframe has a simple structure, and requires at a minimum three columns. The group column is the name of the comparison group, and the unlabeled and labeled columns are vectors of the source_material_ids to use in the comparison. Each row of the dataframe will be a separate comparison group.

Modifying show_comparison_groups() output

Recall that show_comparison_groups() attempts to guess the relevant comparison groups using the metadata. For the example qsip object, when grouping by “Moisture” we get

show_comparison_groups(example_qsip_object, group = "Moisture")
#> # A tibble: 2 × 3
#>   Moisture `12C`                  `13C`                 
#>   <chr>    <chr>                  <chr>                 
#> 1 Normal   S149, S150, S151, S152 S178, S179, S180      
#> 2 Drought  S161, S162, S163, S164 S200, S201, S202, S203

The output is already in our desired format, and all that is needed is some slight tweaking of the column names using dplyr::select() (or dplyr::rename()).

show_comparison_groups(example_qsip_object, group = "Moisture") |> 
  dplyr::select("group" = Moisture, "unlabeled" = "12C", "labeled" = "13C")
#> # A tibble: 2 × 3
#>   group   unlabeled              labeled               
#>   <chr>   <chr>                  <chr>                 
#> 1 Normal  S149, S150, S151, S152 S178, S179, S180      
#> 2 Drought S161, S162, S163, S164 S200, S201, S202, S203

This dataframe can be passed directly to run_comparison_groups() to create the multi qsip_data object. run_comparison_groups() requires the dataframe and a qsip_data object as the first two arguments. It has three additional optional arguments that globally apply to all comparisons: allow_failures (boolean), seed (integer), and resamples (integer).

qsip_list_2 = show_comparison_groups(example_qsip_object, group = "Moisture") |> 
  dplyr::select("group" = Moisture, "unlabeled" = "12C", "labeled" = "13C") |>
  run_comparison_groups(example_qsip_object, 
                        seed = 99,
                        allow_failures = TRUE)
#> Finished groups ■■■■■■■■■■■■■■■■                  50%
#> Finished groups ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100%
#> 

is_qsip_data_list(qsip_list_2)
#> [1] TRUE

Plotting the values here should give almost identical results (within sampling error) to the previous plot, but this time we have enabled the allow_failures option so we get resampling success as well.

plot_EAF_values(qsip_list_2,
                top = 50,
                success_ratio = 0.99,
                error = "ribbon")
#> Confidence level = 0.9

Custom dataframe for ultimate control

It is possible to make a dataframe from scratch (e.g. as an excel spreadsheet) to give even more fine-grain control over each comparison. Here, rather than setting parameters that will be treated as identical for all comparisons (like seed = 99 above), here additional columns can be used to set the values independently per row. An example dataframe is included in the qSIP2 package called example_group_dataframe. This dataframe is shown below, but with some columns temporarily removed for brevity (those columns will be discussed later).

example_group_dataframe
Example dataframe with some columns removed
group unlabeled labeled allow_failures resamples seed
Normal S149, S150, S151, S152 S178, S179, S180 T 500 100
Drought S161, S162, S163, S164 S200, S201, S202, S203 T 1000 101
Drought against all 12C S200, S201, S202, S203 T 1000 102
Normal_S178 S149, S150, S151, S152 S178 T 1000 103
Normal_S179 S149, S150, S151, S152 S179 T 1000 104
Normal_S180 S149, S150, S151, S152 S180 T 1000 105

This dataframe shows the customizability of the format, including retaining the ability to use isotope terms like “12C” to grab all source material IDs with that isotope.

qsip_list_3 = example_group_dataframe |>
  run_comparison_groups(example_qsip_object)
#> Finished groups ■■■■■■                            17%
#> Finished groups ■■■■■■■■■■■                       33%
#> Finished groups ■■■■■■■■■■■■■■■■                  50%
#> Finished groups ■■■■■■■■■■■■■■■■■■■■■             67%
#> Finished groups ■■■■■■■■■■■■■■■■■■■■■■■■■■        83%
#> Finished groups ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100%
#> 

resample_seed(qsip_list_3)
#> # A tibble: 6 × 2
#>   group                seed
#>   <chr>               <dbl>
#> 1 Drought               101
#> 2 Drought against all   102
#> 3 Normal                100
#> 4 Normal_S178           103
#> 5 Normal_S179           104
#> 6 Normal_S180           105

If you use the additional arguments in run_comparison_groups() then they will override the values in the dataframe.

qsip_list_4 = example_group_dataframe |>
  run_comparison_groups(example_qsip_object,
                        seed = 42)
#> Finished groups ■■■■■■                            17%
#> Finished groups ■■■■■■■■■■■                       33%
#> Finished groups ■■■■■■■■■■■■■■■■                  50%
#> Finished groups ■■■■■■■■■■■■■■■■■■■■■             67%
#> Finished groups ■■■■■■■■■■■■■■■■■■■■■■■■■■        83%
#> Finished groups ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■  100%
#> 
resample_seed(qsip_list_4)
#> # A tibble: 6 × 2
#>   group                seed
#>   <chr>               <dbl>
#> 1 Drought                42
#> 2 Drought against all    42
#> 3 Normal                 42
#> 4 Normal_S178            42
#> 5 Normal_S179            42
#> 6 Normal_S180            42

The are 4 additional dataframe columns that run_comparison_groups() will accept: min_unlabeled_sources, min_labeled_sources, min_unlabeled_fractions, and min_labeled_fractions. These columns are used to set the minimum number of sources and fractions required for each comparison. The full example dataframe does additionally contain these columns.

summarize_EAF_values(qsip_list_3) |>
  filter(feature_id == "ASV_1")
#> Confidence level = 0.9
#> # A tibble: 6 × 10
#>   group              feature_id observed_EAF mean_resampled_EAF   lower    upper
#>   <chr>              <chr>             <dbl>              <dbl>   <dbl>    <dbl>
#> 1 Drought            ASV_1         -0.0492             -0.0489  -0.108   0.0113 
#> 2 Drought against a… ASV_1         -0.0334             -0.0328  -0.0798  0.0195 
#> 3 Normal             ASV_1          0.000455            0.00138 -0.0326  0.0357 
#> 4 Normal_S178        ASV_1          0.0410              0.0421   0.0214  0.0606 
#> 5 Normal_S179        ASV_1         -0.0256             -0.0251  -0.0454 -0.00603
#> 6 Normal_S180        ASV_1         -0.0140             -0.0141  -0.0378  0.00564
#> # ℹ 4 more variables: labeled_resamples <int>, unlabeled_resamples <int>,
#> #   labeled_sources <int>, unlabeled_sources <int>

But, keep in mind that each row of the dataframe will be filtered differently, and therefore some feature_ids could be missing from certain comparisons. For example, ASV_311 only appears in the two “Drought” comparisons.

summarize_EAF_values(qsip_list_3) |>
  filter(feature_id == "ASV_311")
#> Confidence level = 0.9
#> # A tibble: 2 × 10
#>   group feature_id observed_EAF mean_resampled_EAF lower upper labeled_resamples
#>   <chr> <chr>             <dbl>              <dbl> <dbl> <dbl>             <int>
#> 1 Drou… ASV_311           0.906              0.903 0.650 1.14               1000
#> 2 Drou… ASV_311           0.693              0.693 0.535 0.867              1000
#> # ℹ 3 more variables: unlabeled_resamples <int>, labeled_sources <int>,
#> #   unlabeled_sources <int>

And although you can plot all together, more than a few objects might make the plots harder to read. But, because it is a simple list you can always still access the individual objects if needed using the $ operator (e.g. qsip_list_3$Drought) or [] brackets (e.g. qsip_list_3[c("Drought", "Normal")]).

plot_EAF_values(qsip_list_3[c("Drought", "Normal")],
                error = "ribbon",
                top = 50)
#> Confidence level = 0.9