Merging data from different runs/ plates in manifest step or after

Steve · March 25, 2021, 9:40pm

Good day,

I had samples processed over 2 MiSeq runs, where all the amplicons (515F-806R) were pooled into one library and sequenced twice. I intend on merging the data sets for analysis.

Now, I have read through the two threads:

But I still have questions:

So far, my methodology has included creating a manifest to join all the paired-end reads.
Because each sample was sequenced on two runs, I gave each run's copy of a sample its own unique sample ID.
For example, "Site 1" from the first sequencing run has its Forward and Reverse reads, and the same sample from the second sequencing run uses "Site 1b" as its sample ID, with its own Forward and Reverse reads.

Since I am using only one manifest, all denoised/deblurred samples will be compiled into one output of rep-seq.qza and table.qza files once completed.

Is there a way of combining "Site 1" and "Site 1b" at this time? If so what commands are suggested?

I noticed that the "Fecal microbiota transplant" tutorial has the following code:

qiime feature-table merge \
  --i-tables table-1.qza \
  --i-tables table-2.qza \
  --o-merged-table table.qza
qiime feature-table merge-seqs \
  --i-data rep-seqs-1.qza \
  --i-data rep-seqs-2.qza \
  --o-merged-data rep-seqs.qza

However, my interpretation of its usage from the two mentioned forum discussions above would be:

You have to make two manifests, each containing only the reads from one plate, with each copy of "Site 1" labelled identically.
(both called "Site 1" in their respective manifests, and not "Site 1" and "Site 1b" in a single manifest)

You then run the paired-end joining as well as the denoising/deblurring steps SEPARATELY, once using the manifest of one run and once using the manifest of the other.

THEN, you can combine the table outputs with the above command, and continue with your analysis.

Is this interpretation correct? Or is there a way to combine two samples with unique IDs after they have been placed in the same table/rep-seqs files using one manifest?

Thank you kindly,
I greatly appreciate the use of your software, as well as your well curated tutorials and forums.
-Steven

llenzi · March 26, 2021, 9:57am

Hi @Steve,
welcome to the forum!

To answer your question: you are right that the usual practice is to denoise each run separately and then merge the data, as you have seen in the tutorial. The reason is that different runs may apply slightly different biases to the sequences, and denoising them separately should compensate for these batch effects (as long as you keep the same denoising settings). (Note: batch effects may also arise if you use different reagent kits/lots for extraction, PCR and so on, but I will assume you are using the same of everything for the sake of simplicity.)

In your case, the same library pool was run twice, so I would say that all biases are applied equally to all the samples. I would check this hypothesis by treating the two runs as technical replicates and performing a quick beta-diversity analysis. If there are no evident biases between the two, you can then merge the samples as in your initial point by using:

https://docs.qiime2.org/2021.2/plugins/available/feature-table/group/

For this, you will need a metadata file with a column that maps each sample to its final (merged) sample, e.g.:
sampleid Group
Sample_1 Sample_1
Sample_1b Sample_1
Sample_2 Sample_2
Sample_2b Sample_2

The plugin will produce a new table in which the samples are Sample_1 and Sample_2, as in my example. You will then need to create a new metadata file that uses Sample_1 and Sample_2 as identifiers.
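A minimal sketch of the grouping step described above, assuming the feature table is called table.qza and the sample IDs match the example. The file names grouping-metadata.tsv and grouped-table.qza are placeholders chosen for illustration, and `--p-mode sum` is assumed to be the desired behaviour (adding the replicate counts together):

```shell
# Write the example mapping as a tab-separated metadata file.
# File names are placeholders, not names used elsewhere in this thread.
printf 'sampleid\tGroup\n'        >  grouping-metadata.tsv
printf 'Sample_1\tSample_1\n'     >> grouping-metadata.tsv
printf 'Sample_1b\tSample_1\n'    >> grouping-metadata.tsv
printf 'Sample_2\tSample_2\n'     >> grouping-metadata.tsv
printf 'Sample_2b\tSample_2\n'    >> grouping-metadata.tsv

# Sum the counts of all samples that share a value in the Group column.
# Guarded so the sketch is a no-op if QIIME 2 or the input table is missing.
if command -v qiime >/dev/null 2>&1 && [ -f table.qza ]; then
    qiime feature-table group \
        --i-table table.qza \
        --p-axis sample \
        --m-metadata-file grouping-metadata.tsv \
        --m-metadata-column Group \
        --p-mode sum \
        --o-grouped-table grouped-table.qza
fi
```

The rep-seqs.qza file is not an input here; only the sample axis of the table changes.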

Hope it helps,
Luca

Steve · April 6, 2021, 7:25pm

Good day,

In your example of merging samples, I believe the document cited can only merge tables, not rep-seqs.

How can I merge rep-seqs with your method?

SoilRotifer · April 7, 2021, 1:51pm

If I understand your question correctly, there is no need to further combine the rep-seqs.qza by group. The presence/absence of the sequences (features) will not change due to merging the samples in the table by group. All that matters is that the same feature IDs appear in both the table.qza and the rep-seqs.qza.

Steve · April 7, 2021, 3:53pm

Good morning,

I guess I should rephrase to better explain what I am trying to do.

My samples were sequenced twice (over two identical MiSeq runs) in order to achieve higher sequencing depth per sample. My goal is to combine the raw sequences as early as possible in the Qiime2 workflow, and not to treat the two runs as two different samples throughout.

I do not want the separation of the raw files to affect the trimming, binning, or other processes.

I have attempted the method mentioned in the "Fecal microbiota transplant" tutorial as described in my original post, but was unsuccessful as I received errors when trying to merge two tables with samples that have the same names.

The method mentioned by llenzi will work for merging tables, but not rep-seqs as far as I am aware.

The reason rep-seqs would need to be merged, would be to perform the following steps on the merged samples:

Denovo Clustering
Closed Reference Clustering
Open Reference Clustering
Chimera checking
BorderLine Chimera Checking
Barplots
etc.

Each of these steps requires a single rep-seqs.qza and a single table.qza as input (later steps require modified versions of the previous rep-seqs files). Or are you saying these steps can take several rep-seqs files as input, while using a single combined table.qza file?

Or am I to understand that there is only a system in place to combine samples at a later stage and not earlier?

At the end of the day, I need these samples to be combined for the final output of .biom files, stacked bar charts, alpha, and beta diversities etc.

It is my understanding that the sooner in the process the samples are combined, the better the trimming and binning will be. But maybe you have a better interpretation than I.

Thank you kindly for the help, and I apologize if I am not stating my query clearly. If it would help, I'd be happy to discuss with someone over the phone; please DM me if that is preferred.

Sincerely,

-Steven

SoilRotifer · April 7, 2021, 4:11pm

Not a problem. I think you've been quite clear.

Using the approach llenzi mentioned will work. Again, all that is occurring here is the summation of sequence (i.e. feature) counts in the table by merging the separate samples. The presence of the features in the sequence file will not change. You can continue with the same rep-seqs.qza file as before. The rep-seqs file is agnostic to sample information; it is simply a file that contains sequences labeled by their feature ID, which will match those in the table. Does this make sense?

As for:

Not sure where this came from, but this won't be a problem. What you've described is quite common (i.e. sequencing several replicates of a given biological sample to obtain more data). As pointed out earlier, if you would like to use DADA2, you must denoise on a per-run basis due to differences in per-run error profiles, then merge these data as you've noted (regardless of whether the same sample was resequenced on different runs). Alternatively, you can simply run deblur on all your data simultaneously. Simply merging your replicate samples at a later stage is fine.
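The per-run DADA2 route can be sketched roughly as follows. This assumes one manifest per run (run1-manifest.tsv and run2-manifest.tsv, hypothetical names) in the PairedEndFastqManifestPhred33V2 format with identical sample IDs in both, and the truncation lengths are placeholders to be chosen from your own quality plots. Because the default overlap method of `feature-table merge` raises an error when sample IDs overlap, the sketch passes `--p-overlap-method sum` so that counts for the same sample are added across runs:

```shell
# Placeholder truncation lengths; pick real values from your quality plots.
TRUNC_F=240
TRUNC_R=200

# Guarded so the sketch is a no-op without QIIME 2 or the (hypothetical) manifests.
if command -v qiime >/dev/null 2>&1 \
   && [ -f run1-manifest.tsv ] && [ -f run2-manifest.tsv ]; then
    for run in run1 run2; do
        qiime tools import \
            --type 'SampleData[PairedEndSequencesWithQuality]' \
            --input-path "${run}-manifest.tsv" \
            --input-format PairedEndFastqManifestPhred33V2 \
            --output-path "${run}-demux.qza"

        # Use the same settings for every run so the outputs are comparable.
        qiime dada2 denoise-paired \
            --i-demultiplexed-seqs "${run}-demux.qza" \
            --p-trunc-len-f "$TRUNC_F" \
            --p-trunc-len-r "$TRUNC_R" \
            --o-table "${run}-table.qza" \
            --o-representative-sequences "${run}-rep-seqs.qza" \
            --o-denoising-stats "${run}-stats.qza"
    done

    # Identical sample IDs overlap between the two tables; 'sum' adds their counts.
    qiime feature-table merge \
        --i-tables run1-table.qza \
        --i-tables run2-table.qza \
        --p-overlap-method sum \
        --o-merged-table table.qza

    qiime feature-table merge-seqs \
        --i-data run1-rep-seqs.qza \
        --i-data run2-rep-seqs.qza \
        --o-merged-data rep-seqs.qza
fi
```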

llenzi · April 8, 2021, 8:42am

Hi @Steve,

I agree with @SoilRotifer. However, if you want to try merging the sequences, you can use merge-seqs ("Combine collections of feature sequences", QIIME 2 2021.2.0 documentation).

It should return a non-redundant set of sequences, including sequences from both runs (but keep in mind that sequences present in one run and not the other should be very few in number, and are probably mostly due to contaminants).

As for the rest of the analysis, DADA2 already performs chimera checking, so a separate chimera-checking step is usually not needed!
Also, DADA2's ASVs are already 'clustered' (you may see them as clusters at 100% similarity produced by the denoising step), so you may choose to perform further clustering, but it is not strictly necessary! After DADA2, you are good to go on to taxonomy assignment (after merging, in your case), barplots, and so on.

As @SoilRotifer pointed out, denoising with deblur is another good option because it can deal with sequences from different runs, so you can denoise everything together. However, you will still need to use different sample IDs (sample1a, sample1b and so on), and then perform the grouping by using:
https://docs.qiime2.org/2021.2/plugins/available/feature-table/group/

For this, you will need a metadata file with a column associating sample1a and sample1b with sample1, and so on!
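If the replicate IDs follow a regular pattern, the mapping column can be generated rather than typed by hand. A small sketch, assuming every ID ends in a single run letter (a or b) as in sample1a/sample1b; the ID list and the file name replicate-groups.tsv are hypothetical:

```shell
# Hypothetical sample IDs following the sample1a / sample1b pattern.
# In practice the list would come from the first column of your manifest.
sample_ids="sample1a sample1b sample2a sample2b"

# Build a two-column metadata file: the original ID, then the same ID
# with the trailing run letter (a or b) removed.
printf 'sampleid\tGroup\n' > replicate-groups.tsv
for id in $sample_ids; do
    group=$(printf '%s' "$id" | sed 's/[ab]$//')
    printf '%s\t%s\n' "$id" "$group" >> replicate-groups.tsv
done
```

The resulting file can then be passed to the group action through `--m-metadata-file replicate-groups.tsv` and `--m-metadata-column Group`. IDs that do not follow the assumed pattern would need a different rule.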

Hope it helps
Luca
