I have a few questions which are interrelated and partially addressed in several different threads, but since commenting on those is closed I am asking them here. Thanks for your help and patience.
I have multiple (paired-end 250 bp) MiSeq runs. Each run contains samples from multiple projects. Sometimes we repeat the entire experiment for a project and sequence it in a separate MiSeq run. Eventually we do a meta-analysis on samples from the same project (coming from repeated experiments sequenced in separate MiSeq runs).
Do we need to use the manifest format for importing data? The Casava 1.8 paired-end format seems much more user-friendly. Am I missing some obvious benefits/needs of the manifest format?
Let’s talk about one MiSeq run containing samples from multiple projects. Is it OK to analyze them together (say, using dada2), get the feature table and taxonomic assignments, then split the feature table per project, import each project-specific feature table (or biom) and continue with diversity analyses?
While I plan to keep the read trim length fixed across projects and runs (as much as possible), considering I have paired-end data, is the read trim length a major issue?
Now for the meta-analysis of samples from different MiSeq runs. I plan to analyze each run separately, but Deblur suggests analyzing them together to choose the same read trim length. Based on your answers to 2) and 3), it may not matter much whether we analyze per run [paired-end data + standard read trim length] or together [analyze together -> split table per run -> import and continue].
If we keep using the same primers and sequencing read length, is it possible to train the classifier and keep using it for several projects?
Use whichever import format fits your data. If your reads are in Casava 1.8 format, use that format (check the file-name pattern your files need to match). The manifest format is needed when your reads are already demultiplexed but the file names do not follow the Casava convention.
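To make the two routes concrete, here is a minimal sketch of both imports. Directory and file names (`casava-reads-dir`, `manifest.tsv`, `demux.qza`) are placeholders, not anything from your setup:

```shell
# Casava 1.8 route: all fastq.gz files live in one directory and must
# follow the Casava naming pattern (e.g. SampleID_S1_L001_R1_001.fastq.gz).
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path casava-reads-dir \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux.qza

# Manifest route: a TSV with columns sample-id, forward-absolute-filepath,
# reverse-absolute-filepath -- needed when file names are not Casava-style.
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format PairedEndFastqManifestPhred33V2 \
  --output-path demux.qza
```

Both produce the same kind of `demux.qza` artifact, so everything downstream is identical regardless of which import route you take.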
Yes, dada2 just needs to be run on each sequencing run separately, but it does not matter whether the samples inside are related… it’s all about the error rates, not about sample similarity (but maybe there are exceptions to this, I am not sure; cc @benjjneb).
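A sketch of that workflow for one run: denoise everything in the run together, then split the table per project. The metadata column name `project` and the value `ProjectA` are assumptions for illustration; truncation lengths are examples you would pick from your quality plots:

```shell
# Denoise one run as a whole -- the error model is learned per run,
# so unrelated samples in the same run are fine.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 200 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats.qza

# Split the feature table per project via a sample-metadata column
# (assumes a "project" column in sample-metadata.tsv).
qiime feature-table filter-samples \
  --i-table table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-where "[project]='ProjectA'" \
  --o-filtered-table table-projectA.qza
```

Since the split happens on the finished feature table, there is no need to round-trip through biom unless you want to move the table outside QIIME 2.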
If they are all the same amplicon, then no, it is not a major issue. Theoretically, though, trim length might alter merge quality or bias the merges (e.g., drop longer amplicons), introducing subtle (or not so subtle) biases. Different runs may simply need different trim lengths; sometimes that is unavoidable. Keep it all within reason and examine the denoising stats carefully to make sure you get similar merge success rates across runs.
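Two commands are useful here: one to eyeball the per-run denoising stats (compare the merged and non-chimeric percentages across runs), and one to merge per-run outputs for the meta-analysis. File names are placeholders:

```shell
# Inspect denoising stats for one run; repeat per run and compare
# the percentage of reads surviving merging across runs.
qiime metadata tabulate \
  --m-input-file stats-run1.qza \
  --o-visualization stats-run1.qzv

# Merge the per-run feature tables and representative sequences
# into one artifact for the cross-run meta-analysis.
qiime feature-table merge \
  --i-tables table-run1.qza \
  --i-tables table-run2.qza \
  --o-merged-table table-merged.qza

qiime feature-table merge-seqs \
  --i-data rep-seqs-run1.qza \
  --i-data rep-seqs-run2.qza \
  --o-merged-data rep-seqs-merged.qza
```

This is what makes the per-run strategy workable: denoise each run on its own, then merge tables and rep-seqs before diversity analysis.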
Deblur requires reads to be joined upstream (unlike dada2). So either merge runs upstream or make sure you pick identical trim lengths!
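A minimal sketch of that Deblur path, assuming paired reads have already been imported as `demux.qza` (the join command is `join-pairs` in the `vsearch` plugin; newer QIIME 2 releases rename it `merge-pairs`):

```shell
# Join read pairs first -- Deblur operates on already-joined reads.
qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux.qza \
  --o-joined-sequences demux-joined.qza

# Deblur with ONE fixed trim length for everything analyzed together.
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined.qza \
  --p-trim-length 250 \
  --p-sample-stats \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-stats deblur-stats.qza
```

The key constraint is `--p-trim-length`: if runs are deblurred separately, that value must be identical across runs or the resulting features will not be comparable.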
Yes! Absolutely. That is the point of pre-training the classifiers — it is time-consuming to do on the fly, and these classifiers can be re-used!
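Train once per primer pair + read length combination, then reuse the saved classifier artifact everywhere. A sketch, with placeholder file names for the extracted reference reads and taxonomy:

```shell
# One-time training on reference reads already extracted for your primers.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-extracted.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

# Reuse the same classifier.qza across projects and runs.
qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```

The only caveat is consistency: retrain if you change primers, target region, or the reference database version.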