I have two MiSeq runs that have to be pre-processed and demultiplexed in QIIME 1 prior to importing into QIIME 2. These runs have overlapping barcodes. I want to run these two datasets together through clustering with the Open Reference VSearch tool.
What is the best way to combine these runs and still retain the sample IDs? I see two options: 1) Should I CAT the seq.fna file outputs from QIIME 1 and then import this new merged file into QIIME 2?
I run dereplication and chimera filtering in QIIME 2 prior to VSearch, so 2) is there a way to merge these filtered .qza sequence and table files prior to clustering in QIIME 2? This second option lets me filter out PCR errors and sequencing artifacts from each individual run, and it seems to me to be the more accurate way to process my two runs prior to merging. However, I am uncertain how to merge the two sets of .qza files after this point but before the OTU picking step.
Hi @Sara_Jeanne08,
Either of those options should work fine for you. You could cat the sequence files together and then import the resulting file, or you could import the two sequence files, run qiime vsearch dereplicate-sequences twice, and then merge the resulting feature tables with qiime feature-table merge and the resulting sequence files with qiime feature-table merge-seqs. I don’t think there will be any practical difference between these two processes, so I would just recommend going with whichever is easier for you. (The exception to this is if you have some of the same samples showing up in both sequence files, in which case you should import them separately into QIIME 2, dereplicate twice, and then merge.) After this, you’ll be ready to proceed to open-reference clustering.
Thank you for your help. I appreciate you helping me figure out which option is best for accuracy - It is good to know that there is not a difference between the two. I have already dereplicated and removed chimeras from both of these datasets individually, so it seems like merging the feature table and seqs would be my best option for moving forward with the combined analysis.
I do have a question about having overlapping IDs - the --p-overlap-method parameter - how does this work?
My guess, based on your description, is that you want to use the default setting, which is to error on overlapping sample IDs. This setting assumes that you are combining tables which contain some, all, or none of the same features, but do not contain any of the same samples. This makes the merge straightforward, as the individual counts are never modified. However, if some samples show up in more than one table, you'll get an error. If you do have samples and features that show up in more than one table, you can use the sum option, which will sum the counts for sample/feature pairs that show up in more than one table.
You might be wondering why you wouldn't use sum all the time. Using error_on_overlapping_sample is now faster than sum (as of QIIME 2 2018.2), but it's also a good option if you're not expecting samples to show up in more than one table, as it will error if the tables aren't meeting that expectation.
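To make the difference between the two overlap methods concrete, here is a toy sketch in plain Python. It is not QIIME 2's implementation: the function name merge_tables, the nested-dict table layout, and the short method names "error" and "sum" are all made up for illustration, with "error" standing in for the default behaviour described above.

```python
def merge_tables(tables, overlap_method="error"):
    """Toy merge of feature tables stored as {sample_id: {feature_id: count}}.

    Hypothetical helper for illustration only; not the QIIME 2 code.
    "error" raises if a sample ID appears in more than one table.
    "sum" adds the counts for sample/feature pairs seen in more than one table.
    """
    if overlap_method not in ("error", "sum"):
        raise ValueError(f"Unknown overlap method: {overlap_method!r}")

    merged = {}
    for table in tables:
        for sample_id, counts in table.items():
            if overlap_method == "error" and sample_id in merged:
                raise ValueError(
                    f"Sample {sample_id!r} appears in more than one table"
                )
            # Copy counts into the merged table so the inputs are never modified.
            merged_counts = merged.setdefault(sample_id, {})
            for feature_id, count in counts.items():
                merged_counts[feature_id] = merged_counts.get(feature_id, 0) + count
    return merged


# Two runs with different samples: counts are carried over unchanged.
run1 = {"S1": {"featA": 3, "featB": 1}}
run2 = {"S2": {"featA": 2}}
print(merge_tables([run1, run2]))

# A third table that reuses sample ID S1: only "sum" will accept it.
rerun = {"S1": {"featA": 4, "featC": 5}}
print(merge_tables([run1, rerun], overlap_method="sum"))
```

With the "error" setting, the second call would raise instead of returning a table, which is the safety check described above.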
I do have a few samples with the same IDs, so I tried to use the sum option for the --p-overlap-method parameter. Unfortunately, I am getting errors trying to merge my table and sequences. Below is the command I passed and the output. I have tried both a space separating the two file paths and a comma, neither worked: