Apologies for the long post, I am hoping that providing a lot of detail will make my question easier to understand …
Our lab has been using QIIME 1 for 16S amplicon analysis for a few years and I am testing out some analyses in QIIME 2 with a view to the whole group making the transition to QIIME 2 within the next year.
We work with 16S V4 region amplicons and use paired-end reads from an Illumina MiSeq.
In our QIIME 1 pipeline, we did the following:
converted the bcl files from the sequencer into fastq using bcl2fastq
concatenated R1, index and R3 files
merged reads using BBmerge
demultiplexed samples and performed quality control in QIIME 1 using split_libraries.py
performed open- and closed-reference OTU picking in QIIME 1
Our group has been keen to try out Deblur as an alternative to OTU picking, so that is what I have been working on. I learned from this forum that the deblur plugin only works with single-end sequences, so I decided to test out using R1 only vs. using paired-end reads merged with BBmerge and then imported into QIIME2 as ‘single-end’ sequences (I used manifest fastq format to import the reads into QIIME2 in both cases).
I tested this out on a subset of 10 samples and have found that the taxonomic profiles differ when using unmerged and merged reads – in particular, more taxa are identified when using R1 only.
Next I wanted to know which ‘additional’ taxa are being identified when using R1 compared to merged reads (or another way of looking at it is which taxa are lost when using the merged reads compared to R1 only). At first I tried filtering the features found in ‘merged-read’ samples from the table containing both ‘merged-read’ and ‘R1-only’ samples, but then I realised this probably won’t work because the merged reads are longer than the R1 reads, so there won’t be any sequence variants in common between the two. I think I really want to be working at the level of taxonomy here.
I think I might be able to answer this question in one of two ways:

1. Trim the merged reads to 150 bp (the same length as R1), run Deblur on the merged-read and unmerged-read samples together, filter to retain the features found only in R1-read samples, then assign taxonomy to these to see what we are ‘missing’ by using merged reads. However, I am a little concerned about losing the information in the last 100 bp or so of the merged read by doing this – is there really any advantage in using the merged reads at all if I then trim them to the same length as an unmerged read?

2. Filter the feature table at the level of taxonomy rather than at the level of sequence variant, if there is some way to do that – or filter the taxonomy.qza file. I have only found scripts for filtering features or samples by abundance, by a specified index, or by a metadata category, and I’m not sure whether I could apply those here.
This is a really great question! I suspect the biggest driver of why you see fewer features in the merged data is that fewer sequences survive the implicit filter of merging. In other words, if your forward and reverse reads don't align, you probably won't end up with a sequence at all (although BBmerge may work differently, I can't really say).
Before even getting into the differences, you might run qiime demux summarize on both datasets to see if there are significantly fewer sequences available in your merged set.
Assuming that isn't what is causing the difference you see, we can explore the options you laid out. Option 1 might tell you something if BBmerge is the sole reason and is doing something odd at the intersection of the reads, but I think option 2 is a more complete approach (as it doesn't discard your bases), so I'll focus on that.
That's exactly right, but there is a way to remap those features to the taxonomy that they are classified as, which will give you a common language to find the difference between the tables.
This will be a little gross, because we don't have a way to use feature-tables as metadata (yet) or to filter FeatureData[Taxonomy] directly. But you should be able to do the following:
Run qiime taxa collapse on both of your feature tables (single-end and merged), which will map the feature IDs from ASVs to taxonomy strings (this will need the FeatureData[Taxonomy] artifact from each dataset). You may want to experiment with the --p-level parameter depending on how granular you need to be in finding the difference.
Next we need to find out which taxonomy strings are in your merged data. (This is the gross part.) What we need to do is run qiime feature-table summarize on your collapsed merged feature table. Then, on the first page of that visualization, download the CSV for "Frequency per feature detail".
You'll need to convert that to a TSV using your favorite spreadsheet program. You should now have a file, which we'll call merged-taxonomy-strings.tsv, containing a bunch of taxonomy strings in the first column and some numbers in the next column (we don't really care about those). EDIT: You will also need to add a header line; otherwise the first taxonomy string will be treated as a header and won't be used in the next command.
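If you'd rather script this than use a spreadsheet, the CSV-to-TSV conversion (including the header line) can be done with Python's standard csv module. This is just a sketch; the file and column names here are assumptions, not anything QIIME 2 requires:

```python
import csv
import io


def csv_to_metadata_tsv(csv_text, id_header="feature-id"):
    """Convert the 'Frequency per feature detail' CSV export to a TSV
    that can be used as a QIIME 2 metadata file.

    A header line is prepended because the first line of a metadata file
    is always treated as a header; without one, the first taxonomy string
    would be silently skipped. The column names are assumptions.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([id_header, "frequency"])  # the added header line
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Example: two taxonomy strings with their frequencies.
print(csv_to_metadata_tsv("k__X;p__Y;c__Z,123\nk__A;p__B;c__C,45\n"))
```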
Finally, we'll use merged-taxonomy-strings.tsv to remove any of those features from our collapsed single-end feature-table (from step 1):
This should have the effect of removing every taxonomy from your single-end table that was seen in your merged data, leaving you with only the observations unique to the single-end data. You can then do whatever you want with this new "difference" table. I should mention that there are probably some assignments that exist in the merged table but not in the single-end table, so keep in mind that you have only removed the assignments that were shared.
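Putting the three steps together, the commands would look roughly like this. All file names are placeholders, and --p-level 7 is just one choice; this assumes a QIIME 2 environment with the taxa and feature-table plugins installed:

```shell
# 1. Collapse each table from ASVs to taxonomy strings.
qiime taxa collapse \
  --i-table merged-table.qza \
  --i-taxonomy merged-taxonomy.qza \
  --p-level 7 \
  --o-collapsed-table merged-collapsed.qza

qiime taxa collapse \
  --i-table single-end-table.qza \
  --i-taxonomy single-end-taxonomy.qza \
  --p-level 7 \
  --o-collapsed-table single-end-collapsed.qza

# 2. Summarize the collapsed merged table, then download the
#    "Frequency per feature detail" CSV from the visualization and
#    convert it to merged-taxonomy-strings.tsv (with a header line).
qiime feature-table summarize \
  --i-table merged-collapsed.qza \
  --o-visualization merged-collapsed.qzv

# 3. Exclude every taxonomy string seen in the merged data from the
#    collapsed single-end table, leaving the single-end-unique taxa.
qiime feature-table filter-features \
  --i-table single-end-collapsed.qza \
  --m-metadata-file merged-taxonomy-strings.tsv \
  --p-exclude-ids \
  --o-filtered-table single-end-unique.qza
```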
There are definitely many fewer sequences in the merged dataset: 486,655 total sequences versus 832,662 using unmerged R1. So there are clearly a lot of sequences in the R1 that have no overlapping R2 and so fail to merge.
However, the difference in the total number of features (sOTUs) between the two tables was smaller than I might have expected given that discrepancy in sequence counts (only 33 fewer features in the merged dataset).
I guess our question is whether this loss of information in the merging step is introducing a bias we should be trying to avoid in our analysis.
I followed your suggested steps and it seems to have worked well. I collapsed the tables at Level 7. There were only 13 features in the final table (i.e. uniquely found in the unmerged table), so I guess some of the 30-odd extra features found in the unmerged dataset were similar enough to have the same classification at the species level. It would be quite nice to be able to look at what is different at the sequence variant level as well, but I’m guessing you would have suggested such a method if it existed … ?
Also, I’m not sure I follow why there might be some assignments that are in the merged table but not in the single-end table. Shouldn’t everything identified in the merged table also have been present in the single-end table? Or is it likely that taxonomic assignments could have shifted based on the extra information found in a longer read (i.e. some taxa might have been incorrectly assigned based on R1 only, but with the additional information from R2 they are correctly assigned a different taxonomy)?
Is there any potential in the future for the deblur plugin to be able to handle unmerged, paired-end data?
It doesn't sound like you've lost too much. The bias is going to depend on who's in your sample, as length can be a source of systematic bias if certain clades just happen to have a longer than average target region. Since you've already denoised both ways (this is usually the lengthy process) it should be relatively fast to run your analysis both ways and see if there are any meaningful differences in your interpretation of those results.
Awesome!
You could cross-reference using your FeatureData[Taxonomy] and qiime metadata tabulate. But that'll be a bit painful, and you still can't really map the single-end ASVs to the paired-end ASVs in any meaningful way (other than taxonomy/OTU clustering).
Exactly that. So it's possible your merged reads will have a more precise annotation thanks to the longer length.
I'm not sure. @wasade would know the answer to that though!
Thanks @ebolyen for the ping. With paired data, I really like going back to the RDP classifier paper, Wang et al 2007, which highlights the limitations of longer fragments when assigning taxonomy using naive Bayes. Ultimately, the assignments are going to be contingent on the correctness and resolution of the reference database irrespective of the assignment method.
@Matilda_H-D, as you observed as well, there is generally a loss of depth too in the pairing process. In addition, there is an implicit increased tolerance for error (both on the merge and in R2). I’m not aware of a comprehensive independent benchmark that has compared the pros and cons of merging reads; for questions of relative differences in alpha diversity, beta diversity, and probably taxonomy given the resolution of 16S, my guess is that for many questions asked in microbiome studies, pairing reads probably will not alter the conclusions drawn. It is possible that paired data may actually lead to inaccurate conclusions from reduced coverage or taxonomic misassignment due to merge or quality issues.
I do not believe it is assured that things identified in the merged table will be present in the single-end table. The merge process is unlikely to be perfect, many taxa are polyphyletic, and all reference databases contain chimeras.
@Matilda_H-D, to your specific question: we do not have that planned at this time. Our present method of handling paired data is in effect how you used Deblur here, where the pairing happens upstream and Deblur remains agnostic to it.
Thanks very much to both @ebolyen and @wasade for your helpful responses!
At this stage I think I will proceed with testing some analyses and seeing how much merging actually changes the conclusions.
One last thing: I repeated Evan’s suggested filtering workflow in reverse to see if there were any taxa that were uniquely found in the merged dataset. As you both suggested, there are indeed some things that appear only after merging, although there are only three of these features, compared to thirteen that are unique to the unmerged (at species level).
I think I understand what is going on, but just to be sure: two of the taxonomy strings that were ‘unique to the merged table’ were identical to two that were ‘unique to the unmerged table’. If I followed the steps correctly, does this mean that although these ‘unique’ entries look identical, they are actually different at the sequence variant (and species) level, and we just can’t distinguish them from the taxonomic assignments even though they could be distinguished by sequence? (The assignments in question are at genus and family level.)
By the time we start filtering the table (to see what's unique to that dataset) there isn't any ASV information left. Everything is a taxonomy string. So in principle, you shouldn't see any shared features between the two differences: unmerged - merged and merged - unmerged.
Could you share the "unique" taxonomy strings that seem to be shared? I'm guessing they are subtly different in some way that the computer is seeing (such as a trailing ; or s__ or something) but which is otherwise not very meaningful.
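One quick way to spot such invisible differences is to compare the two strings character by character rather than by eye; a minimal sketch (the example strings below are illustrative, not the actual ones from this thread):

```python
def first_difference(a, b):
    """Return (index, char_in_a, char_in_b) where two strings first
    diverge, or None if they are identical. "<end>" marks the point
    where the shorter string ran out."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    if len(a) != len(b):
        i = min(len(a), len(b))
        return i, a[i:i + 1] or "<end>", b[i:i + 1] or "<end>"
    return None


# A padded family-level string vs. one with explicit empty g__/s__ ranks:
a = "k__Bacteria;f__Veillonellaceae;__;__"
b = "k__Bacteria;f__Veillonellaceae;g__;s__"
print(first_difference(a, b))

# A trailing semicolon is also caught:
print(first_difference("k__Bacteria;", "k__Bacteria"))
```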
> By the time we start filtering the table (to see what's unique to that dataset) there isn't any ASV information left. Everything is a taxonomy string. So in principle, you shouldn't see any shared features between the two differences: unmerged - merged and merged - unmerged.
Hmm, but if the tables were collapsed at L7, isn't it possible there would be features (for want of a better word) that have the same taxonomic assignment (especially if not named down to the species level) but are different enough to be classified as different species? Maybe I don't quite understand what is happening when they're collapsed. Should I actually have collapsed at a lower taxonomic level if I don't have species-level assignments for all the taxa?
> Could you share the "unique" taxonomy strings that seem to be shared? I'm guessing they are subtly different in some way that the computer is seeing (such as a trailing ; or s__ or something) but which is otherwise not very meaningful.
So having just done that, I can see that the second pair (Veillonellaceae) are actually not exactly the same because one has the 'g' and 's' and the other doesn't (would you mind explaining why this would happen?), but the first pair still look the same to me.
I'm pretty sure I used the pretrained Greengenes 13.8 99% feature classifier to assign taxonomies -- sorry I'm not at the office at the moment so can't check for sure.
Your intuition is correct, it's just that collapsing basically sums up all of the observations at a given taxonomic level's assignment. So before you collapse the table, you have that information between your table and the FeatureData[Taxonomy], but once you collapse, the individual sequences get smushed together into the same taxonomy string.
As an example:
FeatureTable[Frequency]:

        f1  f2  f3
    s1   1   1   1
    s2   0   9   1

FeatureData[Taxonomy]:

    f1  k__X;p__Y;c__Z
    f2  k__X;p__Y;c__Z
    f3  k__A;p__B;c__C

After collapsing:

        k__X;p__Y;c__Z  k__A;p__B;c__C
    s1               2               1
    s2               9               1
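In plain Python, the collapse amounts to summing each sample's counts over features that share a taxonomy string. A sketch using the toy example above (this mimics the behaviour of qiime taxa collapse, it is not its actual implementation):

```python
from collections import defaultdict


def collapse(table, taxonomy):
    """Sum per-sample counts over features sharing a taxonomy string.

    table:    {sample_id: {feature_id: count}}
    taxonomy: {feature_id: taxonomy_string}
    Returns   {sample_id: {taxonomy_string: summed_count}}.
    """
    collapsed = {}
    for sample, counts in table.items():
        by_taxon = defaultdict(int)
        for feature, count in counts.items():
            by_taxon[taxonomy[feature]] += count
        collapsed[sample] = dict(by_taxon)
    return collapsed


# The toy example from above: f1 and f2 share a taxonomy string.
table = {"s1": {"f1": 1, "f2": 1, "f3": 1},
         "s2": {"f1": 0, "f2": 9, "f3": 1}}
taxonomy = {"f1": "k__X;p__Y;c__Z",
            "f2": "k__X;p__Y;c__Z",
            "f3": "k__A;p__B;c__C"}
print(collapse(table, taxonomy))
# {'s1': {'k__X;p__Y;c__Z': 2, 'k__A;p__B;c__C': 1},
#  's2': {'k__X;p__Y;c__Z': 9, 'k__A;p__B;c__C': 1}}
```

Notice that once the counts are summed, the individual feature IDs (f1, f2) are gone, which is why the ASV-level information can't be recovered after collapsing.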
The one without the g/s was only assigned to k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Veillonellaceae, but that isn't 7 levels deep, so qiime taxa collapse padded it out with __. The one that ended with g/s was assigned to an OTU that doesn't have a genus- or species-level annotation in Greengenes.
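The padding can be illustrated with a small sketch (again, this mimics the behaviour, it is not qiime taxa collapse's actual code):

```python
def pad_to_level(taxonomy_string, level=7):
    """Pad a taxonomy string with empty '__' ranks out to `level` ranks.

    A family-level assignment (5 ranks) collapsed at level 7 gains two
    empty ranks, which is why it ends in ';__;__' rather than ';g__;s__'.
    """
    ranks = taxonomy_string.split(";")
    ranks += ["__"] * (level - len(ranks))
    return ";".join(ranks)


print(pad_to_level(
    "k__Bacteria;p__Firmicutes;c__Clostridia;"
    "o__Clostridiales;f__Veillonellaceae"))
# k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Veillonellaceae;__;__
```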
As for the first set, I honestly have no explanation as to why and how that happened. Would you be able to DM me the two collapsed and filtered tables? I can double check a couple things in the provenance to make sure something didn't go wrong.
(Just as a sanity-check, there's definitely some shared features that are removed from both tables right? The filtering hasn't failed to remove things?)
@Matilda_H-D discovered the mistake in my instructions that was causing ...;g__Corynebacterium;s__ to show up in both tables after filtering.
The first line of a metadata file is always treated as a header line. The string above was the first line in the .csv pulled from the summarize visualization. This meant that even though it was assigned in both datasets, the feature-table filter-features command never actually saw that ID, so it was never removed.
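The effect is easy to reproduce, with Python's csv module standing in for QIIME 2's metadata parser (which likewise consumes the first line as a header). The taxonomy strings here are shortened for illustration:

```python
import csv
import io


def ids_seen(tsv_text):
    """Read IDs the way a metadata parser would: first line is the header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # the header line is consumed and never used as an ID
    return [row[0] for row in reader]


rows_without_header = "g__Corynebacterium\t10\ng__Veillonella\t5\n"

# Without a header line, the first taxonomy string is swallowed:
print(ids_seen(rows_without_header))
# ['g__Veillonella']

# Prepending a header line restores it:
print(ids_seen("feature-id\tfrequency\n" + rows_without_header))
# ['g__Corynebacterium', 'g__Veillonella']
```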