Hi all,
I have clustered my sequences into OTUs using QIIME2, which produced both otu-table.qza and rep-seq.qza. But I am still wondering whether there is a way to associate the OTU IDs with the read (sequence) IDs in the fastq files.
Here is an example of the effect I wanted to achieve:
|Read-id|OTU-id|
|---|---|
|SRR763292.1|OTU3|
|SRR763292.2|OTU89|
|SRR763292.3|OTU6923|
I’m not sure there’s a perfect way to do this within Qiime2. Some of the intermediate files produced by vsearch would include this information, but they are not added to the .qza files.
Would you be willing to post all the Qiime2 commands you ran to cluster your sequences? We might be able to work backwards from your rep-seq.qza file, or think about exporting that data from the sqlite database that’s used internally.
Hello Colin,
Thank you so much for your reply.
I used qiime vsearch cluster-features-closed-reference to pick OTUs against the SILVA v138 database at 97% identity. Here are the commands I used:
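In outline, the workflow was roughly the following (file names here are placeholders for my actual artifacts):

```bash
# Dereplicate the quality-filtered reads
qiime vsearch dereplicate-sequences \
  --i-sequences demux-trimmed-joined-filtered.qza \
  --o-dereplicated-table derep-table.qza \
  --o-dereplicated-sequences derep-seqs.qza

# Closed-reference clustering against SILVA v138 at 97% identity
qiime vsearch cluster-features-closed-reference \
  --i-sequences derep-seqs.qza \
  --i-table derep-table.qza \
  --i-reference-sequences silva138-97.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table otu-table.qza \
  --o-clustered-sequences rep-seq.qza \
  --o-unmatched-sequences unmatched.qza
```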
For your information, I am trying to follow the method introduced by a published paper. This paper provides a way to subset the OTUs clustered at 97% identity and re-cluster them at 100% identity through the UNOISE3 pipeline. I have checked the R script they used, and it seems that a "map file" like the one I posted above is required. So I am curious whether this is possible in QIIME2 as well.
Many thanks again! Looking forward to hearing from you.
Thanks for posting your full pipeline! Let's dive in!
There are two associations we need to track to connect original read IDs to OTU IDs.
vsearch dereplicate-sequences will replace read IDs with their sha1 hash, then return an 'OTU Map' showing which read IDs match each hash, along with a list of all unique hashes and their reads. (This is really a 'feature map' at this stage, as it's a mapping of features, not OTUs.)
vsearch cluster-features-closed-reference will find the closest database entry for each hashed read at or above 0.97 identity (ideally), then return an 'OTU Map' showing which hashed reads match each OTU in the database.
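Under the hood, those two maps look roughly like this in plain vsearch terms (a sketch only; file names are placeholders, and I'm assuming the sha1 relabelling step rather than quoting the exact q2-vsearch internals):

```bash
# Step 1: dereplication. --relabel_sha1 renames each unique read to the sha1
# hash of its sequence, and the .uc file records which original read IDs
# collapse into which hash (the 'feature map').
vsearch --derep_fulllength reads.fasta \
  --relabel_sha1 \
  --output derep.fasta \
  --uc feature-map.uc

# Step 2: closed-reference clustering. Each hashed read is searched against
# the reference database at 97% identity, and the .uc file records which
# hash matched which reference OTU (the 'OTU map').
vsearch --usearch_global derep.fasta \
  --db reference-otus.fasta \
  --id 0.97 \
  --strand plus \
  --uc otu-map.uc
```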
Since that feature is not natively implemented in Qiime2, a good workaround is to perform the closed-ref clustering step directly on the reads inside your demux-trimmed.qza file.
Why this works:
That first vsearch dereplicate-sequences step removes duplicate reads and replaces read names with hashes, but here we want all the reads and their names, even if they are duplicates!
Closed-ref clustering is just a database search (closed-ref OTU counting, if you will), and it will return a list of reads and database hits that is exactly what you described in your first post!
I can help you craft this command if you would like, but first I think we should zoom out!
This is really useful context. This paper describes a method for doing 'old school' OTU clustering, followed by 'new school' amplicon denoising using UPARSE + UNOISE3. After we finish the cluster-features-closed-reference step discussed above, we will still need to replicate the rest of the pipeline.
There might be a better way: skip the old school OTU step and go directly to amplicon denoising.
If you look at the GitHub repo for Stopnisek 2021, you can see that work began in 2019, when the amplicon analysis world was grappling with the shift from 'old school' OTUs to 'new school' amplicon sequence variants (ASVs). I think this paper does a very good job of bridging this gap, using both familiar OTU methods and making great use of modern ASVs.
In the context of your analysis, my advice is to skip the old-school OTU step and go directly to denoising. UNOISE3 is described here if you want to use that; it has also been implemented in vsearch since 2018. Qiime2 includes plugins for denoising with DADA2, and those are pretty well covered in the tutorials.
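If you go the DADA2 route, a minimal sketch looks something like this (parameter values are placeholders; pick the truncation length from your own quality plots):

```bash
# Denoise demultiplexed single-end (or pre-joined) reads into ASVs
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 150 \
  --o-table asv-table.qza \
  --o-representative-sequences asv-rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```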
Going directly to amplicon denoising does not change your core question, but it does streamline the pipeline.
Maybe we should zoom out even more...
What's your biological question? What evidence are you trying to collect by creating that table?
I am now doing a meta-analysis that compares sequences from different hypervariable regions, so I decided to use closed-reference OTU picking. I did notice that DADA2 + fragment insertion can also work for comparing sequences from different regions. But I still chose the 'old school' closed-ref OTU picking because, firstly, fragment insertion is a very time- and memory-intensive step, especially for a large meta-analysis. Moreover, another paper published by Stopnisek and Shade in 2019 introduced a method to define the core microbiome, which is also something I would like to do. They suggested using a lower identity threshold (e.g. 97%) rather than ASVs.
Does my choice sound reasonable to you?
However, in this era of ASVs, I still want to bridge the gap between OTUs and ASVs. Therefore, I am trying to find a way to subset the OTUs clustered at 97% identity (picking the core OTUs) and re-cluster them at 100% identity.
Thank you very much again!
Looking forward to your reply.
Are you focusing on benchmarking different analysis methods, or is combining these data sets a means to an end that will let you get to the biology of interest?
If the goal is to combine data and focus on the downstream biology, I would try very hard to avoid the impossible questions about differences between hypervariable regions. Using closed-ref OTU counting is a very good way to normalize across hypervariable regions.
Absolutely! I want to make sure I understand your final goal so I'm not giving you more problems to solve along the way.
Hello Colin,
Many thanks again!
I have collected over 40 studies from different parts of the world with publicly available data about the human oral microbiome. I am now trying to define the core oral microbiome and to find the variation caused by some physiological factors of the hosts (e.g. obese vs normal weight, ethnicity).
As you know, I used closed-ref OTU picking to generate the OTU tables.
Sorry about the stupid question, but could you please clarify for me the difference between de novo and closed-ref clustering?
Because it is about the oral microbiome, I tend to use the HOMD database, and I've used data from 10 studies as a trial. However, the resulting number of OTUs after clustering against HOMD (~900 features) is much smaller than after clustering against Greengenes or Silva (~2500), although the taxonomy bar plots and the alpha & beta diversity analyses look similar. Is this because the reference sequence set provided by HOMD is already very small, or did I do something wrong? If the former, will it have any adverse effects on downstream analyses (e.g. differential taxa)?
Cool! So it sounds like the goal is the downstream biology, instead of algorithm development.
With that in mind, I think it makes sense to focus on an elegant way to integrate these 40 studies, without getting sidetracked by differences between hypervariable regions.
Sure. De novo (Latin for 'anew') OTU clustering makes new OTUs based on the reads provided. Closed-ref OTU clustering does not make new OTUs at all; it simply counts matches to existing OTUs provided in a database.
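To make the contrast concrete, here's a rough de novo sketch (artifact names are placeholders). The closed-ref command you already ran is the same idea, except it adds --i-reference-sequences and sends anything that doesn't hit the database to --o-unmatched-sequences:

```bash
# De novo clustering: builds brand-new OTUs from your reads alone,
# with no reference database involved
qiime vsearch cluster-features-de-novo \
  --i-sequences derep-seqs.qza \
  --i-table derep-table.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table denovo-table.qza \
  --o-clustered-sequences denovo-seqs.qza
```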
Take a look at this thread. Their Ion Torrent data also spans different regions, so they are dealing with the same underlying problem as you. They also consider using closed-ref OTUs for these reasons:
This makes sense. You can only get OTUs that are in that specific database, so this bias is huge. Hopefully alpha and beta are similar.
oh thank goodness!
I'm not sure about the HOMD database, but it makes sense that a special-use database would have fewer taxa than a general-use database like Silva. Any OTUs not in the database will be totally missing from all downstream analyses, but that's the downside of this strong cross-region normalization.
Now that I know more about your study, I want to go back to the beginning: I think you chose the right method to move forward quickly, namely closed-ref plus maybe some ASV analysis of specific taxa.
Do you want to craft that vsearch command?
Do you have any other theory questions about closed-ref OTUs?
Yes! Dereplication replaces read names with the sha1 hash of the read, and we need to keep those original read IDs.
Here's my first draft of the pipeline:
Get your demultiplexed reads:
Export your demux-trimmed-joined-filtered.qza file (docs), then find the .fastq files inside the data folder.
Export your silva138-97.qza database file too.
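Something like this should do it (output folder names are arbitrary):

```bash
# Pull the raw data out of the QIIME 2 artifacts
qiime tools export \
  --input-path demux-trimmed-joined-filtered.qza \
  --output-path exported-reads/    # per-sample .fastq.gz files land in here

qiime tools export \
  --input-path silva138-97.qza \
  --output-path exported-ref/      # contains dna-sequences.fasta
```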
Decide if it's best to process each sample separately, or if you are able to combine samples and process all reads together:
If all your reads have the right labels, like SRR763292.1 from your example, you can combine all your fastq files into a single input. But you might need to add these labels, or do the search step separately for each sample. I'm not 100% sure what the reads will look like at this step of the original pipeline.
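Then the search itself would look roughly like this (file names are placeholders, and this assumes one sample at a time; adapt it if you end up combining everything):

```bash
# Optional: convert the exported reads to fasta. vsearch can usually read
# fastq directly, but fasta keeps the example simple.
vsearch --fastq_filter exported-reads/sample1.fastq.gz \
  --fastaout sample1.fasta

# Search every read against the reference database at 97% identity.
# Each 'H' (hit) record in the .uc output pairs one read ID with one
# reference OTU ID; 'N' records are reads with no match.
vsearch --usearch_global sample1.fasta \
  --db exported-ref/dna-sequences.fasta \
  --id 0.97 \
  --strand plus \
  --threads 4 \
  --uc reads-to-otus.uc
```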
Take a look at that .uc output file.
That output .uc file should be close to the table you described, with some extra columns. We can parse it into exactly the table you mention using awk or something, but let's see how it looks first!
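When you get there, a one-liner along these lines should be close (assuming the .uc file name from the sketch above):

```bash
# .uc columns: field 1 is the record type ('H' = hit), field 9 is the
# query (read) ID, field 10 is the matched reference (OTU) ID.
awk -F'\t' 'BEGIN {print "Read-id\tOTU-id"} $1 == "H" {print $9 "\t" $10}' \
  reads-to-otus.uc > read-to-otu-map.tsv
```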
This is an example of the output, using the HOMD database as the reference.
But I still have some additional questions... Is there any way to achieve the same result when doing denoising (e.g., DADA2) in QIIME2? Or do I have to use UNOISE3?
You could use this same search method to associate reads with ASVs created by DADA2 or UNOISE3, but there is a catch. Both DADA2 and UNOISE try to resolve sequence variants / zOTUs that differ by as little as a single base pair. Unlike OTU clustering or OTU counting, this does not involve a set identity threshold like --id 0.97.
So you could use this same method at 100%, but it would miss reads attributed to an ASV if they differ by even a single base pair.
Or you could search at a lower threshold like 97%, but this could pull in extra reads that were not included in the original ASV.
It's possible, but I wanted to discuss the limitations before you try it.
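If you do decide to try it, the search is the same pattern as above, just pointed at your exported ASV sequences (asv-rep-seqs.fasta below is a placeholder for whatever DADA2 or UNOISE3 representative sequences you export):

```bash
# --id 1.0 keeps exact matches only; lowering it (e.g. --id 0.97) recovers
# more reads but may pull in reads never attributed to that ASV.
vsearch --usearch_global sample1.fasta \
  --db asv-rep-seqs.fasta \
  --id 1.0 \
  --strand plus \
  --uc reads-to-asvs.uc
```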