I have clustered my sequences into OTUs using QIIME 2, producing both otu-table.qza and rep-seq.qza. But I am still wondering whether there is a way to associate the OTU IDs with the read (sequence) IDs in the fastq files.
Here is an example of the effect I wanted to achieve:
I’m not sure there’s a perfect way to do this within QIIME 2. Some of the intermediate files produced by vsearch would include this information, but they are not added to the .qza file.
Would you be willing to post all of the QIIME 2 commands you ran to cluster your sequences? We might be able to work backwards from your rep-seq.qza file, or think about exporting that data from the SQLite database that is used internally.
For your information, I am trying to follow the method introduced in a published paper. The paper provides a way to subset the OTUs clustered at 97% identity and re-cluster them at 100% identity through the UNOISE3 pipeline. I have checked the R script they used; it seems a “map file”, just like the one I posted, is required. So I am curious whether this is possible in QIIME 2 as well.
Many thanks again! Looking forward to hearing from you.
(Matthew Ryan Dillon)
Thanks for posting your full pipeline! Let’s dive in!
There are two associations we need to track to connect original read IDs to OTU IDs.
vsearch dereplicate-sequences will replace read IDs with their sha1 hash, then return an ‘OTU map’ showing which read IDs match each hash, i.e. a list of all unique hashes and their reads. (This is really a ‘feature map’ at this stage, since it maps features, not OTUs.)
vsearch cluster-features-closed-reference will find the closest reference database entry for each hashed read at or above the 0.97 identity threshold, then return an ‘OTU map’ showing which hashed read matches each OTU in the database.
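To make the first association concrete, here is a minimal sketch of the hashing step, assuming the feature ID is simply the sha1 hex digest of the sequence text (vsearch’s sha1 relabeling works this way; the real tool may normalize case or gaps differently):

```python
import hashlib

def feature_id(sequence: str) -> str:
    """Sketch of sha1 relabeling: hash the sequence text itself.

    Assumption: the ID is the sha1 hex digest of the raw sequence;
    the actual tool may normalize case before hashing.
    """
    return hashlib.sha1(sequence.encode("ascii")).hexdigest()

# Identical reads collapse to the same feature ID; any change yields a new one.
a = feature_id("ACGTACGTACGT")
b = feature_id("ACGTACGTACGT")
c = feature_id("ACGTACGTACGA")
print(a == b, a == c, len(a))  # prints: True False 40
```

This is why dereplication loses the original read names: many reads can share one hash, so the reverse mapping has to be saved separately.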
Since this mapping isn’t exposed natively in QIIME 2, a good workaround is to run closed-reference clustering with vsearch cluster-features-closed-reference directly on the reads inside your demux-trimmed.qza file.
Why this works:
The first step, vsearch dereplicate-sequences, removes duplicate reads and replaces read names with hashes, but here we want all the reads and their original names, even the duplicates!
Closed-ref clustering is just database search (closed-ref OTU counting, if you will), and will return a list of reads and database hits that is exactly what you described in your first post!
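That read-to-database-hit listing can be pulled from vsearch’s .uc output. Here is a sketch assuming the standard .uc layout (tab-separated; the first column is the record type, ‘H’ for hit, ‘N’ for no hit; the ninth column is the query/read label; the tenth is the matched reference/OTU label). File contents below are synthetic, not real vsearch output:

```python
import csv
from collections import defaultdict

def read_to_otu_map(uc_lines):
    """Build {otu_id: [read_ids]} from vsearch .uc records.

    'H' rows are hits (the read matched a reference OTU);
    'N' rows are reads that failed to match, collected under 'unmatched'.
    """
    otu_map = defaultdict(list)
    for row in csv.reader(uc_lines, delimiter="\t"):
        if not row:
            continue
        record_type, query, target = row[0], row[8], row[9]
        if record_type == "H":
            otu_map[target].append(query)
        elif record_type == "N":
            otu_map["unmatched"].append(query)
    return dict(otu_map)

# Tiny synthetic example of .uc-style records:
demo = [
    "H\t0\t250\t99.2\t+\t0\t0\t250M\tread_1\tOTU_42",
    "H\t0\t250\t97.6\t+\t0\t0\t250M\tread_2\tOTU_42",
    "N\t*\t250\t*\t*\t*\t*\t*\tread_3\t*",
]
print(read_to_otu_map(demo))
```

Run over the real .uc file from the closed-ref step, this gives exactly the read-ID-to-OTU-ID table you asked for in your first post.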
I can help you craft this command if you would like, but first I think we should zoom out!
This is really useful context. This paper describes a method for doing ‘old school’ OTU clustering, followed by ‘new school’ amplicon denoising using uparse + unoise3. After we finish the cluster-features-closed-reference step discussed above, we will still need to replicate the rest of the pipeline.
There might be a better way: skip the old school OTU step and go directly to amplicon denoising.
If you look at the GitHub repo for Stopnisek 2021, you can see that work began in 2019, when the amplicon analysis world was grappling with the shift from ‘old school’ OTUs to ‘new school’ amplicon sequence variants (ASVs). I think this paper does a very good job of bridging that gap, using familiar OTU methods while making great use of modern ASVs.
In the context of your analysis, my advice is to skip the old-school OTU step and go directly to denoising. UNOISE3 is described here if you want to use it; it has also been implemented in vsearch since 2018. QIIME 2 includes plugins for denoising with DADA2, and those are pretty well covered in the tutorials.
Going directly to amplicon denoising does not change your core question, but it does streamline the pipeline.
Maybe we should zoom out even more…
What’s your biological question? What evidence are you trying to collect by creating that table?
(Matthew Ryan Dillon)
I am now doing a meta-analysis that compares sequences from different hypervariable regions, so I decided to use closed-reference OTU picking. I did notice that DADA2 + fragment insertion can also work for comparing sequences across regions, but I still chose the ‘old school’ closed-ref OTU picking because, firstly, fragment insertion is a very time- and memory-intensive step, especially for a large meta-analysis. Moreover, another paper, published by Stopnisek and Shade in 2019, introduced a method to define the core microbiome, which I would also like to perform; they suggested using a lower identity threshold (e.g. 97%) rather than ASVs.
Does my choice sound reasonable to you?
However, in this era of ASVs, I still want to bridge the gap between OTUs and ASVs. Therefore, I am trying to find a way to subset the OTUs clustered at 97% identity (i.e. pick the core OTUs) and re-cluster them at 100% identity.
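To make the goal concrete, this is the kind of bookkeeping I have in mind, as a sketch (all names are hypothetical): starting from a read-to-OTU map like the one discussed above, keep only the reads assigned to the core OTUs picked at 97%, and those reads are what would then be re-clustered at 100% identity.

```python
def subset_core_reads(otu_map, core_otus):
    """Keep only the reads belonging to core OTUs.

    otu_map: {otu_id: [read_ids]}, e.g. parsed from vsearch output;
    core_otus: the set of OTU IDs selected as 'core' at 97% identity.
    The returned reads would be fed to the 100%-identity step.
    """
    return {otu: reads for otu, reads in otu_map.items() if otu in core_otus}

# Hypothetical example:
otu_map = {"OTU_1": ["r1", "r2"], "OTU_2": ["r3"], "OTU_3": ["r4", "r5"]}
core = {"OTU_1", "OTU_3"}
print(subset_core_reads(otu_map, core))  # OTU_2's read r3 is dropped
```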
Thank you very much again!
Looking forward to your reply.