Trace back taxonomic classification to corresponding read IDs

Salocin · February 19, 2021, 10:26am

Hi all,

Is there a function within QIIME2 to connect taxonomic classification not only to the corresponding feature ID/sequence but to the read IDs of the sample fasta/fastq file?
I am also open to any custom-made solution.

Best wishes,
Marc Bendisch

timanix · February 19, 2021, 10:47am

Hi!
I am not sure if this is what you are looking for, but I wrote a Python script that modifies the sequence IDs, representative sequences, and taxonomy file by adding the last available taxonomic unit to each sequence ID, so that all downstream analyses in QIIME2 can be performed with the modified files. Let me know if you need the code.

Salocin · February 19, 2021, 11:13am

Thanks for the offer @timanix !
Basically, I want to be able to see how every read from my sample file was classified. It would be perfect if I could get the whole lineage, but the highest-ranking classification would also be sufficient. That sounds like something your script can provide, if I understand you correctly.

timanix · February 19, 2021, 11:16am

You can check this link, I provided some examples there.
The code may need some adjustments depending on the taxonomy used in your dataset.

Salocin · February 19, 2021, 1:45pm

Sadly I don't think your script can solve my problem. What I am interested in is something like this:

But instead of a taxonomic annotation for each Feature ID, I need the taxonomic annotation assigned directly to each read ID, or at least to the whole original read as it appears in the sample fastq file.

llenzi · February 19, 2021, 2:22pm

Hi @Salocin,

What command/tool do you use to get the feature table?

If you are dealing with OTUs obtained with VSEARCH, you may be able to output a file listing which OTU each sequence fell into. You can then do a bit of scripting to merge all the information into a final table.
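
As a rough illustration of that scripting step, here is a minimal Python sketch. It assumes the clustering was run with standalone vsearch and that the assignments were written in its tab-separated `.uc` format, where `S` records mark centroids, `H` records mark reads assigned to a centroid, column 9 holds the read label, and column 10 holds the centroid label. The function name `read_to_otu` and the example rows are made up for illustration.

```python
import io

def read_to_otu(uc_handle):
    """Map each read label to the label of the centroid (OTU) it belongs to."""
    mapping = {}
    for line in uc_handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "S":
            # Centroid record: the read represents its own cluster.
            mapping[query] = query
        elif record_type == "H":
            # Hit record: the read was assigned to an existing centroid.
            mapping[query] = target
        # "C" (cluster summary) and "N" (no hit) records are skipped.
    return mapping

# Hypothetical .uc content with one centroid and one read assigned to it.
example_uc = (
    "S\t0\t250\t*\t*\t*\t*\t*\tread1\t*\n"
    "H\t0\t250\t99.2\t+\t0\t0\t250M\tread2\tread1\n"
    "C\t0\t2\t*\t*\t*\t*\t*\tread1\t*\n"
)
otu_of = read_to_otu(io.StringIO(example_uc))
print(otu_of)
```

The resulting dictionary can then be joined with the taxonomy table on the OTU label, provided the OTU IDs in the taxonomy file match the centroid labels.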

If you have ASVs, things are a bit more complicated, because I don't think dada2 and deblur track the fate of each sequence after denoising; they just output the final count for each unique ASV obtained.
A potential way to do what you need is to align all the sequences back onto the feature sequence file (vsearch may do that for you) and, from the output file, work out the best match for each sequence. Then, again, merge all the information with a bit of scripting.
(The other way I can see is to taxonomically classify all the sequences, but I assume that is not feasible, of course ...)
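
A minimal Python sketch of the align-back-and-merge idea follows. It assumes the reads were searched against the exported representative sequences with something like vsearch `--usearch_global` and `--blast6out`, and that the taxonomy was exported from QIIME2 as a TSV with `Feature ID` and `Taxon` columns. The best match is chosen by percent identity (column 3), since vsearch does not compute a bit score for nucleotide alignments. All function names and the example data are hypothetical.

```python
import csv
import io

def best_hits(b6_handle):
    """Keep, for each read, the feature with the highest percent identity."""
    best = {}
    for row in csv.reader(b6_handle, delimiter="\t"):
        read_id, feature_id, identity = row[0], row[1], float(row[2])
        # Ties keep the first hit seen for that read.
        if read_id not in best or identity > best[read_id][1]:
            best[read_id] = (feature_id, identity)
    return {read_id: hit[0] for read_id, hit in best.items()}

def load_taxonomy(tsv_handle):
    """Read a taxonomy TSV with 'Feature ID' and 'Taxon' columns."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return {row["Feature ID"]: row["Taxon"] for row in reader}

def classify_reads(b6_handle, tsv_handle):
    """Return {read ID: (feature ID, taxon)} by joining hits with taxonomy."""
    taxonomy = load_taxonomy(tsv_handle)
    return {
        read_id: (feature_id, taxonomy.get(feature_id, "Unassigned"))
        for read_id, feature_id in best_hits(b6_handle).items()
    }

# Hypothetical example data: two reads, two features.
example_hits = (
    "read1\tasvA\t100.0\t250\t0\t0\t1\t250\t1\t250\t-1\t0\n"
    "read1\tasvB\t97.2\t250\t7\t0\t1\t250\t1\t250\t-1\t0\n"
    "read2\tasvB\t99.6\t250\t1\t0\t1\t250\t1\t250\t-1\t0\n"
)
example_taxonomy = (
    "Feature ID\tTaxon\tConfidence\n"
    "asvA\tk__Bacteria; p__Firmicutes\t0.99\n"
    "asvB\tk__Bacteria; p__Proteobacteria\t0.95\n"
)
assignments = classify_reads(io.StringIO(example_hits), io.StringIO(example_taxonomy))
for read_id, (feature_id, taxon) in assignments.items():
    print(read_id, feature_id, taxon, sep="\t")
```

Reads with no hit above the identity threshold do not appear in the hits file, so they are simply absent from the result. Note also that a read matching a feature best does not prove the denoiser assigned it to that feature; this is only an approximation.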

I hope this makes sense and is helpful.
Cheers

Salocin · February 24, 2021, 9:52am

Hi @llenzi,

thanks for the help!
In fact, there is a way to track individual reads within the dada2 workflow:

https://github.com/benjjneb/dada2/issues/889

llenzi · February 24, 2021, 3:57pm

Hi @Salocin,
thanks for sharing the solution you found! Definitely a good thing to know!
Given that you will have to work with dada2 in R to use this trick, my suggestion is to run R in the same qiime2 conda environment you used for the qiime2 analysis.
The only reason is to make sure you use the same version of dada2 in both cases (qiime2 and standalone R), so that you can replicate the q2-dada2 plugin result by using the same settings.
Cheers
Luca