extract original sample reads assigned to certain taxa

Hi! I have used QIIME to analyze some 16S V4 data and am currently stuck on something that should be pretty obvious (but I can't find anything relevant already in the forum).

Now that I have taxonomic IDs for my features, I want to go back to the original sample and pull out all the individual reads that were assigned to that taxonomy so I can align them all and calculate the percent identity between the 16S Silva hit and the original reads (i.e. is this the exact same species or just the closest relative in available in the database).

I have only been able to extract the feature ID/ representative sequence for this read and can't figure out how to get all of the reads across all of my samples so i'm hoping someone on here can help!

Thank you :slight_smile:

Welcome to the forum!
To get counts by taxa, you can collapse the feature table to the taxonomy level.
Also, it is possible to export a feature table as a biom file and then convert it to tsv file. Taxonomy can be added to tsv file using R or Python or even Excel.
Finally, it is possible to filter the feature table to contain only IDs you provide as input.

Please let us know if you will need help with one of the ways above or if I misunderstood your question and did not answer it correctly.


1 Like

Hi - thanks for your reply :slight_smile:

What i'm trying to get is the original reads (from my sample FASTQ files) that were assigned to a certain taxa using QIIME2. But I'm having a hard time with this because all I can seem to figure out is exporting a single representative sequence for each taxa...is there a way to link the taxa to the original FASTQ sequences for each sample?

The only way I can think of is to get ASV IDs from the taxonomy file that were assigned to the taxa of interest and then get full sequences from rep-seqs.qza file. But it may be challenging to search for these sequences in original fastq files since in rep-seqs.qza file paired reads are merged into one sequence...

Thanks for your help! I figured out a few things so wanted to share here in case it helps anyone out in the future:

One thing I was having difficulty with was that using the new Greengenes2 plugin was providing feature IDs for my ASVs that were inconsistent with the feature IDs from the DADA2 denoising table - so I had no way to match up the tax ID with the FASTA sequences from the rep-seqs.qza file. I stumbled across another thread of someone who had the same issue (see here: ASV table and Taxonomy table have different formats with greengenes2 - #2 by wasade) and it looks like it may be an incompatibility issue when trying to use original QIIME2 features with the new Greengenes2 plugin. When I just used the QIIME2 classifier to assign taxonomy the feature IDs were consistent and I was able to link the rep-seqs.qza to my tax IDs from Greengenes2.

E.g., replaced Greengenes2 plugin code:
qiime greengenes2 non-v4-16s
--i-table denoise_table_filtered.qza
--i-sequences denoise_seqs_filtered.qza
--i-backbone 2022.10.backbone.full-length.fna.qza
--o-mapped-table gg2_table.qza
--o-representatives gg2_seqs.qza

qiime greengenes2 taxonomy-from-table
--i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza
--i-table gg2_table.qza
--o-classification gg2_table_taxonomy.qza

with QIIME2 classification code:
qiime feature-classifier classify-sklearn
--i-reads denoise_seqs_filtered.qza
--i-classifier gg_2022_10_backbone.v4.nb.qza
--o-classification gg2_table_taxonomy_sklearn.qza

qiime tools export
--input-path gg2_table_taxonomy_sklearn.qza
--output-path gg2_taxa

Something interesting I noticed from doing this was how different my resulting number of classified features are between the 2 above methods - and I don't really understand why.

With the Greengenes2 plugin I ended up with only 602 classified ASVs in the gg2_table.qza from 8,282 input ASVs from DADA2.

With the QIIME2 sklearn classification method I ended up with almost all of my ASVs classified (8,237/8,282).

Is there a reason this could be happening? Am I totally missing something?

Thanks again for your help :slight_smile:


Hi again!
Congratulations with solving the issue!
I was not aware that you used GG2 method to assign taxonomy - in the initial post Silva was indicated, so I supposed that you used scikit-learn approach.
But you are totally right - GG2 will replace sequences from your dataset with sequences from database to which they were aligned. So IDs after Dada2 and after GG2 do not match. I was confused because I thought that you want to trace origin of sequences after Dada2 in raw fastq files.
As I understand, GG2 will align your sequences to the database. So numerous ASVs can be assigned to the same sequence from database, and the number of unique sequences will decrease after GG2. I also suppose that unassigned sequence will be discarded, but this part should be checked with GG2 developers.


1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.