Silva accession ID export

qwcheng · September 29, 2019, 8:25pm

I'm using QIIME 2 2019.1 to process my 16S data from Illumina sequencing. I basically followed the "Moving Pictures" tutorial and ran the taxonomic analysis with a self-trained Silva 132 99 classifier. Everything worked perfectly, except when I tried to get the Silva accession number corresponding to each microbial species. I was only able to find the long OTU ID (e.g., 000434dac16f4575fbd144799a8a97e2) and the taxonomy (e.g., D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Enterobacteriales;D_4__Enterobacteriaceae;D_5__Klebsiella;D_6__uncultured organism). I also want to export the Silva accession number (e.g., HQ774489) for each microbial species, so that other people can look the species up in the Silva online database. The accession number for each species is listed in the taxonomy_7_levels.txt file in Silva_132_release.zip. I've tried the links in the rep-seqs.qzv file, which let me BLAST the sequence, but I get a million matches with different scores, and I'm not sure which match is the one I got from my taxonomic analysis in QIIME 2.

Can somebody help with this issue? Thanks!

timanix · September 30, 2019, 6:12am

Hi! You can use the long taxonomy names to search for accession IDs in the reference files you used for classifier training, but you will need to write a script in Python or another language.
I would be glad if someone here could provide a better solution.
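
A minimal sketch of such a lookup script, in Python. It assumes the taxonomy file has two tab-separated columns (reference ID, then taxonomy string) and that reference IDs look like `accession.start.stop`; check both against your copy of taxonomy_7_levels.txt. The rows and IDs below are made-up placeholders, and `normalize`, `load_taxonomy_index` and `accessions_for` are hypothetical helper names. Note that one taxonomy string is typically shared by many reference sequences, so this returns a list of candidates, not the single sequence your feature "came from".

```python
import io
from collections import defaultdict


def normalize(taxonomy):
    """Strip whitespace around each rank so lookups ignore spacing differences."""
    return ";".join(rank.strip() for rank in taxonomy.split(";") if rank.strip())


def load_taxonomy_index(handle):
    """Map each taxonomy string to every reference ID annotated with it."""
    index = defaultdict(list)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ref_id, taxonomy = fields[0].strip(), fields[1]
        index[normalize(taxonomy)].append(ref_id)
    return index


def accessions_for(index, taxonomy):
    """Return the accession part (text before the first '.') of each matching ID."""
    matches = index.get(normalize(taxonomy), [])
    return sorted({ref_id.split(".")[0] for ref_id in matches})


# Made-up rows standing in for taxonomy_7_levels.txt; the IDs are placeholders.
example = io.StringIO(
    "XX000001.1.1500\tD_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria\n"
    "XX000002.1.1480\tD_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria\n"
    "XX000003.1.1490\tD_0__Bacteria;D_1__Firmicutes;D_2__Bacilli\n"
)
index = load_taxonomy_index(example)
hits = accessions_for(
    index, "D_0__Bacteria; D_1__Proteobacteria; D_2__Gammaproteobacteria"
)
print(hits)
```

With a real file you would replace the `io.StringIO` example with `open("taxonomy_7_levels.txt")` and pass in the taxonomy strings from your exported taxonomy table.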

jwdebelius · September 30, 2019, 7:27am

Hi @qwcheng,

Your ideas are in the right place! If you followed the basic Moving Pictures tutorial, I'm guessing you denoised with DADA2 and then did feature classification using classify-sklearn against Silva?

The technique you use with classify-sklearn is not alignment-based. It is a machine-learning (naive Bayes) approach where, basically, your sequence and the reference db are broken into "words" (k-mers), and based on the breakdown of the "words" the algorithm determines where the sequence belongs. It was explained to me like my email. If I get a message from an unknown sender with a .edu email address that contains the words "ASV", "Silva" and "feces", it's probably a real message for me, but if I get an email from an unknown sender with a .com email address that contains the words "feces", "glitter" and "standard post", it might be a promotional email that I don't want. The final "bin" the message goes into doesn't necessarily map back perfectly to the original bag of training words (again, because I work with a lot of fecal samples, "feces" can be a neutral word for me), just like your BLAST search didn't map back perfectly.
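
The "bag of words" idea can be sketched as a toy k-mer classifier in Python. This is only an illustration with naive Bayes-style scoring, uniform priors and add-one smoothing, all of which are my assumptions; it is not the actual classify-sklearn implementation, and the labels and sequences are invented.

```python
import math
from collections import Counter


def kmers(seq, k=4):
    """Break a sequence into overlapping 'words' of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(references, k=4):
    """Count k-mers per label from (label, sequence) training pairs."""
    counts = {}
    for label, seq in references:
        counts.setdefault(label, Counter()).update(kmers(seq, k))
    return counts


def classify(seq, counts, k=4):
    """Pick the label whose k-mer profile makes the query most likely."""
    vocab = set().union(*counts.values())
    best_label, best_score = None, -math.inf
    for label, label_counts in counts.items():
        total = sum(label_counts.values())
        # Add-one smoothing so unseen words do not zero out the score.
        score = sum(
            math.log((label_counts[word] + 1) / (total + len(vocab)))
            for word in kmers(seq, k)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data: two hypothetical genera.
references = [
    ("Genus_A", "ACGTACGTACGTACGT"),
    ("Genus_A", "ACGTACGAACGTACGT"),
    ("Genus_B", "TTGGCCAATTGGCCAA"),
]
model = train(references)

# The query matches no training sequence exactly, yet still gets a label.
query = "ACGTACGTTCGTACGT"
print(classify(query, model))
```

The point of the sketch is that the label comes from the overall word profile of a whole group of references, so there is no single reference sequence (and no single accession) behind the assignment.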

The cool thing about ASVs, though, is that they're externally valid single sequences that are independent of your database. Your ASV ID (000434dac16f4575fbd144799a8a97e2) is actually representative of the sequence contained in your data. And so, if I want to compare your data against mine in the future, I can actually just look for your ASV IDs (or hopefully, a rep set in your supplement) and figure out if I can replicate your important ASV in my data. Which, to some degree, is better than a Silva map, IMO.
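
As a small illustration of why the ID travels with the sequence: assuming DADA2 was run with its default hashed feature IDs, each ID is the MD5 hash of the ASV sequence string, so anyone with the same sequence can recompute it. The function name and the sequence fragment below are invented for the example.

```python
import hashlib


def asv_id(sequence):
    """MD5 hex digest of the exact sequence string (assumed ID scheme)."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


# Invented fragment, not a real ASV from any dataset.
seq = "TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGT"

my_id = asv_id(seq)
same_id = asv_id(seq)           # identical sequence, identical ID
other_id = asv_id(seq + "A")    # any difference changes the ID completely

print(my_id)
print(my_id == same_id, my_id == other_id)
```

The hash is computed on the exact string, so the same region, trimming and orientation are needed for two studies to produce matching IDs.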

If you really want the Silva names, you could also cluster your denoised sequences closed-reference against Silva (see the OTU-clustering tutorial), and then each of your features would be assigned a Silva ID based on that clustering.
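
A rough sketch of that clustering step, written as a Python helper that builds the `qiime vsearch cluster-features-closed-reference` command from the OTU-clustering tutorial. All file names are placeholders, the 0.99 identity is an assumption chosen to match the Silva 99 reference set, and the reference sequences are assumed to be already imported as a QIIME 2 artifact. The helper only prints the command; nothing is executed.

```python
import shlex


def closed_reference_command(table, rep_seqs, reference_seqs, perc_identity=0.99):
    """Build the argument list for closed-reference clustering (not executed)."""
    return [
        "qiime", "vsearch", "cluster-features-closed-reference",
        "--i-table", table,
        "--i-sequences", rep_seqs,
        "--i-reference-sequences", reference_seqs,
        "--p-perc-identity", str(perc_identity),
        # Output names below are placeholders.
        "--o-clustered-table", "table-cr.qza",
        "--o-clustered-sequences", "rep-seqs-cr.qza",
        "--o-unmatched-sequences", "unmatched-cr.qza",
    ]


cmd = closed_reference_command(
    table="table.qza",
    rep_seqs="rep-seqs.qza",
    reference_seqs="silva-132-99-seqs.qza",  # placeholder artifact name
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run it inside a QIIME 2 environment, the list could be passed to `subprocess.run(cmd, check=True)`. Features in the clustered table are then keyed by the Silva reference IDs, and sequences that match nothing at the chosen identity end up in the unmatched output.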

Best,
Justine
