Re-classify sequences from gg to silva using .qza file

basil0125 · February 25, 2021, 9:16pm

Hello,

This is my first time working with 16S samples and qiime2 data artifacts. I have QIIME2 data artifacts that were provided to me for downstream statistical analysis (including taxonomy.qza and featuretable.qza). I’d like to change the taxonomic classification method of my ASVs from GreenGenes to Silva. Is this something I do with my qza files? Otherwise, I have fastq files (four per sample, I1, I2, R1, R2) but I have not previously worked with this type of data and would not know where to begin with re-classification.

In reviewing the q2-feature-classifier tutorial, it looks like I need: i) a reference taxonomy file, ii) my sequences. Can I extract my sequences from the taxonomy.qza file? (I think they are somewhere in the taxonomy.qza file based on looking at the Provenance Graph but perhaps I am misunderstanding).

Any other information would help. Thanks very much.

jwdebelius · February 25, 2021, 11:18pm

Hi @basil0125,

Welcome to the forum!

There are a couple of ways to tackle re-classification, but it's non-trivial. Did you get an artifact with the semantic type FeatureData[Sequence], i.e. a set of representative sequences? It would be an output of denoising. If not, I would contact the person who generated your data, because the representative sequences are the basis of several downstream analyses. (If they can't provide this file, I would ask them to reprocess the data and provide it to you!)

If you can't get that file, you might still be able to recover the sequences if the feature ids in the table or taxonomy are the sequences themselves. (If they're hashed, you're out of luck.) When the ids are sequences, you need to make a FASTA file in which each record uses the sequence as both its name and its sequence (so the sequence appears twice). If my ids are WANTCAT and CATCATCAT, my file would look like this:

>WANTCAT
WANTCAT
>CATCATCAT
CATCATCAT
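
If there are many ids, writing that file by hand gets tedious. Here is a minimal Python sketch of the same idea, assuming the ids really are nucleotide sequences; the function name and the output path are just placeholders I made up:

```python
from pathlib import Path


def write_fasta_from_ids(feature_ids, path):
    """Write a FASTA file in which each record's name is also its sequence."""
    valid = set("ACGTURYSWKMBDHVN")  # IUPAC nucleotide codes
    lines = []
    for fid in feature_ids:
        seq = fid.strip().upper()
        # Hashed ids (e.g. MD5 hex digests) almost always contain digits,
        # so they fail this check and cannot be turned back into sequences.
        if not seq or set(seq) - valid:
            raise ValueError(f"{fid!r} does not look like a nucleotide sequence")
        lines.append(f">{seq}")  # header line: the sequence used as the name
        lines.append(seq)        # sequence line: the same string again
    Path(path).write_text("\n".join(lines) + "\n")
    return path


# Hypothetical output path, using the two example ids from this post.
write_fasta_from_ids(["WANTCAT", "CATCATCAT"], "rep-seqs.fasta")
```

The resulting file could then be imported as FeatureData[Sequence] with `qiime tools import`.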

Alternatively, you can reprocess the data yourself. I'd start with the PD Mice tutorial.

Best,
Justine

basil0125 · February 26, 2021, 3:39am

Hello Justine,

Thanks for your response! In examining the Provenance Graph, I do see a FeatureData[Sequence] with format: "DNASequencesDirectoryFormat"

Forgive my ignorance- does this mean that this piece was used to create the ultimate taxonomic classifications, or that it is actually stored in the qza file? (I.e., can I extract it?)

(Unfortunately my feature table was hashed.)

jwdebelius · February 26, 2021, 3:42am

Hi @basil0125,

That means they generated and used the file. But, the artifact you have does not contain that file. (They would get pretty big if they did!)

So, you either need to get the file from the person who produced your results for you (I’d go for broke and ask for denoising stats as well), or you need to re-process your data.
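
If you want to check for yourself what an artifact actually holds: a .qza file is a zip archive, with the payload under a `data/` folder and the provenance record stored separately. A small sketch using only the Python standard library (the function name is mine, not part of QIIME 2):

```python
import zipfile


def list_qza_data_files(qza_path):
    """Return the payload files stored under <uuid>/data/ in a .qza archive."""
    with zipfile.ZipFile(qza_path) as archive:
        data_files = []
        for name in archive.namelist():
            parts = name.split("/")
            # Layout: <uuid>/data/<file>; provenance lives in <uuid>/provenance/.
            # Directory entries end with "/" and give an empty last part.
            if len(parts) >= 3 and parts[1] == "data" and parts[-1]:
                data_files.append("/".join(parts[2:]))
        return sorted(data_files)
```

For a taxonomy artifact I would expect this to list only a taxonomy table (typically `taxonomy.tsv`), with no FASTA file, which is why the sequences can't be pulled out of taxonomy.qza.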

Best,
Justine

basil0125 · March 1, 2021, 9:29pm

Hi Justine,

Got it- thanks! I was able to obtain the representative_sequences.qza.

If GG was initially chosen because it resolved the Zymo positive controls more effectively than Silva, will I run into major issues actually using the Silva database? Is there a way to quantify “better resolve” of controls?

Thanks.

jwdebelius · March 1, 2021, 9:55pm

Hi @basil0125,

I’m glad you got the representative sequences!

I would not expect major issues, unless you’re focusing on the quality of annotation for a Zymo mock control. Taxonomy gets complicated between databases, especially as you get more specific, and the quality of annotation varies.
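
For the re-classification itself, the usual route in q2-feature-classifier is `classify-sklearn` with a Silva-trained classifier. Here is a rough sketch that only assembles the command. The classifier and output file names are placeholders, and I'm assuming you have a Silva classifier that matches your QIIME 2 (scikit-learn) version and, ideally, your amplified region:

```python
import shlex


def build_classify_command(reads_qza, classifier_qza, output_qza):
    """Assemble a `qiime feature-classifier classify-sklearn` call as an argv list."""
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-reads", reads_qza,
        "--i-classifier", classifier_qza,
        "--o-classification", output_qza,
    ]


cmd = build_classify_command(
    "representative_sequences.qza",
    "silva-classifier.qza",  # hypothetical name for a Silva-trained classifier
    "taxonomy-silva.qza",    # hypothetical output name
)
print(shlex.join(cmd))
# To run it inside an activated QIIME 2 environment:
#   import subprocess; subprocess.run(cmd, check=True)
```

The new taxonomy artifact can then be used in place of the GreenGenes one alongside your existing feature table, since the feature ids stay the same.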

Best,
Justine
