Trained Greengenes2 Classifier Only Classified Taxonomy to Level 2

Hello

I previously used QIIME2 2023.5 and 2023.7 q2-feature-classifier on MiSeq v3-v4 data and was able to get taxonomic classification down to Level 7 using both SILVA 138 and Greengenes2.

However, this time I encountered problem when I trained the Greengenes2 classifier. I am using QIIME2 2023.9 Amplicon to analyze an existing MiSeq v4 data.

The commands are as follows:
qiime feature-classifier extract-reads
--i-sequences 2022.10.backbone.full-length.fna.qza
--p-f-primer GTGYCAGCMGCCGCGGTAA
--p-r-primer GGACTACNVGGGTWTCTAAT
--o-reads gg-trained-refseqs.qza
qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads gg-trained-refseqs.qza
--i-reference-taxonomy 2022.10.taxonomy.asv.tsv.qza
--o-classifier gg-trained-classifier.qza
qiime feature-classifier classify-sklearn
--i-classifier gg-trained-classifier.qza
--i-reads repseqs.qza
--o-classification gg-trained-taxonomy.qza

With the same commands as above for SILVA, I could get taxonomic classification up to Level 7.

I also notice that the output file size after "qiime feature-classifier extract-reads" for Greengenes2 is significantly smaller than the one for SILVA (9MB vs 16MB).

I appreciate any input/advice on this.

Thank you!

Best Regards
Stephanie

1 Like

Hi @Stephanie,

For V4 sequences, particularly those based off GTGYCAGCMGCCGCGGTAA, roughly 20M ASVs have already been placed so there is no need for a taxonomy assignment. Filtering against Greengenes2 with filter-features should be sufficient, and then you can pull taxonomy from the matched ASVs. The primary constraint is we only placed fwd reads and mostly at 90, 100, and 150nt as these are the typical focal point of analysis in Qiita. We observe better classification performance, relative to paired shotgun metagenomic preparations, using this phylogenetic taxonomy compared to Naive Bayes. If you'd like to use Naive Bayes though, an already trained classifier for V4 can be obtained from the data resources page.

As a polite note, SILVA does not characterize species. A discussion on that can be found here.

Best,
Daniel

4 Likes

Hi Daniel

Thank you for your reply. Noted on SILVA not characterizing species!

May I know how to use filter-features against Greengenes2 and then pulling the taxonomy from the matched ASVs? I apologize this sounds stupid; all along I have been using the Naive Bayes classifier.

In the meantime, I also tried the pretrained SILVA and Greengenes2 classifier for V4. I still get only taxonomy classification up to Level 2 for Greengenes2 (while that for SILVA is still up to Level 7, although only up to Level 6 is reliable). Why is it so?

Thank you again!

Best Regards
Stephanie

2 Likes

Hi @Stephanie,

An example of the phylogenetic taxonomy can be found here. And the questions being raised are not stupid!! Please ask questions!

That is unexpected with Naive Bayes. Could you share all or a subset of the ASVs being classified?

All the best,
Daniel

2 Likes

Hi Daniel

Thank you for following up.

I have also attached the repseqs file used for the feature-classifier classify-sklearn.
ggpretrainedtaxonomy.qza (100.4 KB)
repseqs.qza (98.8 KB)

In the meantime, I will try the greengenes2 filter-features.

Thanks again!

Best Regards
Stephanie

Hi Daniel

I just tried the greengenes2 filter-features but obtained an error.

qiime greengenes2 filter-features
--i-feature-table trimmed-table.qza
--i-reference 2022.10.taxonomy.asv.nwk.qza
--o-filtered-feature-table gg2filteredfeaturetable.qza

qiime greengenes2 taxonomy-from-table
--i-reference-taxonomy 2022.10.taxonomy.asv.nwk.qza
--i-table gg2filteredfeaturetable.qza
--o-classification gg2filterefeaturetable.taxonomy.qza

Plugin error from greengenes2:

No requested tips found

Debug info has been saved to /tmp/qiime2-q2cli-err-6xbtseel.log

Traceback (most recent call last):
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2cli/commands.py", line 520, in call
results = self._execute_action(
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2cli/commands.py", line 581, in _execute_action
results = action(**arguments)
File "", line 2, in taxonomy_from_table
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/qiime2/sdk/action.py", line 342, in bound_callable
outputs = self.callable_executor(
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/qiime2/sdk/action.py", line 566, in callable_executor
output_views = self._callable(**view_args)
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2_gg2/_methods.py", line 778, in taxonomy_from_table
tree = _load_tree_and_cache(open(str(reference_taxonomy)), features)
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2_gg2/_methods.py", line 543, in _load_tree_and_cache
tree = tree.shear(names & features)
File "bp/_bp.pyx", line 758, in bp._bp.BP.shear
File "bp/_bp.pyx", line 800, in bp._bp.BP.shear
ValueError: No requested tips found

For trimmed_table.qza, I used the output of --o-table of qiime dada2 denoise-paired I did at the very beginning.

Could you help me see what I did wrong?

Thank you.

Best Regards
Stephanie

2 Likes

Hi @Stephanie,

I think there are three issues.

First one is I recommend using the md5 variant of the Greengenes2 files as these sequence IDs are hashed.

Second, it looks very likely that the fwd primer is still on these sequences (GTGCCAGCAGCCGCGGTAA).

Third, we primarily only placed 90, 100 and 150nt fwd reads from V4 as that is the vast majority of what we index in Qiita. It looks like these data may be stitched. Would it be possible to filter against just the fwd read?

Best,
Daniel

┌─[dtmcdonald@there] - [~/Downloads] - [2023-12-12 09:03:16]
└─[0] <> head a8e02612-278d-48e5-962d-643150c8296a/data/dna-sequences.fasta
>5ea85fd574e9d1ae33ed8b1ee6d19d92
GTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGGCAGGCTAGAGTATGGCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCCCCTGGGCCAATACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCGGTAGTCC
>f13a4fec61729a62b000f383c30c316a
GTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGGCAGGCTAGAGTATGGCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCCCCTGGGCCAATACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCCGGTAGTCC
>d39e9ddb3af0def99c5afb50d9f38bc5
GTGCCAGCCGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGGCAGGCTAGAGTATGGCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCCCCTGGGCCAATACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCGGTAGTCC
>0421b3fc80c06bcb09c00e6f9df33394
GTGCCAGCCGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGGCAGGCTAGAGTATGGCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCCCCTGGGCCAATACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCCGGTAGTCC
>d02518a9e8a3d12b0578a1191cf7f523
GTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGGCAGGCTAGAGTATGGCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCCCCTGGGCCAATACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCCAGTAGTCC
2 Likes

Hi Daniel

Thank you for your reply!

I am sorry I am confused. I used the forward and reverse primer sequences to train the SILVA and Greengenes2 classifiers so I thought I had to retain these primer sequences in my reads (I only used cutadapt to trim away the adapters). Is it actually the practice to use adapter- and primer-trimmed sequences on trained classifiers?

For filtering against forward reads, may I clarify that I will need to do the workflow from the beginning (i.e. starting from qiime tools import for single-end)?

Best Regards
Stephanie

1 Like

Hi @Stephanie,

Using qiime feature-classifier extract-reads should, unless I'm misunderstanding something, not include the primers themselves. Whether your actual data include the primers is protocol dependent. Their presence in training will impact the classifier but I doubt by much as the Naive Bayes model is based on k-mer frequencies.

In Qiita, which is the ASV source for Greengenes2, the primary data representation is EMP V4. We've focused on processing the fwd read only as we aren't aware of an independent benchmark that's shown use of the rev read is beneficial for taxonomy, phylogeny, or positive change in effect sizes. Longer isn't implicitly better as you are exposed to more error, the stitching process is unlikely to be perfect, and the additional length isn't guaranteed to provide great improvement in specificity (see for instance fig 1 of Wang et al 2007 AEM.

The EMP V4 primers were designed when instruments yielded shorter reads, and when the rev read was typically lower quality. Part of their design was to capture meaningful information proximal to the 5' end of the fwd read. More information on the design of the primers can be found here. And as two high impact examples of use of short fwd reads, Yatsunenko et al Nature 2012 had either 90 or 100nt (I don't remember right now...) forward reads, and the primary analyses of the Earth Microbiome Project Thompson et al Nature 2017 were at 90nt.

This is in part why we prioritized placing V4 fwd reads. So this is a bit of a long way to get to your second question :slight_smile: but, I recommend removal of the primers, and trimming such that the fwd read is either 150 or 100nt to use the phylogenetic taxonomy.

All the best,
Daniel

2 Likes

Hi Daniel

Thank you for the explanation and the references!

I managed to get the classification down to Level 7 after removing the primers and using only forward reads trimmed to 150nt, with Naive Bayes classifier.

However, I still encountered error using greengenes2 filter-features. I ran the following:
qiime greengenes2 filter-features
--i-feature-table trimmed-table-fwd.qza
--i-reference 2022.10.taxonomy.md5.nwk.qza
--o-filtered-feature-table trimmed-table-fwd-gg.qza
qiime greengenes2 taxonomy-from-table
--i-reference-taxonomy 2022.10.taxonomy.md5.nwk.qza
--i-table trimmed-table-fwd-gg.qza
--o-classification fwd-gg.taxonomy.qza

I visualized the taxonomy file using qiime metadata tabulate and found that the Feature IDs do not correspond to those in my Feature Table (trimmed-table-fwd.qza and repseqs-fwd.qza generated from DADA2) and the confidence for all Feature IDs is 1.0. What did I do wrong this time?

Thank you.

Best Regards
Stephanie

2 Likes

Hi @Stephanie,

I'm glad to hear the NB classifier is working!

Can you share the --i-table trimmed-table-fwd-gg.qza and fwd-gg.taxonomy.qza artifacts with me?

Best,
Daniel

1 Like

Hi Daniel

They are as follows:
trimmed-table-fwd-gg.qza (53.9 KB)
fwd-gg.taxonomy.qza (57.6 KB)

Thank you!

Best Regards
Stephanie

1 Like

Hi @Stephanie,

I'm not sure I understand the problem. It looks like those two files are consistent in their feature representation (see below)?

Regarding confidence, the FeatureData[Taxonomy] type requires a Confidence column. We lack a notion of confidence right now for the placement, and the derived taxonomy is read from the fragment placement to the root of the tree. So we the easiest thing to do was put a pseudo confidence value in of 1.0

Best,
Daniel

In [1]: import pandas as pd, qiime2, biom

In [2]: tab = qiime2.Artifact.load('trimmed-table-fwd-gg.qza').view(biom.Table)

In [3]: tax = qiime2.Artifact.load('fwd-gg.taxonomy.qza').view(pd.DataFrame)

In [4]: tab.shape
Out[4]: (459, 16)

In [5]: tax.shape
Out[5]: (459, 2)

In [6]: set(tab.ids(axis='observation')) == set(tax.index)
Out[6]: True

In [7]: tax.index[:5]
Out[7]: 
Index(['06cbfb08e7809ad415791b00d7bedeaa', '9938c8f22908861f780f8719b75dc0b2',
       '9c9e4f89021d6ce4202a78b32b9a9059', '6c451a691fe0cd84af764e8d9273ae8d',
       'f31d5f6204bca74d1fa5091b2ccd932f'],
      dtype='object', name='Feature ID')
1 Like

Hi Daniel

Thank you for your reply and for the explanation on the confidence value. I apologize for not being clear.

When I used taxa barplot, I encountered error.

qiime taxa barplot
--i-table trimmed-table-fwd.qza
--i-taxonomy fwd-gg.taxonomy.qza
--m-metadata-file metadata.txt
--o-visualization taxa-barplot-fwd.qzv

Plugin error from taxa: Feature IDs found in the table are missing from the taxonomy. (A long list of Feature IDs follows...)

When I did a quick comparison of Feature IDs in trimmed-table-fwd.qza and fwd-gg.taxonomy.qza, some could not be found in the taxonomy file.

Thanks again!

Best Regards
Stephanie

Hi @Stephanie,

It looks like the file previously sent was named trimmed-table-fwd-gg.qza whereas the command run used trimmed-table-fwd.qza. Is it possible the wrong input table was used to construct the barplot?

Best,
Daniel

1 Like

Hi Daniel

I understand now. I should use the output of greengenes2 filter-features (instead of the output of --o-table of DADA2) for --i-table in qiime taxa barplot. It is now solved!

Thank you so much for your help!

Best Regards
Stephanie

1 Like

That's great to hear!!

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.