I previously used QIIME2 2023.5 and 2023.7 q2-feature-classifier on MiSeq v3-v4 data and was able to get taxonomic classification down to Level 7 using both SILVA 138 and Greengenes2.
However, this time I encountered problem when I trained the Greengenes2 classifier. I am using QIIME2 2023.9 Amplicon to analyze an existing MiSeq v4 data.
With the same commands as above for SILVA, I could get taxonomic classification up to Level 7.
I also notice that the output file size after "qiime feature-classifier extract-reads" for Greengenes2 is significantly smaller than the one for SILVA (9MB vs 16MB).
For V4 sequences, particularly those based off GTGYCAGCMGCCGCGGTAA, roughly 20M ASVs have already been placed so there is no need for a taxonomy assignment. Filtering against Greengenes2 with filter-features should be sufficient, and then you can pull taxonomy from the matched ASVs. The primary constraint is we only placed fwd reads and mostly at 90, 100, and 150nt as these are the typical focal point of analysis in Qiita. We observe better classification performance, relative to paired shotgun metagenomic preparations, using this phylogenetic taxonomy compared to Naive Bayes. If you'd like to use Naive Bayes though, an already trained classifier for V4 can be obtained from the data resources page.
As a polite note, SILVA does not characterize species. A discussion on that can be found here.
Thank you for your reply. Noted on SILVA not characterizing species!
May I know how to use filter-features against Greengenes2 and then pulling the taxonomy from the matched ASVs? I apologize this sounds stupid; all along I have been using the Naive Bayes classifier.
In the meantime, I also tried the pretrained SILVA and Greengenes2 classifier for V4. I still get only taxonomy classification up to Level 2 for Greengenes2 (while that for SILVA is still up to Level 7, although only up to Level 6 is reliable). Why is it so?
Debug info has been saved to /tmp/qiime2-q2cli-err-6xbtseel.log
Traceback (most recent call last):
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2cli/commands.py", line 520, in call
results = self._execute_action(
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2cli/commands.py", line 581, in _execute_action
results = action(**arguments)
File "", line 2, in taxonomy_from_table
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/qiime2/sdk/action.py", line 342, in bound_callable
outputs = self.callable_executor(
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/qiime2/sdk/action.py", line 566, in callable_executor
output_views = self._callable(**view_args)
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2_gg2/_methods.py", line 778, in taxonomy_from_table
tree = _load_tree_and_cache(open(str(reference_taxonomy)), features)
File "/home/stephkoo/mambaforge/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2_gg2/_methods.py", line 543, in _load_tree_and_cache
tree = tree.shear(names & features)
File "bp/_bp.pyx", line 758, in bp._bp.BP.shear
File "bp/_bp.pyx", line 800, in bp._bp.BP.shear
ValueError: No requested tips found
For trimmed_table.qza, I used the output of --o-table of qiime dada2 denoise-paired I did at the very beginning.
First one is I recommend using the md5 variant of the Greengenes2 files as these sequence IDs are hashed.
Second, it looks very likely that the fwd primer is still on these sequences (GTGCCAGCAGCCGCGGTAA).
Third, we primarily only placed 90, 100 and 150nt fwd reads from V4 as that is the vast majority of what we index in Qiita. It looks like these data may be stitched. Would it be possible to filter against just the fwd read?
I am sorry I am confused. I used the forward and reverse primer sequences to train the SILVA and Greengenes2 classifiers so I thought I had to retain these primer sequences in my reads (I only used cutadapt to trim away the adapters). Is it actually the practice to use adapter- and primer-trimmed sequences on trained classifiers?
For filtering against forward reads, may I clarify that I will need to do the workflow from the beginning (i.e. starting from qiime tools import for single-end)?
Using qiime feature-classifier extract-reads should, unless I'm misunderstanding something, not include the primers themselves. Whether your actual data include the primers is protocol dependent. Their presence in training will impact the classifier but I doubt by much as the Naive Bayes model is based on k-mer frequencies.
In Qiita, which is the ASV source for Greengenes2, the primary data representation is EMP V4. We've focused on processing the fwd read only as we aren't aware of an independent benchmark that's shown use of the rev read is beneficial for taxonomy, phylogeny, or positive change in effect sizes. Longer isn't implicitly better as you are exposed to more error, the stitching process is unlikely to be perfect, and the additional length isn't guaranteed to provide great improvement in specificity (see for instance fig 1 of Wang et al 2007 AEM.
The EMP V4 primers were designed when instruments yielded shorter reads, and when the rev read was typically lower quality. Part of their design was to capture meaningful information proximal to the 5' end of the fwd read. More information on the design of the primers can be found here. And as two high impact examples of use of short fwd reads, Yatsunenko et al Nature 2012 had either 90 or 100nt (I don't remember right now...) forward reads, and the primary analyses of the Earth Microbiome Project Thompson et al Nature 2017 were at 90nt.
This is in part why we prioritized placing V4 fwd reads. So this is a bit of a long way to get to your second question but, I recommend removal of the primers, and trimming such that the fwd read is either 150 or 100nt to use the phylogenetic taxonomy.
I managed to get the classification down to Level 7 after removing the primers and using only forward reads trimmed to 150nt, with Naive Bayes classifier.
However, I still encountered error using greengenes2 filter-features. I ran the following:
qiime greengenes2 filter-features
--i-feature-table trimmed-table-fwd.qza
--i-reference 2022.10.taxonomy.md5.nwk.qza
--o-filtered-feature-table trimmed-table-fwd-gg.qza
qiime greengenes2 taxonomy-from-table
--i-reference-taxonomy 2022.10.taxonomy.md5.nwk.qza
--i-table trimmed-table-fwd-gg.qza
--o-classification fwd-gg.taxonomy.qza
I visualized the taxonomy file using qiime metadata tabulate and found that the Feature IDs do not correspond to those in my Feature Table (trimmed-table-fwd.qza and repseqs-fwd.qza generated from DADA2) and the confidence for all Feature IDs is 1.0. What did I do wrong this time?
I'm not sure I understand the problem. It looks like those two files are consistent in their feature representation (see below)?
Regarding confidence, the FeatureData[Taxonomy] type requires a Confidence column. We lack a notion of confidence right now for the placement, and the derived taxonomy is read from the fragment placement to the root of the tree. So we the easiest thing to do was put a pseudo confidence value in of 1.0
Best,
Daniel
In [1]: import pandas as pd, qiime2, biom
In [2]: tab = qiime2.Artifact.load('trimmed-table-fwd-gg.qza').view(biom.Table)
In [3]: tax = qiime2.Artifact.load('fwd-gg.taxonomy.qza').view(pd.DataFrame)
In [4]: tab.shape
Out[4]: (459, 16)
In [5]: tax.shape
Out[5]: (459, 2)
In [6]: set(tab.ids(axis='observation')) == set(tax.index)
Out[6]: True
In [7]: tax.index[:5]
Out[7]:
Index(['06cbfb08e7809ad415791b00d7bedeaa', '9938c8f22908861f780f8719b75dc0b2',
'9c9e4f89021d6ce4202a78b32b9a9059', '6c451a691fe0cd84af764e8d9273ae8d',
'f31d5f6204bca74d1fa5091b2ccd932f'],
dtype='object', name='Feature ID')
It looks like the file previously sent was named trimmed-table-fwd-gg.qza whereas the command run used trimmed-table-fwd.qza. Is it possible the wrong input table was used to construct the barplot?
I understand now. I should use the output of greengenes2 filter-features (instead of the output of --o-table of DADA2) for --i-table in qiime taxa barplot. It is now solved!