Wrong 'taxonomy.qzv' file

Dchung · July 27, 2018, 11:41am

Hi all,
I am having trouble testing the feature classifier that I generated by following the "Training feature classifiers with q2-feature-classifier" tutorial (QIIME 2 2018.6.0 documentation).

The V3-V4 region of the 16S sequences was extracted from the SILVA full-length database using the forward/reverse primer sequences given by the sequencing facility. Below is the command I used.

qiime feature-classifier extract-reads \
  --i-sequences silva_132_97_16S.qza \
  --p-f-primer CCTAYGGGRBGCASCAG \
  --p-r-primer GGACTACNNGGGTATCTAAT \
  --o-reads ref-seqs_V3V4.qza

Classifier was trained as below:
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs_V3V4.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

And the classifier was tested using my 16S dataset of 17 samples (testset):
qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

When done, I viewed the taxonomy.qzv file and it gave me the same taxon for all 1606 of my feature IDs:
D_0__Archaea;D_1__Nanoarchaeaeota;D_2__Nanoarchaeia;D_3__Nanoarchaeales;D_4__Nanoarchaeaceae;D_5__Nanoarchaeum;D_6__hot springs metagenome

What may have caused this?

I've checked whether the classifier.qza file was wrong by running it on the rep-set.qza file from the "Moving Pictures" tutorial, and it seemed fine (various taxa were listed).
I've also visualized the rep-set.qzv, table.qzv and denoising-stats.qzv files, but none of them shows only a single feature.

Thanks in advance for the help.

Nicholas_Bokulich · July 27, 2018, 12:28pm

Hi @Dchung,
Sorry to hear you've been having trouble with this!

Honestly, based on your description, it sounds like that one taxonomic assignment might just be all that's in your samples (though it is suspicious). Here's the key: you ran the classifier on the Moving Pictures rep-set and got a variety of taxa.

Thanks for doing that test! So we know the classifier works...

You could also try using one of the pre-trained full-length 16S classifiers distributed on the QIIME 2 website to get a "second opinion". If you get a more satisfying result, I'd say just use that (trimming to your primers only gives a slight boost in performance anyway).

It is possible to have different features receive the same taxonomy classification. Distinct features are just unique sequences here, and do not necessarily have different taxonomic affiliations. Though 1606 features receiving the same classification is quite suspicious.

If using one of the pre-trained classifiers does not fix your issue, would you mind sharing the following files?

taxonomy.qza
rep-set.qza + qzv
table.qzv

Also please try qiime taxa barplot and share that file.

Thanks!

Dchung · July 30, 2018, 1:19pm

Hi Nicholas,
Thank you for the reply.
I've run it with the pre-trained full-length 16S classifier (SILVA) and got a more satisfying result.

I moved on with the pre-trained one, but I'm still a bit bothered by my V3V4-extracted classifier, because it gave exactly the same taxon for all 4802 feature IDs in another dataset I have.
So I'm guessing there's something fishy with the classifier.

I ran qiime taxa barplot and it is attached below:
taxa-bar-plots_prefiltered_ASQ.qzv (401.9 KB)

Looking at the barplot results, I've noticed another problem; the samples with the name "AJM" showed almost no variation in the taxonomy. The table.qzv file showed very low sequence depths for those samples.

It was difficult for me to understand why this is, because I saw fairly similar barplots across all samples when I ran the data through QIIME 1.

Could this difference come from the different properties of OTUs and ASVs?
Or is my table.qza file problematic?

Thanks,

Nicholas_Bokulich · July 30, 2018, 10:13pm

I'm not really sure what's going wrong with the V3V4 extracted classifiers; it sounds like something went wrong during the extraction step. If that same taxon is being assigned to all features in another dataset, my guess is that only that taxon may be present in the sequences that you are using to train the classifier! Others may have been dropped during extraction (e.g., because they do not match your primers). In any case, it may just be best to use the pre-trained classifiers and move on.

Low sequencing depth would do that. Only a few species-level taxa are detected in a few of those samples, which is characteristic of very low sequencing depth. The sequences that are detected could even be cross-contaminants, so I would recommend removing those samples from the analysis.

Are these the same exact data? Same exact sequence depth?

Not different properties of OTUs vs. ASVs, no. But different properties of denoising vs. OTU clustering methods, yes, possibly. You are probably losing lots of reads in those samples during denoising (possibly merging issues, possibly noisy sequence) that are not being removed during OTU clustering. You should go back and review your denoising results to see how many sequences are being filtered out in those samples (if you have questions about that, please open a new topic).

I hope that helps!

Dchung · July 31, 2018, 12:54pm

Yes, they were the exact same files; I ran them through the two different pipelines.

I'll create another topic for this question.

Thanks much for the help!

zArctander · August 28, 2018, 1:58pm

I ran into the same problem: all my features received the same taxon when I used a classifier I trained myself. In my case, I used the SILVA 128 database, and the repeated taxon was also an archaeon. I don't know the mechanism, but after deleting the repeated taxon (for you, it's D_0__Archaea;...;D_6__hot springs metagenome) from ref-seqs_V3V4.qza and ref-taxonomy.qza, the classifier works normally.
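
For anyone who wants to try the same workaround, here is a minimal sketch of the filtering step, assuming the reference reads and taxonomy have already been exported to plain FASTA and headerless TSV. The function names and the second record are hypothetical and only there for illustration; the first record is the one reported in this thread.

```python
from typing import Dict, Iterable, Set, Tuple


def read_taxonomy(lines: Iterable[str]) -> Dict[str, str]:
    """Parse headerless 'feature ID<TAB>taxonomy' lines into a dict."""
    taxonomy = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        feature_id, taxon = line.split("\t", 1)
        taxonomy[feature_id] = taxon.strip()
    return taxonomy


def drop_taxon(
    sequences: Dict[str, str], taxonomy: Dict[str, str], unwanted: str
) -> Tuple[Dict[str, str], Dict[str, str], Set[str]]:
    """Remove every feature whose taxonomy string equals `unwanted`.

    The same IDs are removed from both tables so they stay in sync.
    """
    removed = {fid for fid, taxon in taxonomy.items() if taxon == unwanted}
    kept_seqs = {fid: s for fid, s in sequences.items() if fid not in removed}
    kept_tax = {fid: t for fid, t in taxonomy.items() if fid not in removed}
    return kept_seqs, kept_tax, removed


HOT_SPRINGS = (
    "D_0__Archaea;D_1__Nanoarchaeaeota;D_2__Nanoarchaeia;"
    "D_3__Nanoarchaeales;D_4__Nanoarchaeaceae;D_5__Nanoarchaeum;"
    "D_6__hot springs metagenome"
)

taxonomy = read_taxonomy([
    "LAFJ01000960.5.938\t" + HOT_SPRINGS,
    "hypothetical_id_1\tD_0__Bacteria",  # made-up record for illustration
])
sequences = {
    "LAFJ01000960.5.938": "AAAG",
    "hypothetical_id_1": "ACGT" * 100,  # made-up sequence for illustration
}

kept_seqs, kept_tax, removed = drop_taxon(sequences, taxonomy, HOT_SPRINGS)
print("removed:", sorted(removed))
```

The filtered FASTA and TSV would then need to be written back out and re-imported with qiime tools import before retraining the classifier.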

zyjvivien · August 29, 2018, 2:37am

I also ran into the same problem: all my features received the same taxon when I used a V3-V4 extracted classifier. In my case, I used the SILVA 132 database, and the repeated taxon is Archaea;Nanoarchaeaeota;Nanoarchaeia;Nanoarchaeales;Nanoarchaeaceae;Nanoarchaeum;hot springs metagenome. I then ran the same data with the pre-trained full-length 16S classifier and got a satisfying result. Maybe I should try deleting the repeated taxon from ref-seqs_V3V4.qza and ref-taxonomy.qza and run the same data again to compare the results.

Cheng_Li · August 29, 2018, 5:49pm

Hello everyone,

I am new to QIIME 2 and have been practicing on my own data by following the Moving Pictures tutorial.

However, in the resulting taxonomy table, every feature is assigned to the same archaeon, which is completely different from what I get with the dada2 package in R:

D_0__Archaea;D_1__Nanoarchaeaeota;D_2__Nanoarchaeia;D_3__Nanoarchaeales;D_4__Nanoarchaeaceae;D_5__Nanoarchaeum;D_6__hot springs metagenome.

My data are from coastal sediment.

Here is how I trimmed and denoised my sequences:

$ qiime dada2 denoise-paired \
  --i-demultiplexed-seqs D60G1a-demux.qza \
  --p-trim-left-f 23 \
  --p-trim-left-r 9 \
  --p-trunc-len-f 295 \
  --p-trunc-len-r 240 \
  --o-table D60G1a-table.qza \
  --o-representative-sequences D60G1a-rep_seqs.qza \
  --o-denoising-stats D60G1a-denoising_stats.qza

Here is how I imported the reference and trained the classifier:

$ qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path ~/EAGCB/SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna \
  --output-path silva132_99

$ qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --source-format HeaderlessTSVTaxonomyFormat \
  --input-path ~/EAGCB/SILVA_132_QIIME_release/taxonomy/16S_only/99/taxonomy_7_levels.txt \
  --output-path silva132_99_ref_taxonomy

$ qiime feature-classifier extract-reads \
  --i-sequences silva132_99.qza \
  --p-f-primer CCTACGGGNGGCWGCA \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --o-reads ref_seqs

$ qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref_seqs.qza \
  --i-reference-taxonomy silva132_99_ref_taxonomy.qza \
  --o-classifier classifier.qza

$ qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads D60G1a-rep_seqs.qza \
  --o-classification taxonomy.qza

$ qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy

$ qiime tools view taxonomy.qzv

Nicholas_Bokulich · August 29, 2018, 6:27pm

Hi @Cheng_Li,
I reassigned your post to this topic, since it is an identical error to these other users'. Thank you for posting extensive details on your workflow!

Thank you @zyjvivien and @zArctander for reporting! And thank you @zArctander for posting your solution.

This is very bizarre and I am not really sure what's going on. There are clearly a few commonalities (we have never seen this error before, and not with earlier versions of SILVA, etc). All of you used:

SILVA 132
V3-V4 primers (it looks like the same sites, but different degeneracy levels)
hot springs metagenome is the problem every time
evidently, removing hot springs metagenome solves the issue

Does someone want to post the sequence for hot springs metagenome here? Perhaps it has a very large number of Ns and hence all sequences really are choosing that as the top hit (seems unlikely but still, worth a look).

Thanks!

Cheng_Li · August 29, 2018, 7:01pm

Thanks, Nicholas! Looking forward to resolving this problem.

William · August 30, 2018, 10:20am

I think the issue is an aberrantly short sequence being created during the extract-reads step.

There is only one taxonomy string that matches the one in question:
LAFJ01000960.5.938 D_0__Archaea;D_1__Nanoarchaeaeota;D_2__Nanoarchaeia;D_3__Nanoarchaeales;D_4__Nanoarchaeaceae;D_5__Nanoarchaeum;D_6__hot springs metagenome

And here is the 4-nucleotide sequence that is present for it in my extract-reads artifact file (the original full-length read doesn't have any long stretches of N characters, so it appears to be an odd primer match location for the V3V4 primers):

>LAFJ01000960.5.938
AAAG
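
To look for other aberrant reads in an extracted reference, a small length check over the exported FASTA can help. This is only a sketch under stated assumptions: it is plain Python, not part of q2-feature-classifier; the 300 and 600 nt bounds are hypothetical values for a V3-V4 amplicon; the third record is made up; and the 1878-nt sequence is a placeholder that only reproduces a length, not real data.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence ID, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the ID, i.e. the text before the first whitespace.
            fields = line[1:].split()
            seq_id, chunks = (fields[0] if fields else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def split_by_length(
    records: Iterable[Tuple[str, str]], min_len: int, max_len: int
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate reads inside [min_len, max_len] from reads outside it."""
    kept, flagged = {}, {}
    for seq_id, seq in records:
        target = kept if min_len <= len(seq) <= max_len else flagged
        target[seq_id] = seq
    return kept, flagged


example = [
    ">LAFJ01000960.5.938",
    "AAAG",
    ">AB302407.1.2962",
    "N" * 1878,  # placeholder with the right length only, not the real read
    ">hypothetical_typical_read",
    "ACGT" * 105,  # made-up 420-nt read
]

# With a real file: records = parse_fasta(open("dna-sequences.fasta"))
kept, flagged = split_by_length(parse_fasta(example), min_len=300, max_len=600)
for seq_id, seq in flagged.items():
    print(seq_id, len(seq))
```

Any read that ends up in flagged is a candidate for closer inspection before the classifier is trained.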

I tried rerunning the qiime feature-classifier extract-reads command with a stricter setting of --p-identity 0.90 (default is 0.80). With this setting, the >LAFJ01000960.5.938 sequence no longer shows up in the artifact file (the total reads went from ~735K to ~724K). This may be a potential solution. Can you try this approach and see if it resolves the taxonomy issues?

Perhaps extract-reads needs a minimum/maximum length setting? There's also a read of 1878 bp, >AB302407.1.2962, in the 0.80 output artifact. An alternative approach I've taken to extracting reads between priming sites is to take the mode of the position where the primer matches in the (non-destructive!) alignment, slice out the region between the primer binding sites, and degap these reads. This avoids issues with strange binding sites for poorly matching primers. Of course, issues with the alignment itself could introduce other problems, so there may not be a perfect solution.
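
A rough sketch of the alignment-based idea, for illustration only. It is not the code referred to above: it assumes exact (mismatch-free) primer matches expanded from IUPAC codes, gaps written as "-" or ".", and a toy alignment with made-up primers and sequences. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")
GAP_CHARS = "-."


def primer_regex(primer: str) -> "re.Pattern[str]":
    """Expand a degenerate DNA primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def slice_by_modal_primer_sites(
    alignment: Dict[str, str], f_primer: str, r_primer: str
) -> Dict[str, str]:
    """Cut every aligned sequence at the most common primer columns."""
    f_re = primer_regex(f_primer)
    r_re = primer_regex(reverse_complement(r_primer))
    starts, ends = Counter(), Counter()
    for aligned in alignment.values():
        # Alignment column of each non-gap character, in order.
        cols = [i for i, ch in enumerate(aligned) if ch not in GAP_CHARS]
        seq = "".join(aligned[c] for c in cols).upper().replace("U", "T")
        f_hit, r_hit = f_re.search(seq), r_re.search(seq)
        if f_hit:
            starts[cols[f_hit.end() - 1] + 1] += 1  # column after fwd primer
        if r_hit:
            ends[cols[r_hit.start()]] += 1  # first column of rev primer site
    if not starts or not ends:
        raise ValueError("no primer match found in the alignment")
    start = starts.most_common(1)[0][0]
    end = ends.most_common(1)[0][0]
    # The same columns are applied to every sequence, then gaps are removed.
    return {
        seq_id: "".join(ch for ch in aligned[start:end] if ch not in GAP_CHARS)
        for seq_id, aligned in alignment.items()
    }


toy_alignment = {  # made-up sequences; s3 matches neither primer
    "s1": "ACGT-GGCC-TTGA",
    "s2": "ACGTAGGC--TTGA",
    "s3": "ACCT-GGAC-TAGA",
}
amplicons = slice_by_modal_primer_sites(toy_alignment, "ACGT", "TCAA")
print(amplicons)
```

Because the cut columns come from the majority of sequences, a reference whose own primer sites match poorly (s3 here) is still sliced at the same alignment positions instead of at a spurious match.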

Nicholas_Bokulich · August 30, 2018, 4:42pm

Thanks @William!

I think that (--p-identity 0.90) sounds like the best solution at the moment. We may consider changing that default setting if this is a persistent problem.

Thank you for the suggestion of a minimum/maximum length setting! I have raised an issue to track this.

I like the alignment-based idea. If you are interested in contributing your code to q2-feature-classifier, you could add that as another method, and I would be happy to help.

Cheng_Li · August 30, 2018, 4:45pm

Hi @Nicholas_Bokulich ,

I have tried rerunning my workflow using SILVA 128. A similar problem showed up as well.

Thanks @William, though I do not quite understand :/. I will try rerunning my workflow with "--p-identity 0.90". I had wondered whether feature-classifier extract-reads might be the step that went wrong, because rep-seq.qzv looks quite normal: when I BLASTed the sequences at NCBI, they came up with reasonable matches, like 16S rRNA of marine bacteria.

Nicholas_Bokulich · August 30, 2018, 4:48pm

Re-run extract-reads like so:

qiime feature-classifier extract-reads \
  --i-sequences silva132_99.qza \
  --p-f-primer CCTACGGGNGGCWGCA \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --p-identity 0.9 \
  --o-reads ref_seqs

That should fix it!