barplot from de-novo OTUs: ValueError: Feature IDs found in the table are missing from the taxonomy

Hello everybody!
I’m trying different strategies for clustering the features in my dataset: closed-reference, open-reference and de novo.
I previously imported the latest version of the Greengenes database and created the required taxonomy.qza file from it.

I filtered chimeras and then wanted to create a barplot for each clustering method, to compare results and perform taxonomic analysis.

Everything worked fine for the closed-reference analysis, but not for the de novo and open-reference ones. For the de novo one, I used the following command:

qiime taxa barplot --i-table uchime-dn-out_cr_new/table-nonchimeric-wo-borderline.qza --i-taxonomy gg_13_8/gg_13_8_otus/taxonomy.qza --m-metadata-file metadata.tsv --o-visualization taxa-bar-plots_cr_concloroplasto.qzv

And this is what I got: ValueError: Feature IDs found in the table are missing from the taxonomy.
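For context, this error usually means the table and the taxonomy use different sets of feature IDs. Closed-reference clustering relabels features with the reference (Greengenes) OTU IDs, so the reference taxonomy matches; de novo and open-reference clusters keep IDs derived from your own sequences, which the reference taxonomy does not contain. A minimal sketch for listing the offending IDs follows. It assumes you have already exported the table's feature IDs to a plain text file (one ID per line) and the taxonomy to a TSV file with a single header row; the function name `missing_ids` and the file names in the usage line are hypothetical.

```shell
# missing_ids TABLE_IDS TAXONOMY_TSV
# Prints every feature ID listed in TABLE_IDS (one ID per line) that has
# no row in TAXONOMY_TSV (tab-separated, first column = feature ID,
# first line = header). Both inputs are assumed, not produced here.
missing_ids() {
  table_ids=$1
  taxonomy_tsv=$2
  tmp_table=$(mktemp) || return 1
  tmp_tax=$(mktemp) || return 1

  # comm needs both inputs sorted the same way
  sort -u "$table_ids" > "$tmp_table"
  # skip the header line, keep only the ID column
  tail -n +2 "$taxonomy_tsv" | cut -f1 | sort -u > "$tmp_tax"

  # -23 keeps lines that appear only in the first file
  comm -23 "$tmp_table" "$tmp_tax"

  rm -f "$tmp_table" "$tmp_tax"
}
```

Usage would look like `missing_ids table-ids.txt taxonomy.tsv`. If every ID of a de novo table is printed, the taxonomy file was never built for those features.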

Here is the whole log.

I think I’m missing something very basic here, can anyone help? It’s an open reference/de novo analysis but the script still asks me for some taxonomy file?
Should I use a different one?

Thanks in advance!

I’m running QIIME2 2019.7

Hi @Sparkle,

It looks like you imported the Greengenes database and are using the resulting .qza artifact directly for the taxa barplot.
If so, the missing piece is the taxonomy assignment of your de novo clusters. Please refer to the ‘Taxonomy assignment’ paragraph in the ‘Moving Pictures’ tutorial for more info, but briefly you should:
a) train a classifier on your reference database for the 16S region you are investigating
b) assign taxonomy to your de novo clusters using classify-sklearn
c) use the result of (b) in your ‘qiime taxa barplot’ command
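Steps (b) and (c) can be sketched as follows. This is a rough outline rather than a definitive recipe: it assumes a classifier already trained as in step (a), and representative sequences that come from the same de novo clustering run as the table. The wrapper name `classify_and_plot` and the file names in the usage line are hypothetical.

```shell
# classify_and_plot REP_SEQS TABLE CLASSIFIER METADATA OUT_TAXONOMY OUT_QZV
# Assigns taxonomy to your own representative sequences, then feeds that
# taxonomy (not the reference one) to the barplot.
classify_and_plot() {
  rep_seqs=$1
  table=$2
  classifier=$3
  metadata=$4
  out_taxonomy=$5
  out_qzv=$6

  # step (b): classify the de novo representative sequences
  qiime feature-classifier classify-sklearn \
    --i-classifier "$classifier" \
    --i-reads "$rep_seqs" \
    --o-classification "$out_taxonomy" &&
  # step (c): the barplot now finds every table feature ID in the taxonomy
  qiime taxa barplot \
    --i-table "$table" \
    --i-taxonomy "$out_taxonomy" \
    --m-metadata-file "$metadata" \
    --o-visualization "$out_qzv"
}
```

A call might look like `classify_and_plot rep-seqs-dn.qza table-dn.qza classifier.qza metadata.tsv taxonomy-dn.qza taxa-bar-plots-dn.qzv`.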

Hope this makes sense!
Luca

This is exactly what I wanted to know, thank you! I had figured out the problem was the absence of taxonomy assignments, but didn’t know how to provide a de-novo taxonomy.

I’ll check it out and try what is suggested in the tutorial now, thanks!

Sorry for answering again, but I can’t figure out from the tutorial how to perform step b).

I retrieved these key instructions from the tutorial, but I can’t see where my own de novo clustered reads are supposed to go, because the commands seem to take as input files generated from my Greengenes database (classifier.qza and rep-seqs.qza), not from my own dataset.

Maybe I have to run it with the rep-seqs output of my de novo clustering as input (--i-reads)?

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 85_otus.fasta \
  --output-path 85_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 85_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza

qiime feature-classifier extract-reads \
  --i-sequences 85_otus.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --p-trunc-len 120 \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

Thanks in advance!

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

Here, the classifier is the one you trained on Greengenes.
“--i-reads” takes your own representative sequences (from DADA2 or denoise).

Specific steps:

Here you import the Greengenes 13_8 reference sequences clustered at 85% identity (the full-length 16S FASTA):

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 85_otus.fasta \
  --output-path 85_otus.qza

Here you import the taxonomy assigned to those 85% OTUs (kingdom, phylum, etc.):

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 85_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza

Here you use your forward and reverse primers to extract the matching region from the full-length 16S Greengenes sequences, keeping everything between the two primers. Essentially you are trimming the 85% reference OTUs down to YOUR particular V region of the 16S gene.

qiime feature-classifier extract-reads \
  --i-sequences 85_otus.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --p-trunc-len 120 \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads ref-seqs.qza

Here you take those extracted sequences, together with the taxonomy assigned to each of them, and train a classifier (classifier.qza).

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

Thank you! Is it wrong then to use the rep-seqs resulting from de-novo clustering?

No, I don’t think so. Isn’t DADA2 considered a de novo clustering method? It’s done without reference to any sequences and generates ASVs. Ben

Yes, I had figured that out; I was doubtful about sklearn in particular!
Thank you for the detailed answer, though!

https://benjjneb.github.io/dada2/SotA.html

Benjjneb considers DADA2 a de novo method. If you are picking against a reference database (say, RDP or SILVA) using the VSEARCH method built into QIIME2, it would be considered a closed- or open-reference clustering method. Since DADA2 uses no reference, and taxonomy is only assigned afterwards, it should be considered de novo.

But what do I know? It’s all Latin to me. Ben

I’m a total beginner… but I used dada2 for denoising, and then vsearch for clustering.

More specifically, here are some specific steps of my pipeline (including closed-reference analysis)

Denoising and quality filtering with dada2

qiime dada2 denoise-pyro --i-demultiplexed-seqs sequenze/single_end2.qza --p-trunc-len 0 --output-dir denoise --verbose --o-denoising-stats denoise/denoising_stats.qza --o-representative-sequences denoise/representative_sequences.qza --o-table denoise/table.qza

Dereplicating my sequences

qiime vsearch dereplicate-sequences --i-sequences sequenze/single_end2.qza --o-dereplicated-table denoise/dereplicated_table.qza --o-dereplicated-sequences denoise/dereplicated_sequences.qza

Training my database

qiime tools import --type 'FeatureData[Sequence]' --input-path rep_set/99_otus.fasta --output-path 99_otus.qza

qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --input-path taxonomy/99_otu_taxonomy.txt --output-path ref-taxonomy.qza

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads 99_otus.qza --i-reference-taxonomy ref-taxonomy.qza --o-classifier classifier.qza

qiime feature-classifier classify-sklearn --i-classifier classifier.qza --i-reads 99_otus.qza --o-classification taxonomy.qza

qiime metadata tabulate --m-input-file taxonomy.qza --o-visualization taxonomy.qzv

De-novo clustering

qiime vsearch cluster-features-closed-reference --i-table denoise/dereplicated_table.qza --i-sequences denoise/dereplicated_sequences.qza --i-reference-sequences gg_13_8/gg_13_8_otus/99_otus.qza --p-perc-identity 0.99 --o-clustered-table table-cr-99_new.qza --o-clustered-sequences rep-seqs-cr-99_new.qza --o-unmatched-sequences unmatched-cr-99_new.qza --p-strand both

As you can see, dada2 generated

denoise/representative_sequences.qza

And vsearch rep-seqs-cr-99_new.qza

You said to use the first one, but I thought the second one was a better choice. Not entirely sure, though.

OK, VSEARCH is actually a clustering method. Here I think you’re re-clustering the DADA2/denoise output at 99% identity against the 99_otus FASTA sequences. I used to do it this way in order to filter out anything that didn’t match 16S, but I think you’re taking your DADA2 ASVs (removed clusters) and putting them through a closed-reference picking/clustering method.

qiime feature-classifier classify-sklearn --i-classifier classifier.qza --i-reads 99_otus.qza --o-classification taxonomy.qza

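This is interesting: you are actually classifying 99_otus.qza itself and outputting its taxonomy. Replace 99_otus.qza with the file you wrote with this option:

--o-representative-sequences denoise/representative_sequences.qza

from your DADA2/denoise step, i.e. pass denoise/representative_sequences.qza to --i-reads.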

Here’s my workflow:

Raw fastq
V
Import into QZA
V
Demux files
V
DADA2/Denoise
V
Tree creation
V
Taxonomy assignments < Make a classifier for 97%/99%
V
Taxa bar plots

Your pipeline is something like this:

Raw fastq
V
Import into QZA
V
Demux files > Closed VSEARCH > Taxonomy assignments > Taxa bar plots
V
DADA2/Denoise
V
Tree creation
V
Taxonomy assignments < Make a classifier for 97%/99%
V
Taxa bar plots

You could potentially do it both ways and compare the DADA2 output with the VSEARCH clustering (VSEARCH as you ran it should not be considered a de novo clustering method; it is a separate clustering step), but as you can see, you’re currently doing both.

Ben

Hi @Sparkle,

I’m not sure about your pipeline. I am worried that you are losing the total count of the sequences in the final abundance table because of the dereplication step (but I admit I never tried this pipeline, so I may be wrong …).
I would follow @ben’s suggestion and avoid the ‘dereplication’ and ‘qiime vsearch cluster-features-closed-reference’ steps.

Luca

@Sparkle

I also think you’re confused about which files you get from each clustering method.

DADA2:

You can use the rep-seq and table for making your tree and taxonomy
No downstream modifications are necessary
With DADA2 you have to further build a classifier to assign taxonomy (you do not necessarily need to do this with VSEARCH closed-reference picking, as the database for closed picking is essentially the entire 16S gene from Greengenes with taxonomy already assigned)

VSEARCH:

Several different methods, INCLUDING de novo clustering and closed-reference clustering
You take your DADA2/denoise output and cluster it at a chosen percent identity, either de novo (building new, unique clusters from your own sequences at that identity threshold)
OR closed-reference (which is what you did, against a database)
You DO NOT NEED TO DO THIS STEP IF DADA2 RAN OK
I used this in the past to get rid of eukaryotic DNA or unclassified sequences
You would use your “hits.qza” and “table.qza” for any downstream work
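For comparison with the closed-reference command earlier in the thread, the de novo variant can be sketched like this. It is a sketch under assumptions, not a recommendation: the wrapper name `cluster_de_novo`, the 0.97 threshold and the output file names in the usage line are hypothetical, and the inputs are assumed to be the DADA2 table and representative sequences.

```shell
# cluster_de_novo TABLE SEQS IDENTITY OUT_TABLE OUT_SEQS
# Clusters already denoised features against each other (no reference
# database involved) at the given identity threshold.
cluster_de_novo() {
  table=$1
  seqs=$2
  identity=$3
  out_table=$4
  out_seqs=$5

  qiime vsearch cluster-features-de-novo \
    --i-table "$table" \
    --i-sequences "$seqs" \
    --p-perc-identity "$identity" \
    --o-clustered-table "$out_table" \
    --o-clustered-sequences "$out_seqs"
}
```

For example: `cluster_de_novo denoise/table.qza denoise/representative_sequences.qza 0.97 table-dn-97.qza rep-seqs-dn-97.qza`. The clustered sequences (here rep-seqs-dn-97.qza) keep IDs from your own data, so they still need classify-sklearn before a barplot.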

Essentially:

DADA2 output itself can be used downstream OR

You can put it through another VSEARCH step to further cluster your data (this is optional, but analogous to USEARCH in QIIME1)

Ben

In the QIIME2 overview tutorial (https://docs.qiime2.org/2019.7/tutorials/overview/) I read that it is recommended to dereplicate sequences before going ahead with any type of clustering, as follows:

Dereplication (the simplest clustering method, effectively producing 100% OTUs, i.e., all unique sequences observed in the dataset) is also depicted in the demultiplexing and denoising workflow, and is the necessary starting point to all other clustering methods in QIIME 2

dada2, on the other hand, is mentioned for denoising samples before dereplicating them.

So they are basically used in this order: dada2, vsearch for dereplicating, vsearch for clustering…

Thanks to both of you for your answers, I may try using dada2 directly then!

This was actually…extremely useful, because I thought dada2 was only a sort of denoiser… had no idea as I said before it could perform clustering too and thought vsearch was the only option.

Sounds interesting, anyway!

This is interesting, you actually are taking the 99_otus.qza and outputting the taxonomy. Replace 99_otus.qza with this:

--o-representative-sequences denoise/representative_sequences.qza

In the case of de-novo analysis, right?

Yes, apologies, I see this in that tutorial:

  1. We cluster sequences to collapse similar sequences (e.g., those that are ≥ 97% similar to each other) into single replicate sequences. This process, also known as OTU picking, was once a common procedure, used to simultaneously dereplicate but also perform a sort of quick-and-dirty denoising procedure (to capture stochastic sequencing and PCR errors, which should be rare and similar to more abundant centroid sequences). Use denoising methods instead if you can. Times have changed. Welcome to the future. :sunglasses:

Thanks to you too, I see that the tutorial still suggests vsearch as well. I reviewed this, and I think that some labs really like using closed-reference clustering methods. We got DADA2 tuned really well for what we use it for, so we moved away from the other clustering methods that use closed-reference picking. Ben

edit:

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads 99_otus.qza \
  --o-classification taxonomy.qza

Yes, the input reads should be your representative reads (the sequences pulled from either DADA2/denoise or Vsearch). Ben

Yes, this is usually necessary prior to any clustering — except since you have already denoised the sequences they are dereplicated as a part of that process. No need to run vsearch-dereplicate if you are clustering denoised sequences.

I think the tutorial text is a bit confusingly written. It is not saying you should dereplicate before or after denoising methods, rather that sentence is only meant to indicate that the upstream steps for dereplicate are shown in that section.

dada2 is not really clustering. It is dereplicating, and you can describe the error correction step as a sort of clustering, but that’s a bit of a misrepresentation/oversimplification.

I endorse @ben and @llenzi’s points:

  1. you do not need to cluster after denoising with dada2. As the overview tutorial indicates, denoising is a better method overall. Some labs do like to cluster afterwards and that is okay if that’s your lab — but I do not really recommend that, since you lose information that dada2 was designed to capture.
  2. if you do use closed-reference OTU clustering you do not need to perform taxonomy classification afterwards, or at least don’t classify the 99% OTU reference sequences! You can just use the corresponding reference taxonomy directly.
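Point 2 can be sketched as below, as an illustration only. It assumes the table came out of closed-reference clustering against the same Greengenes release as the imported reference taxonomy, so the feature IDs already match; the wrapper name `barplot_closed_ref` and the output name in the usage line are hypothetical.

```shell
# barplot_closed_ref TABLE REF_TAXONOMY METADATA OUT_QZV
# Closed-reference tables carry the reference OTU IDs, so the imported
# reference taxonomy can be passed to the barplot without classification.
barplot_closed_ref() {
  table=$1
  ref_taxonomy=$2
  metadata=$3
  out_qzv=$4

  qiime taxa barplot \
    --i-table "$table" \
    --i-taxonomy "$ref_taxonomy" \
    --m-metadata-file "$metadata" \
    --o-visualization "$out_qzv"
}
```

With the file names used earlier in this thread, a call could be `barplot_closed_ref table-cr-99_new.qza ref-taxonomy.qza metadata.tsv taxa-bar-plots-cr.qzv`.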

Wait it does? That seems to be exactly opposite of the quote you posted, especially this line:

Use denoising methods instead if you can.

I would really recommend using dada2 alone without additional clustering, unless you must for some other reason (e.g., following lab protocol or for technical reasons).

Sorry, I meant to say that it seems that it’s built into the pipeline for readers. It seems complicated for someone who doesn’t know the methods to use them like @sparkle did. Ben

We always welcome contributions — the online tutorials are in spirit community-edited documents and we would welcome a pull request to the source documentation if you catch errors or recommend clarification. :wink:
