How to add taxonomy to feature-table.qza from open-reference-otu-picking

Hello,

I have been trying the different taxonomic classification methods in QIIME 2. My data come from an Ion Torrent PGM, and I amplified a fragment of the 18S gene. I used the SILVA 128 database throughout to compare my sequences.

My results with BLAST and usearch were fine, but the results with sklearn were not very good (maybe because I could not specify there that my sequences from the PGM can be in either the plus or the reverse direction).

Then I tried open-reference OTU picking with vsearch, and I got my feature table. Following the recommendations in another topic on this QIIME 2 forum (closed now), I converted my feature table to BIOM and then to TSV, in order to check what the table looks like and how similar it is to the OTU table from QIIME 1.

The feature table is exactly the same as in QIIME 1; only the names of the new reference sequences change (they now have a code). However, I do not have the taxonomy in the last column, after all my sample columns. So I have the accession numbers that correspond to my sequences in the SILVA database, but not the taxonomy (really the interesting part).

1- Could I use the command biom add-metadata to add my taxonomy? In that case, what happens to the sequences called “new sequences” (because this is an open-reference analysis)?

Thank you very much for your help in advance!
MMC

Hi @MMC_northS!

Please see this post for details on how to annotate a BIOM table with taxonomy as metadata:

From the docs:

The new reference sequences. This can be used for subsequent runs of open-reference clustering for consistent definitions of features across open-reference feature tables.

Hope that helps!

Hi @thermokarst,

Thank you for your answer.

So, I can export my feature table to .biom format and work with it in QIIME 1. But to add the taxonomy before exporting, do I need to run one of the classifier methods (vsearch, BLAST, …) in order to add taxonomy to my feature table from the open-reference OTU picking?

Another question that I had while running these steps:

If I want to do open-reference OTU picking, can/should I first run one of these methods: dereplication (vsearch plugin), deblur, or dada2? Take into account that I am using the Ion Torrent PGM platform.

Thank you for your help!
MMC

How do you know the results are good or bad? If you have a mock community or simulated sequences for which you actually know the correct taxonomy assignment, then it is possible to ascertain how well a taxonomy classifier is actually performing.

If you are assessing “goodness” using real samples, then you don’t really know the true composition. The taxonomic assignments that most of us like to see (species-level classification) may not actually be reliable, and can frequently be misclassified.

The naive bayes classifier in QIIME2 is set to provide fairly conservative classifications, in order to avoid false-positive errors (other classifiers can be prone to high false-positive error rates), but you can lower the confidence parameter to get deeper classification (with a slightly higher false-positive error rate). The qiime1 uclust classifier performs pretty well but can give false-positives. qiime1’s BLAST classifier gives very high false-positive rates at species level (since it always just reports the top hit). So unless you are testing this on mock communities or other samples with known composition, whatever qiime1 blast is telling you is probably wrong…

You cannot add taxonomy before exporting. You would export the feature table and taxonomy separately, then merge outside of QIIME2 using biom. In QIIME2 feature tables and taxonomy are always kept separate so that it is always clear where your feature metadata (e.g., taxonomy) is (it is never stored in a feature table).
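As a rough illustration of the merge step outside QIIME 2, the sketch below appends a taxonomy column to a feature table that has already been converted to TSV. It assumes a `#OTU ID` header row in the table and a taxonomy file whose first two columns are the feature ID and the taxon string; the function name, the example IDs, and the `Unassigned` placeholder are hypothetical. Features with no entry in the taxonomy file (for example, new reference OTUs that were never classified) simply get the placeholder.

```python
import csv
import io


def add_taxonomy_column(table_tsv, taxonomy_tsv, missing="Unassigned"):
    """Append a 'taxonomy' column to an exported feature table (TSV text)."""
    # Map feature ID -> taxon string from the exported taxonomy file.
    taxonomy = {}
    reader = csv.reader(io.StringIO(taxonomy_tsv), delimiter="\t")
    next(reader)  # skip the header row (assumed: "Feature ID", "Taxon", ...)
    for row in reader:
        if row:
            taxonomy[row[0]] = row[1]

    out = []
    for line in table_tsv.splitlines():
        if not line.strip():
            continue  # drop empty lines
        if line.startswith("#OTU ID"):
            out.append(line + "\ttaxonomy")  # extend the header row
        elif line.startswith("#"):
            out.append(line)  # keep other comment lines unchanged
        else:
            feature_id = line.split("\t", 1)[0]
            out.append(line + "\t" + taxonomy.get(feature_id, missing))
    return "\n".join(out) + "\n"
```

Any feature that ends up with the placeholder is a hint that its representative sequence still needs to be run through a classifier.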

To do open-ref otu picking, you must first do dereplication (I am assuming you already saw this in the otu picking tutorial). It is also possible to use deblur or dada2 prior to otu picking, in order to use their denoising methods, though then otu picking just eliminates useful information. OTU picking is sort of a crude denoising method — dada2 and deblur are more effective at removing errors, in my experience.

Hi @thermokarst!

Thank you for the information! That worked really well!

@thermokarst what do you think about this part?

Please read @Nicholas_Bokulich response above - he addressed this question.

Hello @Nicholas_Bokulich

Yes, I am testing the OTU clustering methods and classifiers in QIIME 2 with mock communities sequenced on an Ion Torrent PGM. When I said “not really good results”, it is because I cannot find the species that I used in the mock community. I have several replicates and two DNA extraction methods that I know extract good-quality DNA from these mocks.

Thanks for clarifying! Just making sure you have an appropriate standard here (many users have had the same questions, but are using real samples with unknown composition).

In that case, I’d recommend trying different confidence parameter settings with the naive bayes classifier to see what works for you. The default setting is fairly conservative/cautious and a “rough average” across many mock communities and data types. So since you have mock communities, you can use them to re-tune this classifier for your specific conditions, and very likely increase the accuracy (especially if other methods like uclust do get better results, that’s a pretty good indicator that more tuning would help).
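One simple way to compare settings against a mock community is to score the set of taxa each run reports against the expected composition at a chosen rank. The sketch below is only an assumption about how such a comparison could be done (presence/absence only, semicolon-delimited taxonomy strings, hypothetical function names); it is not the evaluation used in any particular benchmark.

```python
def truncate(taxon, level):
    """Keep the first `level` ranks of a semicolon-delimited taxonomy string.

    Returns None when the assignment is shallower than `level`.
    """
    ranks = [r.strip() for r in taxon.split(";") if r.strip()]
    return ";".join(ranks[:level]) if len(ranks) >= level else None


def score_against_mock(observed, expected, level):
    """Precision, recall and F-measure of the observed taxa at one rank."""
    obs = {truncate(t, level) for t in observed} - {None}
    exp = {truncate(t, level) for t in expected} - {None}
    true_pos = len(obs & exp)
    precision = true_pos / len(obs) if obs else 0.0
    recall = true_pos / len(exp) if exp else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Running this once per confidence setting (and per rank) gives a small table of scores, which makes it easier to see where a lower confidence starts adding false positives instead of recovering the expected taxa.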

There are also many other parameters under the hood, especially in how the classifier is trained, that could be adjusted here if you want to check out that pre-print and do a full parameter sweep to carefully tune to your data. That is probably much more work than would be worthwhile if you are looking for quick answers, but something to consider if you are trying to re-optimize for your data type.

Good luck!

@Nicholas_Bokulich OK, so I will try all the options and combinations. For the moment I am testing these three options:
1- deblur, open-ref-otu-pick, classifier (several methods)
2- dada2, open-ref-otu-pick, classifier (several methods)
3- dereplication, open-ref-otu-pick, classifier (several methods)

Following your comments, maybe I can also try:
4- deblur, classifier (several methods)
5- dada2, classifier (several methods)

My next question is: for these two options, what happens to the haplotypes (OTUs) that are not in the databases? Open-reference picking takes those reads into account, but if I run only dada2 or deblur, will they be taken into account or not?

My samples were amplified with an 18S marker, which we know also amplifies some 16S fragments from prokaryotes. That is why our mocks contain two eukaryote species and one prokaryote. As a first attempt, I tried the options that you pointed out:
dada2 + classifiers
deblur + classifiers
against the 18S SILVA database. I only found one of the eukaryote species, even though the other one is present in the database. Maybe this is a consequence of the fact that, with the Ion Torrent PGM, the reads in each sample can be in both directions, forward and reverse complement. I do not know. This is why I prefer BLAST or vsearch among the classifiers: they have the option to test both directions.

Now I am trying the options to create OTUs (or variants) against the complete SILVA database (16S + 18S).

:+1:

dada2 and deblur will retain these if they are not too error-filled. These methods are effectively database-agnostic so unlike reference-based OTU picking they do not care about how similar the sequence is to a database.

(this is not entirely true for deblur: it does a pre-filter to toss out sequences that have very low similarity to a reference database — 88% I believe — to remove sequences that are almost certainly sequencing artifact. But that is not really related to your concerns here)

If you were only seeing this issue with deblur, I would suspect that the pre-filter was the issue, and suggest using SILVA’s 16S + 18S database for the prefilter (sounds like that’s already your plan). But you are also seeing this with dada2, which should be entirely database agnostic. So not seeing 16S reads probably means that any that are present are determined to be error. (it is possible that dada2 is overly stringent at times — your analysis should help you decide).

I am not sure whether read direction is an issue with the naive bayes classifier. @BenKaehler might have some ideas about this. It looks like there is a read-orientation parameter and read orientation is autodetected by default, which may imply that reads are only processed in one direction. So having a mixture of read orientations could be problematic — especially if you are getting notably better results with blast/vsearch consensus classifiers, then I think your results are a pretty clear indicator (we have not run into this issue before since we are mostly dealing with Illumina data with one read orientation).

If read orientation is a problem, for now the only way to really compare naive-bayes performance would be to split your query sequences into two batches based on read orientation.

An off-topic reply has been split into a new topic: How to create a feature table with taxonomy feature metadata

Please keep replies on-topic in the future.

Hi @Nicholas_Bokulich,

I have tried the options that we discussed last time:

  1. deblur + classifier
  2. dada2 + classifier
    without doing open-reference OTU picking at any point.

Option 1 gave the best result, but it was not completely successful. Deblur was run with a trim length (I cannot use the -1 option because it gives an error, since the 18S and 16S sequences do not have the same length) and the remaining parameters at their defaults. The comparison was done against the complete SILVA 128 database (18S + 16S) in order to include the 16S sequences that we know we have. The classifier method was vsearch, because it gave the best result when I tried the same analysis with only the 18S database (as a reminder: BLAST was similar but a little worse, and sklearn failed completely). The classifier was run with 90% identity and 51% consensus with the reference (by default).
The result was reasonably good for one of the genera that we used, but the other one was detected at a really low proportion and only at order level or higher.
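To make the 51% consensus setting concrete, here is a simplified sketch of how a consensus taxonomy can be derived from the lineages of the accepted hits. It is an illustration under stated assumptions (semicolon-delimited lineages, a hypothetical function name), not the plugin's actual code. It also shows why an assignment can stop at order level: as soon as the hits disagree below a rank, the consensus is cut there.

```python
from collections import Counter


def consensus_taxonomy(hit_taxa, min_consensus=0.51):
    """Deepest lineage shared by at least `min_consensus` of the hits."""
    lineages = [tuple(r.strip() for r in t.split(";")) for t in hit_taxa]
    if not lineages:
        return "Unassigned"
    consensus = ()
    depth = 1
    while True:
        # Count lineage prefixes of the current depth across all hits
        # that are annotated at least this deep.
        prefixes = Counter(l[:depth] for l in lineages if len(l) >= depth)
        if not prefixes:
            break
        best, count = prefixes.most_common(1)[0]
        if count / len(lineages) < min_consensus:
            break  # hits disagree at this rank: stop here
        consensus = best
        depth += 1
    return ";".join(consensus) if consensus else "Unassigned"
```

With four hits where three share the first two ranks but only two share the third, the third rank reaches 50% agreement, which is below 51%, so the result stops at the second rank.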

For option 2, dada2 + the vsearch classifier gave a really low number of features. The parameters were --p-trunc-len 0 (as I understand it, no truncation) and --p-trunc-q 20 (as I understand it, base quality Q=20).

Given these results I can continue using deblur, but I want to improve the taxonomic classification. Do you know what I can do? Other private companies using my data got a better classification, so I know it is possible. Maybe the problem for the genus with the worse result is the trimming? I mean, maybe with the length that I used there is not enough specificity to assign the genus or the order? Do you know if there is any other way in QIIME 2 to use my complete sequences, without trimming, with vsearch or BLAST? Maybe more bases would give more information and a better taxonomic classification.

On another note, I want to test open-reference OTU picking at 99% identity, but I do not know whether I can use it without prior dereplication. For OTU picking I need the sequences and the table. I have the sequences filtered by quality with “qiime quality-filter q-score”, so can I use that sequences_filtered.qza and demux-table.qza? Or is another step mandatory?

Finally, do you know anything about the orientation of the reads within the same fastq file? @Nicholas_Bokulich, @BenKaehler

Thank you very much for your great help!
Marta.

That makes sense since you have 18S sequences in there.

Completely failed at species level? Or at all levels? Species level could make sense — the inclusion of 16S and 18S together could be reducing the diagnostic power of that classifier — but if this is failing completely then it probably indicates an issue with how the classifier was trained.

Try different parameter combinations for the classifiers. Since you have a mock community you can tune this a bit to optimize.

What companies? You are probably constrained by database quality here — most companies are using carefully curated reference databases, particularly for things like pathogens, which can sometimes be noisy/misclassified in public reference databases. If you could get a hold of a curated reference database for your species of interest, that could help improve precision.

Trimming could certainly be the issue — classifying only at order level would seem to indicate a lack of information, e.g., on a very short sequence.

The trimming is happening during denoising. So yes, you could try reducing the trimming by changing those parameters — but keep a close eye on the sequence yields. If these drop off dramatically, then you have reduced your trimming too much and low-quality reads are passing through, causing the whole read to be removed during denoising.

Indeed — but it is a tricky balance, per my comment above.

Yes, you must first dereplicate with q2-vsearch. I can almost guarantee that OTU picking results will not be as good as denoising…

It would also be possible to just dereplicate and use those sequences (without OTU picking). Though I would suggest that you at least filter these sequences to remove any observed at a low frequency (< 20? < 100? depends on how many total sequences you have and how stringent you want to be).
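Conceptually, dereplication followed by a frequency filter amounts to counting identical sequences and dropping the rare ones. The snippet below is a minimal sketch of that idea in plain Python, with a hypothetical function name and an arbitrary default threshold; it is not how the QIIME 2 plugins implement it.

```python
from collections import Counter


def dereplicate(reads, min_count=20):
    """Collapse identical reads and keep those seen at least `min_count` times."""
    counts = Counter(seq.upper() for seq in reads)  # case-insensitive match
    return {seq: n for seq, n in counts.items() if n >= min_count}
```

For example, with `min_count=2`, a sequence seen four times is kept with its count while a singleton is dropped. Trying a few thresholds on a mock community shows how quickly spurious sequences disappear compared with expected ones.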

I think that mixed read orientation will cause an issue, but let’s see what @BenKaehler has to say about this. If this is a limitation, you could devise a way to scan your fasta and reverse-complement any reads that are in the reverse orientation.
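One possible way to do that scan, sketched here under several assumptions, is to compare each read and its reverse complement with a reference sequence known to be in the forward orientation, and keep whichever orientation shares more k-mers. The function names, the value of `k`, and the use of a single reference are all hypothetical choices for illustration, and ambiguous bases are not handled.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all k-mers in a sequence, upper-cased."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_read(read, reference_kmers, k=8):
    """Return the read in whichever orientation shares more k-mers
    with the (forward-oriented) reference; ties keep the read as-is."""
    flipped = reverse_complement(read)
    forward = len(kmers(read, k) & reference_kmers)
    reverse = len(kmers(flipped, k) & reference_kmers)
    return flipped if reverse > forward else read
```

In practice the reference k-mers would be built once, for instance with `kmers(reference, 8)` over one or more forward-oriented reference sequences, and each read from the fasta would be passed through `orient_read` before being written back out.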

I hope that helps!

OK, I understand. But if I previously applied a Q=20 filter, the sequences that were kept in the output, and which are now my input to deblur, should have good quality. If the mean length of my reads is 117 bp, could I use that length to try to include the maximum number of reads in my analyses?
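One way to explore this question is to check how many reads would survive a given trim length. As far as I understand, deblur trims every read to the trim length and discards reads shorter than it, so the mean length would not include the maximum number of reads. The sketch below, with a hypothetical function name and made-up lengths in the usage note, just computes the retained fraction.

```python
def fraction_retained(read_lengths, trim_length):
    """Fraction of reads at least `trim_length` long.

    Assumes reads shorter than the trim length are discarded.
    """
    if not read_lengths:
        return 0.0
    kept = sum(1 for n in read_lengths if n >= trim_length)
    return kept / len(read_lengths)
```

For instance, with lengths `[90, 100, 117, 117, 130, 150]`, a trim length of 117 keeps four of six reads, while 100 keeps five of six; the trade-off is fewer bases per read for classification.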

OK, I am waiting for @BenKaehler's opinion. I think that if the direction is a problem, the problem comes from the denoising, because the classifier already has the option to try both directions (the --p-strand parameter), so I am very interested in knowing the answer.
Thank you