"Same" database, different outputs

Hi,
I'm using QIIME 2 for taxonomic analysis of fungal ITS sequences, with the UNITE 3.8 dynamic release as the reference database.

Keeping all other parameters constant, I compared the results from three classifiers trained on:

-UNITE 3.8, all Eukaryotes (PlutoF biodiversity platform)
-UNITE 3.8, Fungi only (PlutoF biodiversity platform)
-The same UNITE 3.8 Fungi release (PlutoF biodiversity platform), but trimmed with "qiime feature-classifier extract-reads" using the primer sequences.

Here are the resulting taxa barplots: Fungi_primer_trimmed.qzv (1.3 MB), Fungi.qzv (1.1 MB), Eukaryotes.qzv (381.4 KB)

Since the three results differ from each other, which one should I consider the most reliable?

Thanks

AC

Sorry, the file names are mixed up. The correct mapping is:

Eukaryotes.qzv is obtained from "Unite Fungi database" trimmed
Fungi.qzv is obtained with "Unite Fungi database" untrimmed
Fungi_primer_trimmed.qzv is obtained from "Unite Eukaryotes database"

Hi @Andrea_Colautti ,

Use the UNITE eukaryotes database without trimming.

Why No Trim: UNITE contains a mix of ITS sequences that were amplified with different primers and that even target different subdomains (ITS1 vs. ITS2), so many are not full-length ITS. Extracting the amplified region with your primers will discard reference sequences that cover the correct domain but already have the primers trimmed off. We have a note about this in this tutorial.

Why Eukaryotes: using the eukaryote database can marginally decrease accuracy but in your case you clearly have some non-fungal hits, e.g., to plants and metazoa. At the very least, use the fungal database but add in expected non-target hits (e.g., host ITS sequences). Otherwise a classifier that is only trained to identify fungi will only know about fungi, and be unable to identify non-fungal sequences. Ideally, this would lead to no classification and you can just remove the unclassified reads. But without some non-target hits in the reference, many non-fungal reads will be classified as fungi because this is the only kingdom represented.

There are ways to test this quantitatively (e.g., with known reference standards) but without such a standard I would personally trust the most complete database (untrimmed eukaryotes).

Thanks for the help!

I had read the advice to trim the database from the following link:

To trim or not to trim

One issue with ITS (and other marker genes with vast length variability) is readthrough, which occurs when read lengths are longer than the amplicon itself! The polymerase will read through the amplicon, the primer, the barcode, and on into the adapter sequence. This is non-biological DNA that will cause major issues downstream, e.g., with sequence classification. So we want to trim primers from either end of the sequence to eliminate read-through issues. Enter cutadapt. Note that we trim the forward primer and the reverse complement of the reverse primer from the forward reads (the forward primers have already been trimmed in the raw reads, but we will demonstrate forward + reverse trimming here since attempting to trim the forward read will not hurt). We trim the reverse primer and reverse complement of the forward primer from the reverse reads.
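To make the primer orientations in that passage concrete, here is a minimal Python sketch of the reverse complement that gets trimmed from the opposite read. The primer shown is the widely used ITS2 primer, included only as an assumed example; the primers actually used in this analysis are not stated in this thread.

```python
# Complement table for DNA bases, including IUPAC ambiguity codes,
# since primers are often degenerate.
_COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


# Assumed example primer (ITS2); replace with your own reverse primer.
reverse_primer = "GCTGCGTTCTTCATCGATGC"

# This is the sequence to look for at the 3' end of the forward reads
# when the read runs through the whole amplicon.
adapter_on_forward_reads = reverse_complement(reverse_primer)
print(adapter_on_forward_reads)
```

The printed sequence is what would be supplied as the 3' adapter for the forward reads; the same function applied to the forward primer gives the 3' adapter for the reverse reads.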

In this analysis, I am interested only in the classification of Fungi.

Should I therefore classify against the all-Eukaryotes database and then remove the non-fungal reads, despite the lower precision?

For sequences classified as "k__Fungi; __; __; __; __; __; __" can I consider them correctly classified as Fungi, or should we consider them as possible random matches and merge them with unidentified?

Thanks!

Hi @Andrea_Colautti ,

Yes. The lower precision is mostly at deeper taxonomic levels; a sequence classified as something other than fungi is quite clearly non-fungal in that case. So remove if you are only interested in the fungal communities.
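As a rough illustration of that removal step, here is a small Python sketch that keeps only features whose kingdom label is Fungi. The feature IDs and taxonomy strings are made up for the example, and the "k__" prefix format is assumed from the labels quoted in this thread. Within QIIME 2 itself, this kind of filtering is normally done with "qiime taxa filter-table" rather than by hand.

```python
def is_fungal(taxon: str) -> bool:
    """True if the first (kingdom) rank of a taxonomy string is k__Fungi."""
    kingdom = taxon.split(";")[0].strip()
    return kingdom == "k__Fungi"


# Hypothetical classifications from an all-eukaryotes classifier.
classifications = {
    "feature_1": "k__Fungi; p__Ascomycota; c__Sordariomycetes",
    "feature_2": "k__Viridiplantae; p__Anthophyta",
    "feature_3": "k__Metazoa; p__Arthropoda",
    "feature_4": "Unassigned",
}

# Keep only the features assigned to the fungal kingdom.
fungal_only = {fid: tax for fid, tax in classifications.items() if is_fungal(tax)}
print(sorted(fungal_only))
```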

I would usually discard these. Classifications that do not reach at least phylum level are usually very weak hits and probably are not fungi either (e.g., maybe a chimera that slipped through). But there has been a lot more discussion around this on the forum, along with other troubleshooting steps (e.g., manually BLAST some of these to check), so you can search the forum archive for more specific details. You can also use q2-quality-control to assess the similarity to fungal sequences and filter based on quantitative criteria.
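The phylum-level rule of thumb above can be sketched in a few lines of Python. This is only an illustration under assumptions: the taxonomy strings follow the "k__Fungi; __; __" style quoted earlier in this thread, and treating "unidentified" and "unassigned" as empty ranks is my own choice, not something the classifier enforces.

```python
# Names treated as "no assignment" at a rank (an assumption for this sketch).
PLACEHOLDERS = {"", "unidentified", "unassigned"}


def resolved_depth(taxon: str) -> int:
    """Count leading ranks that carry a real name, stopping at the first empty one."""
    depth = 0
    for rank in taxon.split(";"):
        label = rank.strip()
        # Strip a rank prefix such as 'k__' or 'p__'; a bare '__' leaves an empty name.
        name = label.split("__", 1)[1] if "__" in label else label
        if name.lower() in PLACEHOLDERS:
            break
        depth += 1
    return depth


def reaches_phylum(taxon: str) -> bool:
    """True if both kingdom and phylum are resolved (depth of at least 2)."""
    return resolved_depth(taxon) >= 2


# Hypothetical examples.
examples = [
    "k__Fungi; p__Ascomycota; c__Sordariomycetes",
    "k__Fungi; __; __; __; __; __; __",
    "Unassigned",
]
kept = [t for t in examples if reaches_phylum(t)]
print(kept)
```

Only the first example survives the filter; the kingdom-only and unassigned entries are the ones that would be discarded or checked manually with BLAST.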

good luck!
