QIIME2 Classifier Question

JamesF · April 3, 2019, 8:01pm

Hello,
I have a couple of issues that I’m dealing with:
1.) When I use the QIIME2 developers’ pre-trained Silva 132 99% OTUs classifier (full-length, seven-level taxonomy), some of my OTUs are classified as Aspergillus. However, when I BLAST one of those OTUs against the NCBI database, it comes back as Byssochlamys. I also checked the OTU against Silva directly and saw 96.5% similarity to Byssochlamys, and I was wondering whether that affected the QIIME2 developers’ classification. I figured this might be because I need to train a new classifier using the primer pairs used in my lab. Would training a new classifier help resolve this issue?

2.) The primers that I used cover all three domains of life, and I have been having trouble getting the classifier to classify eukaryotic OTUs correctly. I downloaded the Silva 132 database and trained a new classifier using Silva132_99 majority_taxonomy_7_levels. Unfortunately, when I used this classifier to assign taxonomy to my data set, it classified my OTUs as metazoans, which was even further off than the QIIME2 developers’ classifier. I’ve included my code for training the classifier below and was hoping to get some input on how to build a better classifier. I trained this classifier with QIIME2-2018.11 on Linux.

Code:

qiime tools import --type 'FeatureData[Sequence]' --input-path silva132_99.fasta --output-path silva132_99.qza

qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --input-path majority_taxonomy_7_levels.txt --output-path ref-majority7lvls.qza

qiime feature-classifier extract-reads --i-sequences silva132_99.qza --p-f-primer CCGTGYCAGCMGCCGCGGTAA --p-r-primer CCGYCAATTYMTTTRAGTTT --p-min-length 100 --p-max-length 400 --o-reads extractedref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads extractedref-seqs.qza --i-reference-taxonomy ref-majority7lvls.qza --o-classifier ParadaMajorTaxaClassifier.qza

I appreciate any feedback!

thermokarst · April 4, 2019, 4:34pm

I’ll let our classification expert @Nicholas_Bokulich answer this, but in the meantime, have you reached out to the SILVA team? Unless I am misunderstanding, it sounds like the root of your question has to do with the specifics of the SILVA database, which we don’t develop.

Nicholas_Bokulich · April 4, 2019, 9:26pm

That's not too similar. How confident are you that NCBI BLAST is giving you the correct answer? The sequence there could be misannotated...

How similar is your sequence to Aspergillus sequences in SILVA? If greater than 96.5%, then it sounds like Aspergillus is correct. If less than 96.5%, then chances are what you have is neither of those genera (or sequence error is muddling the similarity), but the kmer profile looks most similar to Aspergillus.

Examine (export) the output of the extract-reads command, and maybe tighten the min and max length parameters to filter the output. It is possible that your primers are not hitting many of the SILVA sequences, leading to a skewed reference database, so compare the number of sequences in the output with the number in the input. Also check whether the lengths of these amplicons make sense: very short reference sequences can lead to invalid results like the ones you are seeing.
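The count and length check described above can be sketched with a small shell function. This is only an illustration: `fasta_stats` is a made-up helper name, and the file paths in the usage comments are assumptions based on the commands earlier in this thread (an exported `FeatureData[Sequence]` artifact is expected to contain a file named `dna-sequences.fasta`).

```shell
# Hypothetical helper: print sequence count, min, max, and mean length
# for a FASTA file. Sequences wrapped over several lines are handled.
fasta_stats() {
  awk '
    # Fold the sequence just finished into the running totals.
    function tally() {
      if (have) {
        n++; total += len
        if (n == 1 || len < min) min = len
        if (n == 1 || len > max) max = len
      }
    }
    /^>/ { tally(); have = 1; len = 0; next }
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END {
      tally()
      if (n == 0) { print "seq count = 0"; exit }
      printf "seq count = %d\nmin length = %d\nmax length = %d\nmean length = %.2f\n", n, min, max, total / n
    }
  ' "$1"
}

# Usage (paths are assumptions, adjust to your own files):
#   qiime tools export --input-path extractedref-seqs.qza --output-path extracted_seqs
#   fasta_stats silva132_99.fasta                    # input reference
#   fasta_stats extracted_seqs/dna-sequences.fasta   # extracted amplicons
```

Comparing the two `seq count` lines shows how many reference sequences the primers failed to hit, and the length lines show whether the extracted amplicons fall in the expected range.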

Keep us posted!

JamesF · April 5, 2019, 4:18am

Hey Matthew and Nicholas,
I appreciate the replies! I’ve just finished looking at the % similarity of Aspergillus (95.5%) and Byssochlamys (96.6%) and the OTU is more similar to Byssochlamys using the Silva database.

I’m currently looking at the classifier I trained, but I was wondering if using a classifier trained with my primers would help resolve the taxonomy, or if the QIIME2 trained classifier is going to give similar results since they both use the same database. I’ll sort through the input and output files and post the results once I have them!

Nicholas_Bokulich · April 5, 2019, 12:32pm

Primers definitely impact this process, since they determine which sequences you hit. E.g., if your primers hit Byssochlamys but not Aspergillus, that will obviously rule out Aspergillus. Even if they hit both, extracting the amplicon region could improve classification, since classification is based on kmer frequencies, not on alignment, and so using the full-length 18S kmer frequencies may for some reason make this look more like Aspergillus.
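To illustrate the kmer point, here is a rough sketch that measures how many of a query's distinct kmers also occur in a reference sequence. It is not the naive Bayes classifier that QIIME2 fits; `shared_kmer_fraction` is a made-up helper and the default k of 7 is just an assumption for the example. It only shows that kmer overlap depends on which stretch of the reference is included.

```shell
# Hypothetical helper: fraction of the distinct k-mers in a query sequence
# that also occur somewhere in a reference sequence.
# Args: query sequence, reference sequence, k (assumed default: 7).
shared_kmer_fraction() {
  awk -v q="$1" -v r="$2" -v k="${3:-7}" 'BEGIN {
    q = toupper(q); r = toupper(r)
    # Index every k-mer in the reference.
    for (i = 1; i + k - 1 <= length(r); i++) ref[substr(r, i, k)] = 1
    # Walk the query and count each distinct k-mer once.
    for (i = 1; i + k - 1 <= length(q); i++) {
      kmer = substr(q, i, k)
      if (!(kmer in seen)) {
        seen[kmer] = 1
        total++
        if (kmer in ref) shared++
      }
    }
    if (total == 0) { print "0.000"; exit }
    printf "%.3f\n", shared / total
  }'
}

# Usage: compare one OTU against a full-length reference and against the
# same reference trimmed to the amplicon region (sequences not shown here).
#   shared_kmer_fraction "$otu_seq" "$full_length_ref"
#   shared_kmer_fraction "$otu_seq" "$amplicon_ref"
```

A real classifier weighs kmer frequencies across the whole training set, so treat this only as a way to see why full-length and primer-trimmed references can behave differently.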

JamesF · April 5, 2019, 4:55pm

I just finished counting the sequences: there are 425098 sequences in the Silva132_99 database and 366840 in my extracted_seqs folder. The sequence length statistics for the extracted_seqs are:

seq count = 366840
min length = 100
max length = 400
mean length = 373.16

My amplicons for this primer pair are normally 239 bp. If I raised my min length to 200, would that help?

Nicholas_Bokulich · April 6, 2019, 12:17pm

That's good: extraction is not dropping too many. It would be worth checking whether Byssochlamys and Aspergillus seqs are in there if these are important taxa to you.
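One way to run that check is sketched below, assuming the headerless taxonomy file maps each reference ID to a taxonomy string (ID, tab, taxonomy) and the extracted reads have been exported to FASTA. `genus_in_fasta` is a made-up helper, the genus match is a plain substring search, and the paths in the usage comments are assumptions taken from earlier in this thread.

```shell
# Hypothetical helper: count reference IDs whose taxonomy string mentions a
# genus, and how many of those IDs are still present in a FASTA file.
# Args: genus name, headerless taxonomy TSV (id<TAB>taxonomy), FASTA file.
genus_in_fasta() {
  awk -F '\t' -v genus="$1" '
    # First file: remember IDs whose taxonomy contains the genus name.
    NR == FNR {
      if (index($2, genus)) { want[$1] = 1; total++ }
      next
    }
    # Second file: the ID is the header text up to the first whitespace.
    /^>/ {
      id = substr($0, 2)
      sub(/[ \t\r].*/, "", id)
      if (id in want) found++
    }
    END { printf "%s: %d in taxonomy, %d in FASTA\n", genus, total, found }
  ' "$2" "$3"
}

# Usage (paths are assumptions, adjust to your own files):
#   genus_in_fasta Aspergillus  majority_taxonomy_7_levels.txt extracted_seqs/dna-sequences.fasta
#   genus_in_fasta Byssochlamys majority_taxonomy_7_levels.txt extracted_seqs/dna-sequences.fasta
```

If the second number is zero for either genus, the primers are not recovering any reference sequence for it, and the classifier cannot assign it.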

Do you expect that much variation? It would be worth setting these limits to whatever range you expect. We usually only see impacts on results with very short amplicons, but it never hurts to weed out more false-positive amplicons (you can also adjust the mismatch tolerance threshold to control for this).
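To pick sensible limits, it can help to look at how the extracted lengths are distributed rather than only at the min, max, and mean. The sketch below bins lengths from an exported FASTA file; `length_histogram` is a made-up helper and the default bin width of 50 bp is arbitrary. As far as I know, the mismatch tolerance mentioned above corresponds to the `--p-identity` option of extract-reads, but check `qiime feature-classifier extract-reads --help` for your release.

```shell
# Hypothetical helper: print a histogram of sequence lengths in a FASTA file.
# Args: FASTA file, bin width in bp (assumed default: 50).
# Output lines look like "<start>-<end><TAB><count>", sorted by bin start.
length_histogram() {
  awk -v width="${2:-50}" '
    # Add the sequence just finished to its length bin.
    function tally() { if (have) bins[int(len / width)]++ }
    /^>/ { tally(); have = 1; len = 0; next }
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END {
      tally()
      for (b in bins) printf "%d-%d\t%d\n", b * width, (b + 1) * width - 1, bins[b]
    }
  ' "$1" | sort -n
}

# Usage (path is an assumption, adjust to your own export directory):
#   length_histogram extracted_seqs/dna-sequences.fasta 25
```

Bins far from the expected amplicon size point to off-target hits, and those are the ones that tighter `--p-min-length` and `--p-max-length` values would remove.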

Lichen · April 18, 2019, 3:36pm

Mycologist here -

Just wanted to bring to the table the possibility that 18S (particularly this short fragment of it) may not carry enough information to resolve beyond Class or Order for this group. I’m not an expert on Eurotiomycetes, but for Sordariomycetes such as Xylaria and close relatives, for example, this can be the case.

Typically I would recommend using higher-than-genus-level classification for Ascomycota unless it is based on multilocus analysis, which doesn’t help much here because the two genera in question are also in separate families.

A quick look into the history of those groups indicates (unsurprisingly for Fungi, unfortunately) that sequences in public databases may indeed be incorrectly annotated (e.g., Byssochlamys appears to be one teleomorph of Paecilomyces spp., which have been shown to be polyphyletic based on 18S).
