Only one feature from 18S classifier

Ellenphant · April 18, 2019, 2:46am

Hello QIIME2 friends,

I recently got back 18S data and trained an 18S classifier (currently using QIIME 2 2018.11), but when I use it, all of my features are classified as the same organism! It seems to be a very similar issue to this post from last year. I had someone who uses an older version of QIIME and has their own 18S classifier try classifying my data, and the features were all classified differently (as expected from an environmental sample!), so I think the problem lies in my classifier.

I saw that the 'easiest solution' posted in that other thread is to use the full-length 18S classifier, but I wasn't sure what that means exactly. Can someone explain the process of creating a full-length 18S classifier? Sorry if this is a stupid question!

Nicholas_Bokulich · April 18, 2019, 12:38pm

Hi @Ellenphant,
This is a problem with the classifier that you trained. As you have read, this occurs when unusually short sequences are used to train the classifier, and the classifier gets confused!

You can use the --p-min-length and --p-max-length parameters of qiime feature-classifier extract-reads to exclude very short sequences when you trim your reference sequences prior to training. (I am assuming you ran extract-reads without those settings, and that is where the short sequences are coming from.)

What organism is that? You can use qiime taxa filter-seqs to filter the sequences you used for classifier training, then export the filtered sequences to see if any are unusually short. This will confirm that you are suffering from the same issue, and the proposal above will fix this.
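Once the filtered sequences are exported, a quick length check could look something like the sketch below. It is only an illustration: it assumes the export is a plain FASTA file (for example the dna-sequences.fasta that qiime tools export typically writes), the function names are made up for this example, and the tiny in-memory records stand in for real data.

```python
from io import StringIO


def read_fasta_lengths(handle):
    """Return {sequence_id: length} for a FASTA file-like object."""
    lengths = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID
            current_id = line[1:].split()[0]
            lengths[current_id] = 0
        elif current_id is not None:
            lengths[current_id] += len(line)
    return lengths


def find_short(lengths, min_length):
    """Return IDs of sequences shorter than min_length, shortest first."""
    short = [seq_id for seq_id, n in lengths.items() if n < min_length]
    return sorted(short, key=lambda seq_id: lengths[seq_id])


# Made-up records standing in for an exported FASTA file (hypothetical data)
example = StringIO(">seqA\nACGTACGTAC\nACGT\n>seqB\nACG\n>seqC\nACGTACGT\n")
lengths = read_fasta_lengths(example)
short_ids = find_short(lengths, min_length=5)
print(lengths)
print(short_ids)
```

For a real file, replace the StringIO object with open("dna-sequences.fasta") and pick a min_length that makes sense for your amplicon.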

Not a bad question! You can either train your classifier on sequences that have not been trimmed to the amplicon site of interest, or just use the pre-trained SILVA classifier that we provide on the QIIME 2 website.

But extracting reads and adjusting the min-length and max-length parameters appropriately will be a better solution.

Let us know if that helps!

Ellenphant · April 18, 2019, 1:52pm

Everything is currently being classified as Metschnikowia krissii. I looked it up on SILVA and it has a sequence length of 1727?

Yes, the first time I did not use the min-length and max-length parameters. Is there any trick to choosing values for it?

Great, so for doing the full length classifier would that mean I just don't input the primer sites?

Thanks for all the help, I will get to work trying things! I was suspicious of the classifier this time because when I trained it, it only took 30 seconds... and normally it takes a biiiiiit longer to do.

Nicholas_Bokulich · April 18, 2019, 2:00pm

That's the full-length sequence. What about after you have trimmed the sequences with extract-reads? I suspect it is a poor match for your primers and you are getting very short Metschnikowia krissii sequences out the other end.

Another trick is to adjust the --p-identity parameter (e.g., maybe raise it to 0.9) to reduce the mismatch tolerance for primer binding.
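To get a rough feel for what a higher identity threshold means, here is a deliberately simplified sketch. It assumes identity is just the fraction of primer positions that must match, which is not necessarily how extract-reads scores its alignments, and it ignores degenerate bases. The primer string and the function name are hypothetical.

```python
import math


def max_mismatches(primer, identity):
    """Mismatches tolerated if identity is the matching fraction of the primer.

    Simplified model for illustration only, not the aligner's actual scoring.
    """
    n = len(primer)
    # Round before taking the ceiling to guard against floating-point noise
    required_matches = math.ceil(round(identity * n, 9))
    return n - required_matches


primer = "ACGTACGTACGTACGTACGT"  # hypothetical 20-nt primer
for identity in (0.8, 0.9):
    print(identity, max_mismatches(primer, identity))
```

Under this simplified model, going from 0.8 to 0.9 on a 20-nt primer halves the number of tolerated mismatches, which is the sense in which a higher value is stricter.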

The trick to choosing min-length and max-length is to know your expected amplicon length and the amount of variation you would expect (if you do not know, give it a margin of maybe 100 nt to either side of the mean amplicon length).
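The margin rule of thumb can be written down as a small helper. This is only a sketch: the function name is made up, and the 380 nt amplicon length is an invented example value, not something from this thread.

```python
def length_bounds(expected_length, margin=100):
    """Return (min_length, max_length) around an expected amplicon length."""
    if expected_length <= 0 or margin < 0:
        raise ValueError("expected_length must be positive and margin non-negative")
    # Never let the lower bound drop below 1 nt
    min_length = max(1, expected_length - margin)
    max_length = expected_length + margin
    return min_length, max_length


# 380 nt is a hypothetical mean amplicon length used only for illustration
bounds = length_bounds(380)
print(bounds)
```

The two numbers would then be the candidates for the min-length and max-length settings, tightened or loosened if you know how variable your amplicon really is.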

No, just do not use extract-reads. Use the full sequences.

Oh wow, a drop in training time is usually a bad sign... check your command carefully to make sure you are using the correct primers, etc. This might not be an issue with including very short sequences. Instead, extract-reads may be excluding most sequences, either because they are poor hits (e.g., you are inputting the wrong primers) or because your amplicon is actually much longer than the default max-length setting allows.
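One way to check whether most sequences are being dropped is to compare record counts before and after extraction. The sketch below assumes both the reference sequences and the extracted reads have been exported to FASTA. The function names are made up and the in-memory records are placeholders for real files.

```python
from io import StringIO


def count_records(handle):
    """Count FASTA records by counting header lines."""
    return sum(1 for line in handle if line.startswith(">"))


def retained_fraction(n_reference, n_extracted):
    """Fraction of reference sequences that survived extraction."""
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")
    return n_extracted / n_reference


# Hypothetical stand-ins for the exported reference and extracted FASTA files
reference = StringIO(">r1\nACGT\n>r2\nACGT\n>r3\nACGT\n>r4\nACGT\n")
extracted = StringIO(">r1\nCG\n")

n_ref = count_records(reference)
n_ext = count_records(extracted)
fraction = retained_fraction(n_ref, n_ext)
print(f"{n_ext}/{n_ref} reference sequences retained ({fraction:.0%})")
```

A very low retained fraction would point toward wrong primers or length limits that are too tight, rather than toward a handful of short sequences.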

Give all of that a try and let us know what you find...
