Unable to classify reads

shashankgpt · January 27, 2020, 3:47pm

Hi,
We ran a small analysis with two primer sets (V1V3: 27f534r and V3V4: 341f806r) using 2x300 bp Illumina sequencing, and here are the top genera:

Genus	V1V3 (%)
Cutibacterium	23.7812
Kingdom_Bacteria	14.2746
Micrococcus	4.5218
Family_Corynebacteriaceae	4.4657
Family_Moraxellaceae	4.2813
Family_Deinococcaceae	3.5192
Class_Oxyphotobacteria	2.8539
Kocuria	2.7528
Family_Xanthobacteraceae	2.6409

Genus	V3V4 (%)
Escherichia_Shigella	7.753
Staphylococcus	7.392
Acinetobacter	6.8283
Cutibacterium	4.9483
Family_Moraxellaceae	4.6208
Micrococcus	4.519
Streptococcus	4.0931
Corynebacterium_1	3.9415

I was wondering why we are getting such a high percentage of unclassified Bacteria with the V1V3 primers, and why the majority of the top genera are classified only to Family or Class level.

For this, we used the Silva 132 release.

jwdebelius · January 27, 2020, 3:50pm

Hi @shashankgpt,

What classifier did you use? Did you use one of the ones off the site and if so, which, or did you train your own?

Best,
Justine

shashankgpt · January 27, 2020, 3:53pm

We used a Naive Bayes classifier, which we trained ourselves using our own primers.
Best,
Shashank

jwdebelius · January 27, 2020, 4:27pm

Hi @shashankgpt,

Okay, then that decreases the probability of misclassification. Have you also double checked that you've used the correct classifier?

What does your confidence look like on your assignments in each region?
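One way to look at this is to group the confidence values by how deep each assignment goes. The sketch below is only an illustration: it assumes the taxonomy has been exported to a tab-separated table with `Feature ID`, `Taxon`, and `Confidence` columns and that ranks are separated by semicolons. The function name and the example rows are made up.

```python
import csv
import io
from collections import defaultdict
from statistics import median


def confidence_by_depth(tsv_text):
    """Group classifier confidence values by assignment depth.

    Depth 1 means only the kingdom was assigned; larger numbers mean
    deeper assignments (assuming semicolon-separated ranks).
    """
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        ranks = [r for r in row["Taxon"].split(";") if r.strip()]
        groups[len(ranks)].append(float(row["Confidence"]))
    return {
        depth: {"n": len(vals), "min": min(vals), "median": median(vals)}
        for depth, vals in sorted(groups.items())
    }


# Made-up rows for illustration only, not values from this dataset.
example = (
    "Feature ID\tTaxon\tConfidence\n"
    "asv1\tD_0__Bacteria\t0.85\n"
    "asv2\tD_0__Bacteria\t0.99\n"
    "asv3\tD_0__Bacteria;D_1__Actinobacteria\t0.95\n"
)
for depth, stats in confidence_by_depth(example).items():
    print(depth, stats)
```

If the kingdom-only assignments have high confidence, the classifier is fairly sure they are bacterial but cannot place them any deeper.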

Best,
Justine

Rosie · January 28, 2020, 12:44pm

Hello Justine,

I am working with Shashank.
I have checked the confidence file for V1-V3 primers and the lowest value is 0.81 for an ASV assigned to Kingdom Bacteria.
We have checked the classifier and it is the correct one.
Best,

Rosangela

jwdebelius · January 28, 2020, 12:48pm

Hi @Rosie,

Welcome! Thanks for double checking for me. I have been known (multiple times) to mix up my classifiers, so verification helps. It sounds like it's classifying accurately. My guess is that the V1-3 coverage in Silva isn't as good, but I'm going to tag @SoilRotifer, who is the Silva expert.

Best,
Justine

shashankgpt · January 28, 2020, 1:11pm

Hi @jwdebelius,

While training the classifier, I got a warning:

Saved TaxonomicClassifier to: silva_V1V3_classifier.qza
/mibi/users/jsb562/miniconda2/envs/qiime2-2018.11/lib/python3.5/site-packages/q2_feature_classifier/classifier.py:101: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.19.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
warnings.warn(warning, UserWarning)

And the scikit-learn version I am using is:

conda list
scikit-learn 0.19.1 py35_nomklh26d41a3_0

I suppose this is not a problem here, right?

jwdebelius · January 29, 2020, 9:14am

Hi @shashankgpt and @Rosie,

No, the classifier warning isn't the issue; it just lets you know that you won't be able to easily reuse the classifier with another version of sklearn. (I tend to re-train my classifiers.)

A few colleagues have suggested other potential issues and solutions.

First, what happens when you blast some of the unclassified sequences? How do they look in NCBI?

What did your primer removal look like? How did you filter nonbiological sequences (PhiX, etc.)?

Have you tried looking at what happens with the full-length classifier from the data resources page?

Best,
Justine

Rosie · January 29, 2020, 1:02pm

Hello @jwdebelius,

We are currently trying to repeat the pipeline without trimming the database, using Silva 132 99% OTUs full-length sequences.

I have tried to BLAST some ASVs that were unclassified at different taxonomic levels. Sometimes I get a classification, but only from partial sequences. Most of the time I get unclassified in NCBI too.

We use the Illumina MiSeq platform, and the fasta files come out demultiplexed and with PhiX reads already removed. We need to remove primers, though.
What do you mean when you ask about primer removal?

Thank you!

Rosangela

SoilRotifer · January 29, 2020, 2:24pm

Hi @Rosie,

Depending on how your sequencing facility generates the sequence data, the primers used to amplify the target marker gene may be present in your actual sequence. These should be removed prior to any downstream analyses. QIIME 2 provides a way to remove these via q2-cutadapt.
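A quick way to see whether primers are still attached is to check how many reads begin with the primer sequence. The sketch below is a rough illustration, not a replacement for q2-cutadapt: it does exact matching with IUPAC degenerate bases at the very start of the read and tolerates no mismatches or offsets. The 27F sequence shown is an assumed, commonly cited version; substitute the primer actually used. The function names and toy reads are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a regex that matches the primer at the start of a read."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def fraction_with_primer(reads, primer):
    """Return the fraction of reads that begin with the given primer."""
    pattern = primer_regex(primer)
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if pattern.match(read.upper()))
    return hits / len(reads)


# Assumed 27F sequence; replace with the primer used in your protocol.
PRIMER_27F = "AGAGTTTGATCMTGGCTCAG"

# Toy reads for illustration, not real data.
toy_reads = [
    "AGAGTTTGATCATGGCTCAGGATGAACG",  # primer still attached (M = A)
    "AGAGTTTGATCCTGGCTCAGATTGAACG",  # primer still attached (M = C)
    "GATGAACGCTGGCGGCGTGCTTAACACA",  # primer already removed
]
print(fraction_with_primer(toy_reads, PRIMER_27F))
```

A fraction near 1 on the raw forward reads suggests the primers are still present. A fraction near 0 after trimming suggests they were removed.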

As for the other points brought up, I agree with @jwdebelius. It is entirely possible that you may have many sequenced off-targets in your data, which can be quite common if these are microbial data from a host organism, even with other primer sets. Are these skin samples? I ask because V1-V3 is quite popular for that body site.

-Best wishes!

Rosie · January 29, 2020, 2:43pm

Hello @SoilRotifer,

Yes, we removed primers using cutadapt.

And yes, the samples are skin swabs.

Nicholas_Bokulich · January 29, 2020, 3:24pm

Could you count the number of sequences before and after extracting with each primer pair? I have a sneaking suspicion it could be a primer coverage issue with the database.
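As a minimal sketch of that count, assuming the reference sequences before and after extraction have been exported to plain FASTA files (for example with `qiime tools export`), something like the following would do. The file names and function names are hypothetical, and the demo writes tiny toy files so it runs on its own.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def retained_fraction(before_path, after_path):
    """Fraction of reference sequences that survive primer extraction."""
    before = count_fasta_records(before_path)
    after = count_fasta_records(after_path)
    return after / before if before else 0.0


# Toy files for illustration; point these at your exported FASTA files.
with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "full_length.fasta")
    extracted = os.path.join(tmp, "extracted_V1V3.fasta")
    with open(full, "w") as fh:
        fh.write(">ref1\nACGT\n>ref2\nACGT\n>ref3\nACGT\n>ref4\nACGT\n")
    with open(extracted, "w") as fh:
        fh.write(">ref1\nAC\n>ref3\nAC\n>ref4\nAC\n")
    print(retained_fraction(full, extracted))
```

If one primer pair retains a much smaller fraction of the reference than the other, that points to a coverage gap in the database for that region.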

Otherwise this could just boil down to differences in primer bias. Usually when we see classification at kingdom level it is non-target DNA... it's possible that you are getting more amplification of host DNA from skin with the V1-3 primers than the V3-4.

Rosie · February 4, 2020, 1:33pm

Hello,

Sorry for the late reply.

I have 3,790,142 input reads and 2,375,818 denoised reads for 32 samples.

@shashankgpt has redone the analysis with different DADA2 parameters. Now we have 10% unclassified Bacteria (instead of 14%).
We have also removed the ASVs assigned to the host genome (533 out of 3192 ASVs), which leaves 5% unclassified Bacteria in the final V1V3 dataset.

In the V3V4 dataset, unclassified Bacteria are less than 1% (and host ASVs are 179 out of 2213).
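For reference, these counts can be turned into percentages with a few lines of Python. This is just arithmetic on the numbers reported in this post; the helper name is made up. Note that the host figures are fractions of ASVs, not of reads.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts reported in this thread.
input_reads = 3_790_142
denoised_reads = 2_375_818
host_asvs = {"V1V3": (533, 3192), "V3V4": (179, 2213)}

print(f"Reads kept after denoising: {percent(denoised_reads, input_reads):.1f}%")
for region, (host, total) in host_asvs.items():
    print(f"{region}: {percent(host, total):.1f}% of ASVs assigned to host")
```

That works out to about 62.7% of reads kept after denoising, and about 16.7% (V1V3) versus 8.1% (V3V4) of ASVs assigned to the host.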

I will try to analyze data I produced with Silva 132 99% full-length and see if I can find strong differences.
Thank you!

Rosangela

Nicholas_Bokulich · February 4, 2020, 3:03pm

5% at the end of the day is not too bad... I'd recommend using NCBI BLAST to spot check a few of these first (excluding uncultured organisms), but it's not too uncommon to get a small number of non-target hits that cannot be classified and should be thrown out. Sounds like you did that already and mostly got unclassified hits in NCBI too.

So that's a pretty good indication that these are just junk / non-target reads!

It's not too surprising either that one primer set is more prone to non-target hits than another... I recommend just tossing those unclassified reads and proceeding.
