Same classifier, same otu, different sequencing run = different classification

shira · April 30, 2018, 2:07pm

I am analyzing results for the same sample sequenced by two different centers. One of the centers, center B, produced many more reads, but about 50% of them are not classified beyond Bacteria when following the tutorial pipeline. The odd thing is - you can find the same OTU being classified fine when coming out of sequencing center A but classified as Bacteria only coming from center B. The sequence is identical, same length, classified using the same classifier file, trained with the appropriate sequencing primers, which were the same for both sequencing runs. Any idea what's going on?
Thanks!

Nicholas_Bokulich · April 30, 2018, 4:53pm

Hi @shira,
Thanks for posting! Sounds like a very mysterious problem. Would you mind sharing the QZA files output by these methods so that I can inspect? You can send them directly to me via DM if you do not want these files to be publicly viewable on this forum.

This usually indicates the presence of contaminants or other sequences that are not similar to the reference sequences.

However, that sounds like it might not be the case here, since you report identical OTUs being assigned to very different taxa. Could you also please post a few of those examples?

Thanks!

shira · April 30, 2018, 6:40pm

Thanks for your help! I sent you the qza files for the problematic run.

Here is an example of one dominant feature showing up in both sequencing reaction -
aca380d3d06d653c6296f75364e4e6fb
TGGGGAATCTTGCACAATGGAGGAAACTCTGATGCAGCGATGCCGCGTGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTGTAGGGGAAGATAATGACTGTACCCTAAGAATAAGGTCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGACCTAGCGTAGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCCGTTGAGTTAGTTAATTGTGAAATCCCAAAGCTTAACTTTGGAACTGCAATTAAAACTGCTCGACTAGAGTTTGATAGAGGAAAGCGGAATACATAGTGTAGAGGTGAAATTCGTAGATATTATGTAGAACACCAGTTGCGAAGGCGGCTTTCTGGATCAACACTGACGCTGAGGCGCGAAAGTATGGGTAGCAAAGA

classified as:
k__Bacteria 0.9575929866612488

or in the other run:
k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Pelagibacteraceae; g__; s__ 0.9999914303707826

I notice now that the confidence for the kingdom-only assignment is actually lower than the confidence value for the other taxonomic assignment. As if in the first classification run even kingdom level assignments are questionable...

BenKaehler · April 30, 2018, 9:33pm

Hi Shira, nice bug, thanks for posting it.

It's happening because for one of the samples the auto-read-orientation-detection function is being fooled:

$ qiime feature-classifier classify-sklearn --i-reads aca380d3d06d653c6296f75364e4e6fb.qza --i-classifier gg-13-8-99-nb-classifier.qza --o-classification forward.qza --p-read-orientation same
Saved FeatureData[Taxonomy] to: forward.qza
$ qiime tools export forward.qza --output-dir .
$ cat taxonomy.tsv 
Feature ID	Taxon	Confidence
aca380d3d06d653c6296f75364e4e6fb	k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Pelagibacteraceae; g__; s__	0.9108631646653241
$ qiime feature-classifier classify-sklearn --i-reads aca380d3d06d653c6296f75364e4e6fb.qza --i-classifier gg-13-8-99-nb-classifier.qza --o-classification reverse.qza --p-read-orientation reverse-complement
Saved FeatureData[Taxonomy] to: reverse.qza
$ qiime tools export reverse.qza --output-dir .
$ cat taxonomy.tsv 
Feature ID	Taxon	Confidence
aca380d3d06d653c6296f75364e4e6fb	k__Bacteria	0.9518916983253689

The autodetect function works by checking the median confidence of the first 100 reads in both orientations. Is it possible that for the run that gave you the k__Bacteria classification that there was a mix of read orientations?

BenKaehler · April 30, 2018, 9:43pm

I've posted a suggestion on the repo that we warn when the autodetector detects reverse read orientation.

Nicholas_Bokulich · April 30, 2018, 9:59pm

@shira — note that you can override this behavior by setting classify-sklearn's read-orientation parameter to same or reverse-complement if you know the orientation of your reads relative to the reference sequences.

If your reads are in a mixed orientation, this workaround will not help — note that mixed-orientation reads are also a problem for denoising/OTU picking, if your reads are mixed-orientation on one of these runs we can point you in the direction of other forum posts that may be useful.

shira · May 1, 2018, 12:06pm

Your test is very convincing. I looked into this a bit, and the data we get is not supposed to have mixed orientation. I ruled out having a simple typo in the manifest file, since mis-oriented reads seem to be present in all samples.

Tried doing this and generated an all-fwd taxonomy and an all-rev-comp taxonomy. Here is what I found:

The problem persists and the two generated bar-plots both have 50% k__Bacteria classified features.
Inspecting one very highly represented OTU I find that it always shows up in one orientation only in the rep-seqs, you never find its reverse compliment OTU (tried even looking for the reveres and the compliment just in case). This is an OTU that is equally represented in all samples and takes up about 7% of the reads in each (25,000 frequency out of a total of 350,000 reads).

Given these finding - would you still say this looks like mixed orientation reads? If so, how come this abundant otu always shows up in the same orientation? Many thanks...

shira · May 2, 2018, 12:10pm

OK - so I determined that the taxonomy artifact is a function of the denoising step. I am looking at an amplicon that is around 450bp. Below are the taxonomy results of the same sample after using two different sets of dada2-denoise parameters. How does this happen? Could you please direct me towards guidelines to help determining the best parameters to use while denoising?

--p-trim-left-f 17
--p-trim-left-r 21
--p-trunc-len-f 301
--p-trunc-len-r 200

--p-trim-left-f 17
--p-trim-left-r 21
--p-trunc-len-f 277
--p-trunc-len-r 223

BenKaehler · May 2, 2018, 8:12pm

Thanks @shira,

The problem persists and the two generated bar-plots both have 50% k__Bacteria classified features.

Does that mean even when you set --p-read-orientation same, you get 50% k__Bacteria?

would you still say this looks like mixed orientation reads? If so, how come this abundant otu always shows up in the same orientation?

Mixed-orientation reads were speculation on my part for why the autodetector is being fooled. To get to the bottom of it I will have to look at the data. Do you mind if @Nicholas_Bokulich shares it with me?

As for why tweaking the dada2 parameters impacts this problem, I will let my learned colleagues respond to that one.

shira · May 2, 2018, 8:39pm

Thanks for your input - I sure am learning a lot here!

The answer is yes - if I set the orientation in either of the two directions I get 50% k__Bacteria.

If I tweak the denoise a bit by trimming the reverse reads more and the forward reads less then I lose the k__Bacteria as you can see, but I also lose about 50% of my features.

Please go ahead and look at the data I shared with @Nicholas_Bokulich, thanks.

BenKaehler · May 2, 2018, 11:49pm

Ok, so it appears the "scot" dataset is the one that you're having trouble with. I calculated the confidences for these using the standard full-length 16S classifier forcing both same and reverse-complement. The autodetector uses the first 100 sequences. The results for those sequences are below.

Looking at the sequences, there is definitely something funny about the ones that look like they're reversed. You can spot them in the following table where "Reverse" is greater than "Same". Note that these correlate almost exactly with the first nucleotide not being a T. The funniness goes past the first nucleotide, but I didn't include it to avoid publishing your data here and because I don't need to.

So in summary, there is something odd about exactly half the first 100 sequences, so I would guess that there is something odd about half of the rest of them, too. That's what is fooling the autodetector.

Sorry that doesn't solve your problem but it does reduce its scope.

Same	Reverse	First Nucleotide
1.000	0.002	T
0.869	0.002	T
0.002	0.619	A
0.998	0.002	T
0.002	0.957	A
0.003	1.000	A
0.002	0.614	A
1.000	0.002	T
0.002	0.990	A
0.015	0.002	T
0.002	0.793	A
0.939	0.002	T
1.000	0.003	T
1.000	0.002	T
0.002	0.925	A
0.994	0.003	T
0.981	0.002	T
0.002	1.000	T
0.002	1.000	A
1.000	0.002	T
0.002	0.652	A
0.744	0.002	T
0.001	0.535	A
1.000	0.002	T
0.669	0.002	T
0.002	0.996	A
0.002	1.000	A
0.982	0.002	T
0.002	0.982	A
0.999	0.002	T
0.003	1.000	T
0.999	0.002	T
0.002	0.871	A
0.002	0.871	A
0.992	0.002	T
0.002	0.820	A
0.959	0.001	T
0.882	0.002	T
1.000	0.001	T
1.000	0.001	T
0.002	0.914	A
0.002	0.988	A
0.965	0.002	T
0.371	0.002	T
0.002	0.826	A
0.077	0.003	T
0.002	1.000	A
0.890	0.001	T
0.002	1.000	A
0.955	0.002	T
0.002	0.970	A
0.999	0.002	T
0.002	0.527	A
0.537	0.002	T
0.002	1.000	A
0.935	0.002	T
0.342	0.002	T
0.002	1.000	G
0.987	0.002	T
0.003	0.999	A
0.002	0.998	A
0.998	0.002	T
0.003	0.162	A
0.002	0.988	A
0.002	0.998	T
0.999	0.002	T
0.972	0.002	T
0.002	1.000	A
0.002	0.714	A
0.002	0.813	A
0.002	0.993	A
0.002	1.000	A
1.000	0.002	T
0.999	0.002	T
0.002	0.974	A
0.999	0.002	T
0.002	0.996	A
0.002	0.999	A
0.002	0.568	A
0.998	0.002	T
0.002	0.998	C
0.997	0.002	T
0.003	0.579	A
0.002	0.973	A
0.002	0.997	A
0.034	0.002	T
1.000	0.002	T
0.412	0.002	T
0.001	1.000	A
0.002	0.427	A
0.002	0.997	A
0.952	0.002	T
0.002	0.584	G
0.002	0.415	A
0.782	0.003	T
0.813	0.002	T
0.769	0.002	T
0.002	1.000	A
0.002	0.957	A
0.002	0.934	A

Nicholas_Bokulich · May 3, 2018, 12:29am

Do you have dada2 run statistics? E.g., on the number of input and output reads? Changing dada2 parameters may result in many more reads being dropped, e.g., due to low-quality bases on the reverse read being included, causing dramatic shifts in taxonomic profiles.

The QIIME 2 tutorials do discuss trimming guidelines a bit, but you can also go to the source: dada2 tutorial for guidelines.

If you could share the read quality profiles of these reads prior to trimming, we could help advise on specific settings for these runs.

Given what @BenKaehler is finding, I would say yes it looks like you probably have mixed-orientation reads, unless if these "suspect" reads that Ben is finding are something else, e.g., noisy reads that are confusing the autodetector but the other reads are okay. What protocol(s) are being used for library construction? Is the same library sent to both sequencing centers, or are you sending genomic DNA?

You could also try using the blast- or vsearch-based consensus taxonomy classifiers on your sequences to get a "second opinion". These would not be affected by mixed-orientation reads.

I hope that helps!