Bug when classifying isolate 16S sequences

laurenmlui · October 15, 2019, 10:31pm

Hello! We thought that QIIME2 might be an easy way to classify some full-length 16S sequences from our isolates against the latest SILVA database, but when we try to classify all of our sequences, 72% are classified as Nanoarchaea. The strange thing is that when we use a small subset of the sequences (7-10), we don't get this bug; it only appears when we try to classify ~1000 sequences. The closest thing that I could find on the forum was this: Wrong 'taxonomy.qzv' file - #9 by Dchung

I am using the SILVA classifier from the QIIME2 downloads page. The FASTA file is one line per ID and sequence, so I don't think that the import format is the issue. The code that I'm using is below.

qiime tools import --input-path <input fasta> --output-path <sequence artifact> --type 'FeatureData[Sequence]'

qiime feature-classifier classify-sklearn --i-classifier silva-132-99-nb-classifier.qza --i-reads <sequence artifact> --o-classification <classification artifact>

Nicholas_Bokulich · October 15, 2019, 10:42pm

Hi @laurenmlui,
Very strange. To give me something to work with, would you mind running qiime metadata tabulate on the classification results for both the full dataset (~1000 seqs) and the non-buggy test subset (7-10 seqs), and posting the QZVs here?

Are your reads all in the same orientation? What happens when you add --p-read-orientation same (or --p-read-orientation reverse-complement) to the command? I suspect the orientation predictor is going haywire, so setting the correct orientation may set things straight.

Why do I think that? classify-sklearn should not yield random results. So historically any time we see "random" behavior on subsets of a dataset it is due to the orientation predictor making different interpretations off of different subsets. Sometimes this is because the first 100 or so reads used to predict orientation really are in different orientations. Sometimes it is because they are in mixed orientations and the predictor just chokes.
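To make the orientation issue concrete, here is a minimal Python sketch of checking each read against a reference sequence known to be in the expected orientation. It is not how classify-sklearn predicts orientation internally; the function names, the k-mer size of 8, and the idea of using a single reference sequence are all assumptions made for illustration. It only handles A, C, G, T, U and N, not the other IUPAC ambiguity codes.

```python
# Illustrative sketch only: hypothetical helpers, not part of QIIME 2.
# Translation table for complementing bases (U is treated like T).
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(read, reference, k=8):
    """Guess whether read matches reference as-is or reverse-complemented.

    Counts shared k-mers in both orientations and reports the better one.
    Returns "same", "reverse-complement", or "unknown" on a tie.
    """
    ref_kmers = kmers(reference, k)
    forward_hits = len(kmers(read, k) & ref_kmers)
    reverse_hits = len(kmers(reverse_complement(read), k) & ref_kmers)
    if forward_hits == reverse_hits:
        return "unknown"
    return "same" if forward_hits > reverse_hits else "reverse-complement"
```

Running something like guess_orientation over every read (not just the first 100 or so) would show whether the file contains a mix of orientations, in which case a single --p-read-orientation setting cannot be right for all reads.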

Since you are using full 16S, the issue might not necessarily be that the reads are in mixed orientations, but rather because you might have longer 16S sequences than those present in SILVA (e.g., if you are using primers located outside or partially overlapping whatever primers were used to generate the sequences in SILVA). Trimming to the same primer sites used in SILVA might solve this if orientation does not.

laurenmlui · October 16, 2019, 7:27am

Hi @Nicholas_Bokulich,

Thank you for your help. It looks like what you suspected about the read orientation was right: not all of the 16S sequences were in the same orientation. When I added --p-read-orientation same to the classification step, almost all of the sequences had classifications that made sense. I filtered out the ones that were still classified as Nanoarchaea and ran the classifier on those with --p-read-orientation reverse-complement, and the classifications were what I was expecting.
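For anyone repeating this two-pass approach, the filtering step can be sketched in Python roughly as follows. This is an illustration rather than the exact procedure used here: it assumes the classification has been exported to a tab-separated file with "Feature ID" and "Taxon" columns, that the suspect taxon strings contain "Nanoarchae", and the function names are hypothetical.

```python
import csv


def feature_ids_matching(taxonomy_lines, needle="Nanoarchae"):
    """Collect feature IDs whose taxon string contains needle.

    taxonomy_lines is any iterable of lines from a tab-separated taxonomy
    table with "Feature ID" and "Taxon" columns (assumed layout).
    The match is case-insensitive.
    """
    reader = csv.DictReader(taxonomy_lines, delimiter="\t")
    return {
        row["Feature ID"]
        for row in reader
        if needle.lower() in row["Taxon"].lower()
    }


def split_fasta(fasta_lines, selected_ids):
    """Split FASTA lines into (selected, rest) by sequence ID.

    The ID is the first whitespace-delimited token after ">".
    Blank lines are dropped; multi-line records stay together.
    """
    selected, rest = [], []
    target = rest
    for line in fasta_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            seq_id = line[1:].split()[0]
            target = selected if seq_id in selected_ids else rest
        target.append(line)
    return selected, rest
```

The "selected" records can then be written to a new FASTA file, imported as FeatureData[Sequence], and classified again with --p-read-orientation reverse-complement.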

Thanks!
