Bug when classifying isolate 16S sequences

Nicholas_Bokulich · October 15, 2019, 10:42pm

Hi @laurenmlui,
Very strange. To give me something to work with, would you mind running qiime metadata tabulate on both the full data (~1000 seqs) and the non-buggy test subset (7-10 seqs) and posting the QZVs here?

Are your reads all in the same orientation? What happens when you add --p-read-orientation same (or --p-read-orientation reverse-complement) to the command? I suspect the orientation predictor is going haywire, so setting the correct orientation explicitly may set things straight.

Why do I think that? classify-sklearn should not yield random results. Historically, any time we have seen "random" behavior on subsets of a dataset, it has been due to the orientation predictor interpreting different subsets differently. Sometimes this is because the first 100 or so reads used to predict orientation really are in different orientations from one subset to the next. Sometimes it is because they are in mixed orientations and the predictor just chokes.
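To make that failure mode concrete, here is a toy sketch in Python. It is not the actual classify-sklearn code: the k-mer voting scheme and the names guess_orientation, kmers, and reverse_complement are assumptions made up for illustration. It only shows how a guess based on the first N reads can flip depending on which reads happen to come first.

```python
from collections import Counter

# Translation table for plain A/C/G/T sequences (no ambiguity codes).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k=4):
    """Return the set of k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(reads, reference, k=4, sample_size=100):
    """Hypothetical orientation vote over the first sample_size reads.

    Each read votes for whichever strand shares more k-mers with the
    reference; reads that tie cast no vote.
    """
    ref_kmers = kmers(reference, k)
    votes = Counter()
    for read in reads[:sample_size]:
        fwd = len(kmers(read, k) & ref_kmers)
        rev = len(kmers(reverse_complement(read), k) & ref_kmers)
        if fwd > rev:
            votes["same"] += 1
        elif rev > fwd:
            votes["reverse-complement"] += 1
    if not votes:
        return "undetermined"
    # On a tied vote, the label that received its first vote earlier wins.
    return votes.most_common(1)[0][0]
```

With a scheme like this, a subset whose first reads are mostly reverse-complemented gets a different answer than the full dataset, which is why fixing the orientation by hand removes the apparent randomness.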

Since you are using full-length 16S, the issue might not be that the reads are in mixed orientations, but rather that your 16S sequences are longer than those present in SILVA (e.g., if you are using primers located outside of, or partially overlapping, whatever primers were used to generate the sequences in SILVA). Trimming to the same primer sites used in SILVA might solve this if setting the orientation does not.
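If you want to try the trimming idea quickly, something along these lines could work as a rough sketch. The function names are made up for illustration, and you would need to supply the primer pair yourself, since I am not assuming which primers apply to your reference. It expects reads in the forward orientation, handles IUPAC ambiguity codes in the primers, and allows no mismatches or indels, so treat it as a starting point rather than a replacement for a proper primer-trimming tool.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Complement table that also covers the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a primer with IUPAC codes into a regex pattern string."""
    return "".join(IUPAC[base] for base in primer.upper())


def trim_to_primers(seq, fwd_primer, rev_primer):
    """Return the region between the two primer sites, primers excluded.

    Returns None if either primer site is not found. Matching is exact
    apart from IUPAC degeneracy (an assumption of this sketch).
    """
    seq = seq.upper()
    fwd = re.search(primer_regex(fwd_primer), seq)
    if fwd is None:
        return None
    # The reverse primer anneals to the opposite strand, so look for its
    # reverse complement downstream of the forward primer.
    downstream = seq[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(rev_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]
```

Reads that come back as None have no exact primer match and would need a closer look (or a more tolerant matcher) before you decide to drop them.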