Taxonomy Assignment for Full-Length 16S (Already Denoised)

Nicholas_Bokulich · June 2, 2020, 3:42pm

Hi @Todd_Testerman,
Thanks for sharing your data. A few things:

VSEARCH ERROR: Your FASTA file has Windows-style (CRLF) line endings. Other QIIME 2 plugins can handle these, but VSEARCH blows up when it encounters them, which causes this cryptic error. See here for some other examples: Dereplicated problem - #2 by Nicholas_Bokulich
SILVA BAD CLASSIFICATIONS: This is probably a quirk of SILVA, or possibly of the specific classifier that you were using. We have seen similar issues with SILVA classifiers, in particular unexplained classifications to unknown archaea. The usual cause is unusually short or long sequences being included in the reference sequences. Using extract-reads usually fixes this issue (since it filters out unusually long/short seqs after in silico PCR), but that is not really an option for you since you are working with full-length sequences. I'd recommend filtering out reference sequences that are shorter than expected for full-length 16S and training a fresh classifier.
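As a rough illustration of the length filtering suggested above, here is a minimal Python sketch that works on an exported reference FASTA file. The function names, file names, and the 1200 nt cutoff are all assumptions for the sake of the example, not recommended values; pick a cutoff that makes sense for your reference database. The optional upper bound is there only because unusually long sequences can cause the same problem.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file (multi-line records allowed)."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()  # also strips any stray carriage returns
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_path, out_path, min_len=1200, max_len=None):
    """Write only sequences with min_len <= length (<= max_len, if given).

    min_len=1200 is an illustrative assumption, not a recommended cutoff.
    Returns the number of sequences kept.
    """
    kept = 0
    with open(out_path, "w", newline="\n") as out:  # always write unix-style endings
        for header, seq in read_fasta(in_path):
            if len(seq) < min_len:
                continue
            if max_len is not None and len(seq) > max_len:
                continue
            out.write(f">{header}\n{seq}\n")
            kept += 1
    return kept
```

You would call it as something like filter_by_length("silva_seqs.fasta", "silva_seqs_filtered.fasta") (hypothetical file names), then re-import the filtered sequences and train the classifier on them.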

So to fix your problems:

VSEARCH: export your sequences, convert to Unix-style line endings, then re-import before proceeding with VSEARCH classification
SILVA: clean up the database and train your own classifier
OR you could use the pre-trained Greengenes full-length 16S classifier. I tested this first as a troubleshooting step and the classifications look pretty good. Greengenes has its issues (mainly being 7 years old), but if you can look past those, this would be the fastest way to proceed with an out-of-the-box solution.
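For the line-ending conversion in the VSEARCH fix, any tool that rewrites CRLF as LF will do. Here is a minimal Python sketch, assuming you have already exported your sequences to a plain FASTA file; the function name and file names are hypothetical, and the export and re-import steps themselves are done separately with the usual QIIME 2 tools.

```python
def to_unix_line_endings(in_path, out_path):
    """Copy a file, converting Windows (CRLF) and old Mac (CR) line endings to LF.

    Returns the number of carriage returns found in the input, so 0 means
    the file already had unix-style line endings.
    """
    # Work in binary mode so Python does not translate newlines for us.
    with open(in_path, "rb") as src:
        data = src.read()
    # Replace CRLF first so a lone-CR pass does not double up the newlines.
    converted = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(out_path, "wb") as dst:
        dst.write(converted)
    return data.count(b"\r")
```

For example, to_unix_line_endings("dna-sequences.fasta", "dna-sequences-unix.fasta") (hypothetical file names) writes a cleaned copy that you can then re-import before running the VSEARCH classifier.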

Good luck!