We recently got PacBio CCS reads back and processed them with DADA2 in R (since, as I understand it, QIIME 2 does not yet support denoising of PacBio data), giving us ASVs predominantly 1300-1600 bp in length. I have taken these ASVs in FASTA format and imported them into QIIME 2 2020.2 as rep seqs, along with a BIOM file for my feature table.
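For anyone following along, importing DADA2 outputs like these might look roughly as follows (the filenames here are placeholders, not the poster's actual files):

```shell
# Import the DADA2 ASVs (FASTA) as representative sequences
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path asv-seqs.fasta \
  --output-path rep-seqs.qza

# Import the feature table (assuming a BIOM v2.1 file)
qiime tools import \
  --type 'FeatureTable[Frequency]' \
  --input-path feature-table.biom \
  --input-format BIOMV210Format \
  --output-path feature-table.qza
```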
My question is: can QIIME 2 perform taxonomy assignment on these full-length amplicons? I have tried some of the pre-trained full-length classifiers with the sklearn method, but the taxonomy assignments were poor or seemingly erroneous. Logically I would assume this is possible within QIIME 2, but I cannot find any relevant posts. Are there any special considerations I might be missing? Happy to provide more details for troubleshooting, but first I wanted to hear whether this is even possible.
I am unsure if this question is appropriate for the User Support section based on the description so please feel free to move if not.
Yes, q2-feature-classifier can in principle classify any marker gene, and that includes full-length 16S sequences. I can confirm that classification of full-length 16S seqs works successfully, though I have not personally tested it with PacBio reads.
A few possible issues to consider for troubleshooting:
It sounds like you are using classify-sklearn with pre-trained naive Bayes classifiers. Are the reads in mixed orientations? classify-sklearn currently assumes that all reads are in a single orientation (if ~50% of the classifications are high quality and the other ~50% are low quality, this is almost certainly the cause). You could use classify-consensus-vsearch instead, which can handle mixed-orientation reads.
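A basic classify-consensus-vsearch invocation might look like this (the reference artifact names are placeholders; substitute whichever SILVA seqs/taxonomy artifacts you have):

```shell
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads silva-ref-seqs.qza \
  --i-reference-taxonomy silva-ref-taxonomy.qza \
  --o-classification taxonomy-vsearch.qza
```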
Another possibility is that your reads extend beyond the full-length 16S reference sequences (e.g., primers still attached, or non-standard primers were used), in which case classification (with any classifier) will be impacted by incomplete coverage of your query by the reference sequences. You could try running extract-reads on your sequences with the 27f/1492r primers (this is probably what, e.g., the SILVA database sequences are based on) and see if this improves classification.
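For example, using one common variant of the 27f/1492r primer pair (check that these match the exact primer sequences used in your protocol):

```shell
qiime feature-classifier extract-reads \
  --i-sequences rep-seqs.qza \
  --p-f-primer AGAGTTTGATCMTGGCTCAG \
  --p-r-primer GGTTACCTTGTTACGACTT \
  --o-reads trimmed-seqs.qza
```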
If neither of those help, want to share the QZV boxplots of your data so we can take a look?
I do not believe the reads are in mixed orientations, as I am not observing the 50/50 split in assignments you mentioned. On first attempt with the full-length SILVA classifier, I ended up getting a lot of archaeal assignments. We have performed some classifications outside of QIIME to refer to, so I know these are incorrect. In the interest of space, see this file debug_vsearch.txt (18.4 KB) for the error I receive during VSEARCH classification. Here is the .qzv taxonomy file I get from the sklearn classification with the pre-fitted SILVA classifier: taxonomy-silva-nb.qzv (1.2 MB).
I looked into the primers we use and they are in fact the 27f/1492r pair. I tried running extract-reads anyway and no matches were found (which suggests the primers have already been removed and the reads should fall within the confines of the classifier).
If needed, I can direct message the rep seqs file to you for a quick trial on your end. Which QZV boxplots are you referring to?
Sure, can you please DM the query sequences to me? I will take a look and try reproducing your error locally.
That is a bizarre error... we have only seen it once before on this forum (and the OP never followed up with more details), so this is essentially a new one. The error seems to imply that the output from vsearch is empty or corrupt. I can test locally once I receive your file.
VSEARCH ERROR: This is an issue with your FASTA file having Windows-style line endings. Other QIIME 2 plugins can handle these, but VSEARCH blows up when it encounters them, causing this cryptic error; see here for some other examples: Dereplicated problem
SILVA BAD CLASSIFICATIONS: This looks like it is probably a quirk of SILVA, or perhaps the specific classifier you were using. We have seen similar issues with SILVA classifiers, with unexplained classifications to unknown archaea in particular. This is usually caused by unusually short or long sequences being included among the reference sequences. Using extract-reads usually fixes this issue (since it filters out unusually long/short seqs after in silico PCR), but since that found no matches for you it is not really an option here. I’d recommend filtering out reference sequences that are shorter than expected for full-length 16S and training a fresh classifier.
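One way to do that filtering is with the RESCRIPt plugin (a separate install), followed by the usual naive Bayes training step. A sketch, with placeholder artifact names and length cutoffs that you would want to tune for your reference:

```shell
# Drop reference sequences outside a plausible full-length 16S range
qiime rescript filter-seqs-length \
  --i-sequences silva-ref-seqs.qza \
  --p-global-min 1200 \
  --p-global-max 2000 \
  --o-filtered-seqs silva-ref-seqs-filt.qza \
  --o-discarded-seqs silva-ref-seqs-discard.qza

# Train a fresh naive Bayes classifier on the filtered reference
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-ref-seqs-filt.qza \
  --i-reference-taxonomy silva-ref-taxonomy.qza \
  --o-classifier silva-full-length-nb.qza
```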
So to fix your problems:
vsearch: export your sequences, convert to unix-style line endings, then re-import before proceeding with vsearch classification
SILVA: clean up the database and train your own classifier
OR you could use the pre-trained Greengenes full-length 16S classifier. I tested this first as a troubleshooting step and the classifications look pretty good. Greengenes has its issues (mainly that its last release is around 7 years old), but if you can look past those, this would be the fastest out-of-the-box way to proceed.
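The line-ending fix for vsearch might look something like this (the exported FASTA for a FeatureData[Sequence] artifact is named dna-sequences.fasta; `sed` is used here, but `dos2unix` would work too):

```shell
# Export the rep seqs, strip the carriage returns, and re-import
qiime tools export --input-path rep-seqs.qza --output-path exported
sed -i 's/\r$//' exported/dna-sequences.fasta
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path exported/dna-sequences.fasta \
  --output-path rep-seqs-unix.qza
```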
Wow, thank you, I can definitely say I would never have figured that out! I did the line-ending conversion and VSEARCH classification now seems to be working well. I don’t think this is too off-topic, so I will ask as a follow-up: I am hoping to achieve species-level resolution with these full-length sequences. Are there specific parameters you would recommend for this with VSEARCH? The only non-default parameter I am currently using is a percent-identity cutoff of 0.97, and I am using the SILVA 99% OTUs as the reference. Assignments have been good overall so far, but I want to make sure I’m not missing any additional parameters that might help.
Thanks for the tips on the NB classifiers. I’ll likely try running Greengenes on my end to compare to the VSEARCH results. Yes, unfortunate that it is no longer maintained. Has there been any thought on swapping out Greengenes for a more up to date (but smaller than Silva) reference like RDP?
Again, thank you for the time and assistance. The community really appreciates it!
Do you want species level at any cost, or the right answer? The two aren’t always the same, alas, though with full-length sequences the species labels should be somewhat more reliable than with a single subdomain.
You can see the results in this article, where we benchmarked full-length 16S as well as subdomains. Table 2 shows different settings for optimizing recall vs. precision; this was optimized on V4, I believe, but full-length should not be too dissimilar.
I’ve improved the vsearch classifier since then: try the top-hits-only option, and maybe adjust min-consensus a bit to see what happens. It depends which way you want to go: more species-level classifications, or fewer incorrect classifications.
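Putting those options together, a tuning run might look like this (parameter values here are illustrative starting points, not recommendations; artifact names are placeholders):

```shell
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads silva-ref-seqs.qza \
  --i-reference-taxonomy silva-ref-taxonomy.qza \
  --p-perc-identity 0.97 \
  --p-top-hits-only \
  --p-maxaccepts 10 \
  --p-min-consensus 0.51 \
  --o-classification taxonomy-vsearch.qza
```

Raising min-consensus tends to yield fewer but safer species-level calls; lowering it (toward a bare majority) yields more species-level labels at greater risk of error.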
I’d like to believe that we are not database-biased here. Both Greengenes and SILVA (and GTDB, which I would recommend trying if you want something new, curated, and species-level) happen to be released in formats compatible with QIIME 2, so we link to them on the QIIME 2 website and they get used by most Q2 users. This does not mean that we specifically recommend these databases, or that we are biased against RDP or any other database; just that the others don’t come Q2-compatible out of the box.
RDP does not release a Q2-compatible format as far as I am aware (and we get plenty of questions on this forum about how to re-format the RDP database for use with QIIME 2). Also, RDP does not have species-level information, as far as I recall. So by all means use RDP if you can format it appropriately... but for your use case I’d recommend taking a look at GTDB and seeing what you think!
Thanks for all the support! I’ve continued rolling with vsearch maxaccepts at 1 and percent identity at 0.97. These are actually cultured bacterial isolates from an aquatic environment, and unfortunately, from what I can tell, GTDB skews human-centric. I’ve been having good success so far with SILVA and have been using some of the iTOL functionality (super useful!) to further explore how these ASVs relate to the reference sequences.
Understood on the database front; the Q2-compatible releases make a lot of sense, and I hadn’t realized the other databases aren’t releasing those versions.