Absolute beginner here - I am taking my first steps with QIIME2 and I am so lost (although I am working through the tutorials). Any pointers would be much appreciated!
I have an OTU list (all processed already, ~200 OTUs) in FASTA format that I converted into a QIIME artifact. Now I need to do a taxonomic classification and I'm completely stuck on how to do that.
Would RESCRIPt (?) give me an NCBI (reference?) database? And if yes, how would I proceed from there? I have 12S sequences and expect mainly mammals, birds and fish. Maybe a combination of MitoFish and NCBI?
Sorry for the really basic question! I've gone back to university after many years in applied conservation and bioinformatics is completely new to me.
Once you have your database, you could proceed to analyze your own data as shown for bacteria or fungi in other tutorials, but using a 12S database instead. I have never used 12S myself, though, so there might be some other quirks to 12S analysis, e.g., some parameters may need to be adjusted. Perhaps others on the forum with 12S experience can give their input if there are.
I followed your detailed tutorial for my approach.
The things I adjusted were the taxa (to "txid7711"), the ranks (the default "kingdom phylum class order family genus species"), min_bp (to "150"), and max_bp (to "800").
I trained my classifier and classified my OTUs:
qiime feature-classifier classify-sklearn \
  --i-classifier ncbi-12S-refseqs-classifier.qza \
  --i-reads sequences.qza \
  --o-classification sequences_tax.qza
The results are excellent. But for some OTUs I know for certain that they should be "Gallus gallus", yet they are classified as "Bambusicola thoracicus".
Is there anything I can optimize to get more correct classifications? (starting from get-ncbi-data, filters, etc.)
GenBank contains lots of misannotated sequences, so some quality control would be in order. Removing abnormally short or long seqs would be one option (150-800 nt seems like a wide range! but I don't know the expected range for 12S). A more advanced option would be to build a phylogeny and look for evident misplacements (problematic with short sequences, but hopefully this could at least get genus right), then filter out any misannotations.
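To make the length-filtering idea concrete, here is a rough Python sketch that splits reference sequences into kept and discarded sets by length. The record names, the toy sequences, and the bounds are all made up for illustration; real bounds would depend on the expected 12S amplicon length, which I don't know. This is not how RESCRIPt does it internally, just the general idea.

```python
from statistics import median


def parse_fasta(text):
    """Parse FASTA text into a dict of {record_id: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line.upper()
    return records


def filter_by_length(records, min_bp, max_bp):
    """Split records into (kept, discarded) using inclusive length bounds."""
    kept, discarded = {}, {}
    for rec_id, seq in records.items():
        target = kept if min_bp <= len(seq) <= max_bp else discarded
        target[rec_id] = seq
    return kept, discarded


# Hypothetical toy records; real input would be the exported reference FASTA.
example = """>refA
ACGTACGTAC
>refB
ACG
>refC
ACGTACGTACGTACGTACGTACGTACGT
"""

records = parse_fasta(example)
# Toy bounds chosen only so that one record is too short and one too long.
kept, discarded = filter_by_length(records, min_bp=5, max_bp=20)

print("kept:", sorted(kept))
print("discarded:", sorted(discarded))
# A length summary like this can help decide whether 150-800 is too wide.
print("median length:", median(len(s) for s in records.values()))
```

Looking at the length distribution of the downloaded references first, and only then picking min_bp and max_bp, is probably safer than guessing the range up front.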
How do you know? Do you have simulated sequences or positive controls? Just curious.
I will take a look into building a phylogeny. For now, it seems like a lot of work.
Yes, positive controls
I also tried the classify-consensus-blast classification method. It didn't assign genus and species to as many OTUs as my classifier did, but based on the positive controls the results are more precise.
qiime feature-classifier classify-consensus-blast \
  --i-query sequences.qza \
  --i-reference-taxonomy NCBIdata_12S/taxonomy.qza \
  --i-reference-reads NCBIdata_12S/sequences.qza \
  --o-classification classification.qza \
  --o-search-results searchresults.qza
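For anyone wanting to put numbers on this kind of comparison, here is a small Python sketch that scores each method's species calls against the positive controls. Everything in it is a toy assumption: the OTU IDs, the expected species, the two sets of calls, and the "s__" rank prefix on taxonomy strings (adjust it if your exported taxonomy looks different). It is meant to be fed the taxonomy column exported from each classification artifact.

```python
def get_rank(taxonomy, prefix="s__"):
    """Return the name at the given rank prefix, or None if it is missing."""
    for rank in taxonomy.split(";"):
        rank = rank.strip()
        if rank.startswith(prefix):
            # An empty name such as "s__" counts as unresolved.
            return rank[len(prefix):] or None
    return None


def score(assignments, expected):
    """Count correct, wrong, and unresolved species calls for control OTUs."""
    counts = {"correct": 0, "wrong": 0, "unresolved": 0}
    for otu_id, true_species in expected.items():
        called = get_rank(assignments.get(otu_id, ""))
        if called is None:
            counts["unresolved"] += 1
        elif called == true_species:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts


# Hypothetical positive controls and classifier outputs (toy data only).
expected = {
    "otu1": "Gallus gallus",
    "otu2": "Bos taurus",
    "otu3": "Salmo salar",
}
sklearn_calls = {
    "otu1": "f__Phasianidae; g__Bambusicola; s__Bambusicola thoracicus",
    "otu2": "f__Bovidae; g__Bos; s__Bos taurus",
    "otu3": "f__Salmonidae; g__Salmo; s__Salmo salar",
}
blast_calls = {
    "otu1": "f__Phasianidae",  # consensus stopped at family
    "otu2": "f__Bovidae; g__Bos; s__Bos taurus",
    "otu3": "f__Salmonidae; g__Salmo; s__Salmo salar",
}

print("sklearn:", score(sklearn_calls, expected))
print("blast:  ", score(blast_calls, expected))
```

Separating "wrong" from "unresolved" matters here: a method that stops at family for a hard OTU is less informative but not incorrect, which matches the pattern described above.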
Is there any reason? The database should be the same.
These methods work in quite different ways. The BLAST classifier uses local alignment, followed by consensus classification of the BLAST hits, whereas classify-sklearn is based on k-mer frequencies. If the 12S has repetitive sequences, this could explain why BLAST would be much more precise. You might also try classify-consensus-vsearch to see which works best for you.
Yes, unfortunately, database curation is never an easy task. But the hope is that you do it once and have a database for life (well, maybe a couple of years at least).
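A toy Python sketch of the two ideas, in case it helps to see them side by side. It is not the actual QIIME 2 code: the function names are made up, the hit list is invented, and the k=7 and min_consensus=0.51 values are what I believe the defaults to be, so treat them as assumptions.

```python
from collections import Counter


def kmer_counts(seq, k=7):
    """Count overlapping k-mers; positional information is discarded."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Keep each rank only while enough of the hits agree on the lineage so far."""
    if not hit_taxonomies:
        return "Unassigned"
    lineages = [tuple(r.strip() for r in t.split(";")) for t in hit_taxonomies]
    consensus = ()
    for depth in range(1, max(len(l) for l in lineages) + 1):
        # Tally lineage prefixes of this depth across all hits.
        prefixes = Counter(l[:depth] for l in lineages if len(l) >= depth)
        best, n = prefixes.most_common(1)[0]
        if n / len(lineages) < min_consensus:
            break  # not enough agreement, stop at the previous rank
        consensus = best
    return "; ".join(consensus) if consensus else "Unassigned"


# A repetitive sequence collapses into very few distinct k-mers,
# so a k-mer profile carries little information about it.
repetitive = "ACGT" * 10
print("distinct 7-mers in repeat:", len(kmer_counts(repetitive)))

# Hypothetical top hits for one query (toy data only).
hits = [
    "c__Aves; o__Galliformes; f__Phasianidae; g__Gallus; s__Gallus gallus",
    "c__Aves; o__Galliformes; f__Phasianidae; g__Gallus; s__Gallus sonneratii",
    "c__Aves; o__Galliformes; f__Phasianidae; g__Bambusicola; s__Bambusicola thoracicus",
]
print("consensus:", consensus_taxonomy(hits))
```

With these toy hits the consensus stops at the genus, because no single species reaches the threshold. That is the same trade-off seen in the comparison: fewer species-level calls, but fewer confident wrong ones.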