I was trying to answer the following question: could I create a custom classifier that will classify sequences for a handful of taxa that are of interest to me? As a proof of concept, I started by building a very simple model.
Step-wise (and with brevity in mind) I used an older Greengenes (gg) data set (13_5) and:
1. Using Windows CLI commands, I pulled out a single taxonomic group of interest, along with the associated OTU IDs, from the gg 99_otu_taxonomy.txt file.
2. I then used those OTU IDs to pull all associated fasta sequences from the 99_otu.fasta file. Point of interest: I did not consider my matching strategy carefully enough and ended up pulling additional fasta sequences/OTUs not associated with a specific taxonomy/OTU in the taxonomy file.
3. I then followed the instructions in the tutorial on creating a feature classifier, importing the reference data set (FeatureData[Sequence] - my pared-down fasta file) and the associated taxonomy (FeatureData[Taxonomy] - my even-more-pared-down taxonomy file).
4. I extracted the reference reads from the fasta file using the 515F/806R primer pair to build a custom reference sequence artifact (FeatureData[Sequence]).
5. Lastly, I trained the classifier on my custom ref-seqs (from #4) and my custom reference taxonomy (from #3) using the naive Bayes model.
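For readers following along, the first two steps might look something like this in Python (a sketch, not the poster's actual commands; file layouts assume the Greengenes convention of a tab-delimited `OTU_ID<TAB>taxonomy` file and fasta headers that are bare OTU IDs). Matching on the exact, tab-delimited OTU ID field, rather than a substring anywhere in the line, avoids the pitfall mentioned in step 2:

```python
def select_otu_ids(taxonomy_lines, taxon):
    """Return the set of OTU IDs whose taxonomy string mentions `taxon`."""
    ids = set()
    for line in taxonomy_lines:
        # Split on the first tab: left side is the OTU ID, right side the lineage.
        otu_id, _, taxonomy = line.rstrip("\n").partition("\t")
        if taxon in taxonomy:
            ids.add(otu_id)
    return ids

def extract_fasta(fasta_lines, wanted_ids):
    """Yield (otu_id, sequence) pairs for records whose ID is in wanted_ids."""
    header, seq, keep = None, [], False
    for line in list(fasta_lines) + [">"]:   # sentinel header flushes the last record
        if line.startswith(">"):
            if keep and header is not None:
                yield header, "".join(seq)
            header = line[1:].strip()
            keep = header in wanted_ids      # exact-ID match, not a substring match
            seq = []
        elif keep:
            seq.append(line.strip())
```

The exact-match test (`header in wanted_ids`) is what prevents OTU `12` from being swept up when you asked for OTU `1`.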
Right. I then tested the classifier using a set of sequence data and surprisingly:
The script executed and identified my ‘taxa of interest’ but, even MORE surprisingly,
the output recorded hits for taxa that do not exist in the custom reference taxonomy that I created.
I think about this process in the following way - the classifier aligns the unknown sample sequences against the custom ref-seqs and notes the associated OTUs…it then turns to the custom taxonomy file, pulls the taxonomy data associated with each OTU, and returns that as the output in taxonomy.qzv (along with feature counts/frequencies, etc.). Is this correct? If not, where am I off-base?
I realize there is a lot to unpack here. As always, I appreciate the QIIME team and forum helping me to address my questions.
What is the best way to improve classification and detection of a handful of specific taxa?
I think a custom database may be part of the solution, but there may be easier places to start. Because, as you have discovered, creating a custom database is hard! Heck, even modifying an existing database is wildly hard.
I want to point you to the notes of Tony Walters as he modified the SILVA database v123 to be compatible with QIIME. You would think two ribosomal databases would be largely similar, but no!
Constructing and especially validating a custom database is a serious amount of work, as Colin describes.
On the one hand, there is some precedent for custom databases improving classification in very specific environment types, e.g., fungi in the human intestine. If you are working with very well characterized sample types with simple, known community members, then this may be a reasonably achievable task (but still a lot of work).
On the other hand, you would need to test this new database very rigorously. You would need to understand the sample types very well. You would need to know and carefully curate the reference sequences very well. Folks write whole PhD theses on building reference databases — Rome wasn’t built in a day and neither is a useful reference database.
I do not mean to discourage you, and such things can be done. It looks like the steps that you outline are roughly how one could build a custom reference database. But you do not describe how you would test this database and that is where much of the hard work is. I must stress that testing is key if you want to then use this database and decide that the classifications you receive are trustworthy.
Tell me more! Do you want to find each and every one from a large data set? Do you want to understand their phylogeny or relatedness? Do you want to predict their function? Are you looking for mutations over time?
Give us the gory details. Get specific!
Once we have a super specific set of goals, you could use those as the foundation for your database validation protocol. Or maybe we could come up with an elegant way to measure your key characteristics without starting a second PhD in database validation.
You may find this other current thread interesting. That user is following a similar process to yours, but for getting the brand-new SILVA release into a QIIME format following the notes of Tony Walters, not generating a new reference database. In any case, a thread to keep an eye on regarding steps to build a new database and import it into QIIME2.
Really appreciate all of the feedback - though I did not presume that a task of this nature would be easy, I didn’t think it would be a colossal undertaking. That said, in retrospect, it does make sense…there are whole companies that exist solely to offer their database(s) for use in NGS analytics.
Right. With regards to:
I’m working with data sets generated from paired-end Illumina reads that target the V4 hypervariable region (HVR4) of the 16S rRNA gene. I have worked with the on-board MiSeq 16S Metagenomics analysis and have been comparing its results to the outputs of other analytical pipelines in order to help determine which is the best fit for me.
Essentially, we are interested in the presence and relative abundance (identified sequence counts within a sample, not relative to another sample) of a limited number of taxonomic groups in biological samples. I really don’t care about organisms outside this core group for this study, and it would be ideal for my approach (less noise, quicker analysis) if anything that hit outside of these groups were masked or simply lumped into an ‘Unknown / Unclassified’ category.
Maybe it would be better/easier to approach the problem by looking at methods to filter out the unwanted data post-analysis rather than trying to manipulate a db on the front end? Thoughts? Suggestions on paths forward?
This sounds like a much less involved approach to me. filter-seqs could be used to remove any seqs that have less than X% similarity to a small set of organisms of interest, potentially streamlining downstream analysis.
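To illustrate the filtering idea (this is a toy, not QIIME's implementation - in practice you would reach for an alignment-based tool, e.g. the quality-control plugin or VSEARCH): keep only the query sequences that look sufficiently similar to at least one reference of interest, using difflib's ratio as a crude stand-in for percent identity.

```python
from difflib import SequenceMatcher

def keep_similar(query_seqs, ref_seqs, min_identity=0.9):
    """Return queries whose best crude similarity to any reference
    meets the cutoff.

    query_seqs/ref_seqs: dicts mapping sequence ID -> sequence string.
    SequenceMatcher.ratio() is only a rough proxy for the alignment-based
    percent identity a real tool would compute.
    """
    kept = {}
    for qid, q in query_seqs.items():
        best = max(SequenceMatcher(None, q, r).ratio() for r in ref_seqs.values())
        if best >= min_identity:
            kept[qid] = q
    return kept
```

Everything that fails the cutoff is exactly the ‘Unknown / Unclassified’ bin described above - you simply never carry those reads into the expensive downstream steps.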
Building a database is comparatively easy. Validating it is a much more involved task, and colossal may be an appropriate word depending on the goals/circumstances. The issue is that if you, e.g., created a custom database containing only your organisms of interest, then sequences from other organisms may be misassigned to those taxa, giving you many false positives. So doing this for a well-characterized system is relatively simple, but for very diverse sample types this becomes much more challenging.
What sort of test datasets are you using to determine the best approach? If you already have appropriate simulated/artificial test data, then you can use these to validate a custom database (e.g., to evaluate overall false positive/negative detection). That makes this task much more feasible.
For your specific goal, I bet doing a direct search of all your demultiplexed, dereplicated reads against a small database using VSEARCH or SortMeRNA would be really fast. You could tune the search settings to reach the desired sensitivity by using your test data set.
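A direct search along those lines might look something like this with VSEARCH (a sketch only - filenames are placeholders, and you should check the flags against the VSEARCH manual for your version):

```shell
# Direct global-identity search of dereplicated reads against a small
# custom reference. --id sets the minimum fractional identity for a hit
# (tune it with your test data set); --blast6out writes tabular hits;
# --notmatched collects reads with no hit above the cutoff, i.e. your
# 'Unknown / Unclassified' bin.
vsearch --usearch_global derep-seqs.fasta \
        --db taxa-of-interest.fasta \
        --id 0.90 \
        --strand both \
        --blast6out hits.b6 \
        --notmatched unclassified.fasta
```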
Do you already have a test data set?
If you don’t already have one, all you need to include are:
Positive control samples with just your specific taxa. You can use real samples or make samples computationally.
Negative control samples with none of your taxa of interest. Real or synthetic works well.
Mixed samples with some of your taxa, and also some other reads too.
Then you run that data set through a pipeline and see whether those taxa show up in the right samples. This will give you your false positives and false negatives. Who knows, QIIME 2 might already perform very well.
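Scoring such a run is only a few lines once you have the expected and observed taxa per sample. A sketch (the function name and data layout are made up for illustration):

```python
def detection_errors(expected, observed):
    """Compare expected vs. observed taxa per sample.

    expected/observed: dicts mapping sample ID -> set of taxon labels.
    Returns a dict mapping sample ID -> (false_positive_taxa,
    false_negative_taxa).
    """
    report = {}
    for sample in expected:
        exp = expected[sample]
        obs = observed.get(sample, set())
        # Taxa reported but not spiked in are false positives;
        # taxa spiked in but not reported are false negatives.
        report[sample] = (obs - exp, exp - obs)
    return report
```

Run it over the positive, negative, and mixed control samples described above; an empty report is the green light for the pipeline under test.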