Feature classifier training

Sausage_Mahoney · December 14, 2017, 6:22pm

Hi all-
I was trying to answer the following question using a custom classifier...'could I create a custom classifier that will classify sequences for a handful of taxa that are of interest to me?'. I ended up trying to create a very simple model up front for proof-of-concept.

Step-wise (and with brevity in mind) I used an older GreenGenes (gg) data set (13_5) and:

Using windows CLI commands, pulled out a single phylogeny of interest along with the associated, assigned OTU value from the gg 99_otu_taxonomy.txt file,
I then used those OTU values to pull out all associated fasta files from the 99_otu.fasta file. *Point of interest, I did not consider my strategy here carefully enough and ended up pulling additional .fasta sequences /OTU's not associated with a specific taxonomy/OTU in the taxonomy file,
I then following the instructions in the tutorial regarding creating a feature classifier. Importing the reference data sets (FeatureData[Sequence]' - my rarefied fasta file) and the associated taxonomy (FeatureData[Taxonomy] - my uber-rarefied taxonomy file),
I extracted the reference reads from the fasta file using the 515F/806R primer pair to build a custom reference sequence artifact FeatureData[Sequence],
Lastly, I trained the classifier using my custom ref-seqs (from #4) and my custom reference taxonomy (from #3) using the naive-bayes model

Right. I then tested the classifier using a set of sequence data and surprisingly:

The script executed and identified my 'taxa of interest' but, even MORE surprisingly,
the output recorded hits for taxa that do not exist in the custom reference taxonomy that I created.

I think about this process in the following way - the classifier aligns the unknown, sample sequences against the custom ref-seqs and notes the OTUs associated...it then turns to the custom taxonomy file and pulls the taxonomy data associated with said OTU and returns that as the output in the taxonomy.qzv (along with feature counts/frequency, etc.). Is this correct? If not, where am I off-base?

I realize there is a lot to unpack here. As always, I appreciate the QIIME team and forum helping me to address my questions.

Best
S

colinbrislawn · December 14, 2017, 9:44pm

I think this is a fascinating question!

What is the best way to improve classification and detection of a handful of specific taxa?

I think a custom database may be part of the solution, but there may be easier places to start. Because, as you have discovered, creating a custom database is hard! Heck, even modifying an existing database is wildly hard.

I want to point you to the notes of Tony Walters as he modified the SILVA database v123 to be compatible with qiime. You would think two ribosomal databases would be largely similar, but no!

Tony ends up writing 11 custom scripts to process the data. The notes are +2000 words long. And Tony is not making a database, he's just modifying one.
ftp.microbio.me/emp/release1/otu_info/silva_123/Silva_123_notes.txt

Yep! Database creation: RUN!

I think you are off to a good start. My first thought would be to use vsearch to do closed-ref clustering of all your reads against the handful of selected taxa from 99_otu.fasta. Maybe like this:

vsearch --usearch_global labeled_seqs.fna \
--db my_taxa_gg_99_otu.fasta \
--biomout map99.biom --alnout map99.aln
--strand plus --id 0.99

Let's see what the qiime devs suggest.

Colin

Nicholas_Bokulich · December 15, 2017, 2:38am

Hi @Sausage_Mahoney,
I could not agree with @colinbrislawn more when he says that:

Constructing and especially validating a custom database is a serious amount of work, as Colin describes.

On the one hand, there is some precedence for custom databases improving classification in very specific environment types, e.g., fungi in the human intestine. If you are working with very well characterized sample types with simple, known community members, then this may be a reasonably achievable task (but still a lot of work).

On the other hand, you would need to test this new database very rigorously. You would need to understand the sample types very well. You would need to know and carefully curate the reference sequences very well. Folks write whole PhD theses on building reference databases — Rome wasn't built in a day and neither is a useful reference database.

I do not mean to discourage you, and such things can be done. It looks like the steps that you outline are roughly how one could build a custom reference database. But you do not describe how you would test this database and that is where much of the hard work is. I must stress that testing is key if you want to then use this database and decide that the classifications you receive are trustworthy.

Good luck!

colinbrislawn · December 15, 2017, 5:34am

Thanks for 'qiimeing' in @Nicholas_Bokulich

Let's zoom out. You care about

Tell me more! Do you want to find each and every one from a large data set? Do you want to understand their phylogeny or relatedness? Do you want to predict their function? Are you looking for mutations over time?

Give us the gorey details. Get specific!

Once we have a super specific set of goals, you could use those as the foundation for your database validation protocol. Or maybe we could come up with an elegant way to measure your key characteristics without starting a second PhD in database validation.

Good thread.
Keep in touch,
Colin

Nicholas_Bokulich · December 15, 2017, 5:53pm

@Sausage_Mahoney,
You may find this other current thread interesting. That user is following a similar process to yours, but for getting the brand-new SILVA release into a QIIME format following the notes of Tony Walters, not generating a new reference database. In any case, a thread to keep an eye on regarding steps to build a new database and import it into QIIME2.

Sausage_Mahoney · December 15, 2017, 10:07pm

Hi all-
Really appreciate all of the feedback - though I did not presume that a task of this nature would be easy, I didn't think it would be a colossal undertaking. That said, in retrospect, it does make sense...there are whole companies that exist solely to offer their database(s) for use in NGS analytics.

Right. With regards to:

I'm working with data sets generated from paired-end Illumina reads that target the HVR4 region of the 16S gene. I have tried worked with the on-board MiSeq 16S Metagenomics analysis and have been comparing results from this to outputs of other analytical pipelines in order to help determine which is the best fit for me.

Essentially we are interested in the presence and relative abundance (identified sequence counts within a sample - not relative to another sample) of a limited number of taxonomical groups in biological samples - I really don't care about organisms outside this core group for this study and it would be ideal for my approach (less noise, quicker analysis) if anything that did hit outside of these groups were masked or just plunked into a 'Unknown / Unclassified' category.

Maybe it would be better/easier to approach the problem by looking at methods to filter out the unwanted data post-analysis rather than trying to try to manipulate a db on the front-end? Thoughts? Suggestions on paths forward?

@Nicholas_Bokulich - thanks for the link to the SILVA thread...I'll take a look.

I continue to be impressed with the dedication and quick turn-around-time the QIIME dev team and community provides to help end-users with all manner of questions and issues.

Thanks again,
S

Nicholas_Bokulich · December 15, 2017, 10:24pm

Hi @Sausage_Mahoney,

This sounds like a much less involved approach to me. filter-seqs could be used to remove any seqs that have less than X% similarity to a small set of organisms of interest, potentially streamlining downstream analysis.

Building a database is comparably easy. Validating it is a much more involved task and colossal may be an appropriate word depending on the goals/circumstances. The issue is that if you, e.g., created a custom database containing only your organisms of interest then you may receive many false-positives when you attempt to classify those sequences. So doing this for a well-characterized system is relatively simple, but for very diverse sample types this becomes much more challenging.

What sort of test datasets are you using to determine the best approach? If you already have appropriate simulated/artificial test data, then you can use these to validate a custom database (e.g., to evaluate overall false positive/negative detection). That makes this task much more feasible.

I hope that helps!

colinbrislawn · December 16, 2017, 11:57pm

Good afternoon,

Thank you for sharing your goal (emphasis mine)

For your specific goal, I bet doing a direct search of all your demultiplexed, dereplicated reads against a small database using VSEARCH or SortMeRNA would be really fast. You could tune the search settings to reach the desired sensitivity by using your test data set.

Do you already have a test data set?

If you don't already have one, all you need to include are

Positive control samples with just your specific taxa. You can use real samples or make samples computationally.
Negative control samples with none of your taxa of interest. Real or synthetic works well.
Mixed samples with some of your taxa, and also some other reads too.

Then you run that data set through a pipeline and see if those taxa are in the right samples. This will give you any false positives and false negatives. Who knows, qiime 2 might already perform very well.

Colin

system · January 17, 2018, 5:57am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.