Classifier for Ion Torrent data

Imindias · April 5, 2018, 5:03pm

Dear QIIMERS,
I've finally switched to QIIME2, but as an Ion Torrent user, I'm having some problems.
I've been able to adapt the workflow up to the classification step, but that is where I am running into lots of problems.
First of all, although I'd love to, I can't train my own classifier because the primers of the Ion 16S Metagenomics kit are not public. Or does any user have them?
Then, using your pre-trained classifiers, I only obtain a classification for 70% of my samples with the Greengenes full-length sequences. With the other region it was a total disaster, as expected.
I tried the VSEARCH approach with the Greengenes 99% OTUs file, but after 36 hours of running, I killed the analysis.
Now I am trying the Silva 119 99% OTUs full-length sequences, but I am experiencing a lot of memory problems. Thanks to other posts, I changed the tmp directory to an external device, and the analysis finally seems to be running, but it is taking many hours. I will let you know if it finishes.
So, I would like to know how other users of Ion Torrent data approach feature classification, and whether they obtain good results.
Thanks in advance
Isabel

Nicholas_Bokulich · April 6, 2018, 12:55pm

Hello @Imindias,

If you know that your sequences are 16S, that pre-trained classifier should be good enough. Training a classifier trimmed to the exact primer region only boosts accuracy a little bit for 16S... the 30% of sequences that are not classifying are probably not due to the use of this pre-trained classifier. It is much more likely that these are non-target DNA (e.g., contaminants).

Are your reads in mixed orientations (i.e., both forward and reverse reads on the 16S)? That can cause trouble for classify-sklearn (which must then be trained on mixed-orientation reads) but not for classify-consensus-vsearch or blast. So that could also explain the 30% unclassified (more details below).
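
To make the orientation issue concrete, here is a minimal Python sketch of how one could guess a read's orientation by comparing its k-mers (and those of its reverse complement) against a reference sequence. This is only an illustration, not what QIIME 2 or vsearch does internally; the function names, the k-mer size, and the toy sequences are all made up for the example.

```python
# Illustrative sketch only: a crude k-mer based orientation check.
# Function names, k=8, and the toy sequences are assumptions for this example.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int = 8) -> set:
    """Return the set of all k-mers in the sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_read(read: str, reference: str, k: int = 8) -> str:
    """Return the read in whichever orientation shares more k-mers with the reference."""
    ref_kmers = kmers(reference, k)
    forward_hits = len(kmers(read, k) & ref_kmers)
    reverse_hits = len(kmers(reverse_complement(read), k) & ref_kmers)
    return read if forward_hits >= reverse_hits else reverse_complement(read)


if __name__ == "__main__":
    reference = "ACGGTCATGCCTAAGTTCGGATCCATTGCAGGCTTAACGT"  # made-up sequence
    read_fwd = reference[5:30]
    read_rev = reverse_complement(read_fwd)
    # Both reads should come back in the reference orientation.
    print(orient_read(read_fwd, reference) == read_fwd)
    print(orient_read(read_rev, reference) == read_fwd)
```

A classifier trained only on forward-orientation references never sees the k-mers of reverse-complemented reads, which is why mixed orientations hurt classify-sklearn while alignment-based methods that search both strands are unaffected.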

See this thread. That user is also using Ion Torrent data and it sounds like they ran into similar issues as you, in particular with mixed-orientation reads.

Based on that thread, my advice is to use the vsearch classifier with greengenes 99% OTUs. You can use the --p-threads parameter to run multiple jobs, speeding up this analysis, if your system can support that. Unfortunately, this approach can be time-consuming... in my experience it should take far less than 36 hours to complete on a normal sized run, but perhaps you have a very large dataset? Aligning against full-length 16S sequences will also dramatically increase runtime, since alignment is computationally expensive. So trimming the reference reads to your primers should seriously speed up this analysis.
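
For reference, a vsearch classification run along these lines could be assembled as in the Python sketch below. It only builds and prints the command; it does not run it. The .qza file names and the output directory are placeholders, the helper name is made up, and the flag names are as I recall them from the q2-feature-classifier plugin, so please check `qiime feature-classifier classify-consensus-vsearch --help` for your version.

```python
# Sketch only: build (but do not run) a classify-consensus-vsearch command.
# File names and the output directory are placeholders; flag names are assumed
# to match the installed q2-feature-classifier version (check --help).
import os
import shlex


def build_vsearch_command(query, reference_reads, reference_taxonomy,
                          output_dir, threads=None):
    """Return the command as a list of arguments suitable for subprocess.run."""
    if threads is None:
        # Leave one core free for the rest of the system.
        threads = max(1, (os.cpu_count() or 1) - 1)
    return [
        "qiime", "feature-classifier", "classify-consensus-vsearch",
        "--i-query", query,
        "--i-reference-reads", reference_reads,
        "--i-reference-taxonomy", reference_taxonomy,
        "--p-threads", str(threads),
        "--output-dir", output_dir,
    ]


if __name__ == "__main__":
    cmd = build_vsearch_command(
        query="rep-seqs.qza",                      # placeholder
        reference_reads="gg-99-otus.qza",          # placeholder
        reference_taxonomy="gg-99-taxonomy.qza",   # placeholder
        output_dir="vsearch-classification",       # placeholder
        threads=4,
    )
    print(" ".join(shlex.quote(part) for part in cmd))
    # To run it inside an activated QIIME 2 environment:
    # import subprocess; subprocess.run(cmd, check=True)
```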

I do not know what primers the Ion Torrent 16S kit uses. @MMC_northS do you know? It sounds like you are maybe using the same kit for Ion Torrent sequencing?

I hope that helps!

MMC_northS · April 6, 2018, 1:38pm

Hi @Nicholas_Bokulich and @Imindias
I am not really sure what you mean by the primers of the 16S Metagenomics kit being private or public.

I am using 18S primers for eukaryotes (although they also amplify some 16S sequences from prokaryotes), and my primers are public, taken from a scientific paper. In the same way, you have primers only for 16S, so the primers depend on your study.

On the other hand, there is another 16S kit for the PGM that amplifies more than one fragment rather than just one, and Ion Torrent has specific software to analyze those data, so I do not know how to use it in QIIME (1 or 2).

What primers are you using? What is your objective, @Imindias?
Best,
MMC

Imindias · April 6, 2018, 2:45pm

Thanks so much @Nicholas_Bokulich
The Silva run finally finished and, as expected, it was worse than with Greengenes.
I decided to come back to vsearch and try the cheap approach with the 85% OTUs. It worked much better than the other approaches, so now I know that classify-consensus-vsearch is my option. I am now trying the 97% OTUs and then I'll try the 99% again, but now I know that it takes a couple of days to run. I'll try the multiple jobs option, although I am not sure if my computer can support that... well, I'll try it!
Many many Thanks!!!

Nicholas_Bokulich · April 6, 2018, 2:49pm

Good idea to confirm that it works with 85%. I would recommend 99%, since this will provide the most sensitive classification. Trimming with your primers would really help if you know them... however, from the sound of @MMC_northS's advice there may be a mixture of primers/orientations, in which case you are best off using the full 16S reference with vsearch.
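
To illustrate what trimming a reference to a primer pair means, here is a small Python sketch that cuts out the region between a forward and a reverse primer, treating IUPAC degenerate bases as wildcards. It is a toy under several assumptions: it requires exact matches (no mismatches allowed), the function names are made up, the primers shown are the common 515F/806R V4 pair (not the Ion kit primers, which are unknown here), and the reference is synthetic.

```python
# Toy in-silico primer trimming; exact matching only, names are hypothetical.
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_to_regex(primer):
    """Translate a degenerate primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(reference, fwd_primer, rev_primer):
    """Return the region between the primers (primers excluded), or None."""
    pattern = re.compile(
        primer_to_regex(fwd_primer)
        + "(.+?)"
        + primer_to_regex(reverse_complement(rev_primer))
    )
    match = pattern.search(reference.upper())
    return match.group(1) if match else None


if __name__ == "__main__":
    fwd = "GTGCCAGCMGCCGCGGTAA"    # 515F, used here only as an illustration
    rev = "GGACTACHVGGGTWTCTAAT"   # 806R, used here only as an illustration
    insert = "ACGTACGTAC"          # synthetic region between the primers
    # Synthetic reference: flank + forward site + insert + reverse site + flank.
    reference = ("TTTT" + "GTGCCAGCAGCCGCGGTAA" + insert
                 + "ATTAGATACCCTGGTAGTCC" + "GGGG")
    print(extract_amplicon(reference, fwd, rev))
```

A shorter reference like this means less sequence to align per hit, which is where the runtime saving comes from.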

@MMC_northS Ah, my misunderstanding: I thought perhaps you were also using 16S primers and could help @Imindias. Thank you for your help!

Imindias · April 6, 2018, 2:51pm

Hi @MMC_northS,
thanks for replying. I have 16S sequences generated with the Ion 16S Metagenomics kit, which uses two primer sets: 1) V2-4-8 and 2) V3-6,7-9. I'd love to train a classifier for classify-sklearn on my own regions, but for that I need the primer sequences, which are not published. This is my problem.

As you say, there is specific software to analyze the data in Ion Reporter, but it has a lot of limitations, as you can't enter the experimental groups or any other parameter. With QIIME1, you can use a limited number of scripts with the files from Ion Reporter; I haven't been able to try with QIIME2 yet. That is the next step. I love it, because it uses a curated database and you get most assignments at the species level, but I need to work on that...

Many thanks!

Isabel

Nicholas_Bokulich · April 6, 2018, 3:00pm

The classify-consensus-vsearch method actually has very good accuracy, similar to classify-sklearn. It uses vsearch for global alignment to a reference database (the same one used for training a naive bayes classifier), followed by an LCA consensus assignment in QIIME2 — so it isn't just a basic global alignment classifier.
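
As a rough illustration of the consensus step, here is a simplified Python sketch: given the taxonomy strings of the top alignment hits for one query, it keeps the deepest taxonomy prefix that a large enough fraction of hits agree on. This is my own simplification and not the plugin's actual code; the function name is made up, and the 0.51 threshold is an assumption for the example.

```python
# Simplified sketch of consensus taxonomy assignment over top hits.
# Not the plugin's implementation; name and threshold are assumptions.
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest taxonomy prefix shared by >= min_consensus of hits."""
    if not hit_taxonomies:
        return "Unassigned"
    split_hits = [[rank.strip() for rank in tax.split(";")]
                  for tax in hit_taxonomies]
    consensus = []
    depth = max(len(hit) for hit in split_hits)
    for level in range(1, depth + 1):
        # Count full prefixes so that identical names under different
        # parents are not merged.
        prefixes = Counter(tuple(hit[:level]) for hit in split_hits
                           if len(hit) >= level)
        prefix, count = prefixes.most_common(1)[0]
        if count / len(split_hits) < min_consensus:
            break  # hits disagree too much at this rank; stop here
        consensus = list(prefix)
    return "; ".join(consensus) if consensus else "Unassigned"


if __name__ == "__main__":
    hits = [
        "k__Bacteria; p__Firmicutes; c__Bacilli",
        "k__Bacteria; p__Firmicutes; c__Clostridia",
        "k__Bacteria; p__Firmicutes; c__Bacilli",
        "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    ]
    print(consensus_taxonomy(hits))  # -> k__Bacteria; p__Firmicutes
```

In the example, all four hits agree on the kingdom and three of four on the phylum, but only half agree on the class, so the assignment stops at phylum instead of guessing.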

Honestly, it is probably better to stick with vsearch for your case, where there are two issues to overcome for the sklearn classifier:

1. reads in mixed orientations
2. a mixture of primers. It sounds like these effectively cover most of the 16S rRNA gene, so the pre-trained full-length 16S classifier would be your best bet. You will not be able to capture the slight accuracy boost that comes from training to the specific primer sites.

I hope that helps!