How can I train a classifier for paired-end reads?

Aqleem12 · October 16, 2017, 7:01am

Dear All,
I have a question. I received paired-end reads from the sequencing company. The forward and reverse reads were each truncated at 300 bases. I then used the Greengenes 13_8 (16S rRNA) marker gene database with the following command:
qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer CCTACGGRRBGCASCAGKVRVGAAT \
  --p-r-primer GGACTACNVGGGTWTCTAATCC \
  --p-trunc-len 300 \
  --o-reads ref-seqs.qza
I don't know whether this is right or wrong. Please help me train a classifier for my paired-end reads.

jairideout · October 17, 2017, 5:21pm

Hi @Aqleem12! Taxonomic classification of features is largely the same for single-end vs. paired-end reads. You'll first need to denoise your paired-end sequences with DADA2, which will join your reads for you and produce a feature table and representative sequences. The Atacama soil microbiome tutorial shows how to use dada2 denoise-paired to accomplish this.

Once you have representative sequences (i.e. a FeatureData[Sequence] artifact), you can classify those following the suggestions in the feature-classifier tutorial. As described in that tutorial, you'll need to make sure that the region of the reference sequences you're extracting with feature-classifier extract-reads covers the region of your amplicons.

When classifying representative sequences that came from joined paired-end data, it's recommended not to supply a truncation length (so in your command, you can remove --p-trunc-len 300). That part isn't described in the tutorial; we'll add a note about it in an upcoming release. I created an issue to track progress on that.
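
For reference, here is a sketch of the command from the question with the truncation length removed. The file names and primers are the ones from the original post. The `run` helper and its `RUN` variable are not part of QIIME 2; they are a hypothetical dry-run wrapper, added so the command can be reviewed before it is executed.

```shell
# Hypothetical dry-run helper (not part of QIIME 2): prints the command,
# and executes it only when RUN=1 is set.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

# extract-reads from the question, without --p-trunc-len, because the
# representative sequences come from joined paired-end reads.
run qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer CCTACGGRRBGCASCAGKVRVGAAT \
  --p-r-primer GGACTACNVGGGTWTCTAATCC \
  --o-reads ref-seqs.qza
```

Run as written, this only prints the command. Setting `RUN=1` before the `run` call executes it.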

Aqleem12 · October 17, 2017, 9:51pm

Dear Sir,

Thank you for your reply. I denoised with DADA2 as follows:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --o-table table \
  --o-representative-sequences rep-seqs \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 300 \
  --p-trunc-len-r 300

Now I want to train a classifier on the 99% OTUs from Greengenes (13_8, the most recent release), using the company's primers:
qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer CCTACGGRRBGCASCAGKVRVGAAT \
  --p-r-primer GGACTACNVGGGTWTCTAATCC \
  --p-trunc-len 600 \
  --o-reads ref-seqs.qza
It has now been running for two days and the process is still not finished. I don't know what to do or where I am making a mistake. If you could help me, I would be very thankful.

jairideout · October 18, 2017, 11:49pm

Please see my previous reply about trunc-len: when classifying representative sequences that came from joined paired-end data, it's recommended not to supply a truncation length.

Besides the trunc-len issue noted above, the command looks correct, assuming those are the correct primers. Since you're extracting reads from 99% Greengenes reference sequences, this could take a while to complete (I don't have a time estimate unfortunately).

Is the process still running? How much memory (RAM) do you have, and how much is being used by the process?

I also recommend trying out extract-reads on a smaller Greengenes file, such as the 85% reference sequences. That should complete relatively quickly and will let us know if extract-reads is working at all with your primers or if something else is going on (e.g. an install/deployment issue).

Aqleem12 · October 19, 2017, 5:07am

Dear Sir,

Thanks for your reply. I ran extract-reads with the primers on the 85% Greengenes reference sequences, and it worked well and finished quickly. However, genus and species were absent with the 85% Greengenes sequences. When I ran the primers against the 99% Greengenes sequences, the job ran for almost two days, then got disconnected and did not give me the required results. I have been retrying the 99% Greengenes run for the past week, but it keeps failing. As for the memory, I don't know, because I connect from my laptop to a lab server. The company says the primers will work well with the SILVA 128 database. However, I don't know which commands to use to train a classifier with SILVA, or what the strategy is for --p-trunc-len in that case, i.e. whether I should remove it or not. Also, the SILVA 128 download contains something called rep_set on one hand and taxonomy on the other, which makes it complicated to get .qza files. If you could help and assist me further, I would be thankful.

Best regards,
Aqleem Abbas

jairideout · October 19, 2017, 5:02pm

That's great news, thanks for trying that out!

Since you're running these commands on a remote server, you can lose connection for a number of reasons (e.g. closing your laptop lid, turning off your computer, disconnecting from the network, etc.). If you lose connection your job will be terminated, so that's probably what's happening.

I recommend using a program such as screen, which will allow you to create a session that you can reconnect to if your connection is lost, and your jobs will keep running in the meantime. We don't develop the screen tool so I can't provide support for that. There are various tutorials on the internet, or you could check with your server administrator if you need help with screen. They may have other suggestions for keeping long-running jobs alive.
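
As a rough sketch of that workflow, assuming screen is installed on the server: the session name `extract-reads` is only an example, and `sleep 5` stands in for the real QIIME 2 command.

```shell
# Example session name; any name works.
session=extract-reads

if command -v screen >/dev/null 2>&1; then
  # -d -m starts the session detached and -S names it, so the job keeps
  # running on the server even if the SSH connection drops.
  # 'sleep 5' is a stand-in: replace it with your qiime command.
  screen -dmS "$session" sh -c 'sleep 5' \
    || echo "could not start a screen session" >&2

  # List sessions to confirm it exists (screen -ls may exit non-zero).
  screen -ls || true
else
  echo "screen is not installed; ask your server administrator" >&2
fi

# After logging back in, reattach with:   screen -r extract-reads
# While attached, detach again with:      Ctrl-a, then d
```

Starting the session detached means nothing depends on the terminal you launched it from. You only reattach when you want to check on progress.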

You can use a program such as top (available on most *nix systems) or htop (much nicer interface) to monitor memory and CPU usage on the remote server. Once you're logged into the server you can use one of those tools to monitor system resources while the job is running. We also don't develop top or htop so can't provide specific support for those tools here.

Training with SILVA is basically the same process as training with Greengenes. You can follow along with the feature-classifier tutorial, substituting the SILVA files for the Greengenes files.

Since you'll be classifying features derived from paired-end data, you shouldn't supply a trunc-len with SILVA either.

After downloading and extracting the SILVA reference database, you'll want to use the files in the SILVA_128_QIIME_release/rep_set/ and SILVA_128_QIIME_release/taxonomy/ folders. You can choose from 16S-only, 18S-only, or 16S+18S. You can also choose the percent identity (e.g. 99% OTUs) of interest.
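
Putting those pieces together, a sketch of the SILVA workflow might look like the following. Several things here are assumptions rather than confirmed details: the exact file names inside the rep_set/ and taxonomy/ folders (check your download), the output file names, and the `--input-format` flag, which older QIIME 2 releases, including those current at the time of this thread, called `--source-format`. The `run` helper is a hypothetical dry-run wrapper, not part of QIIME 2.

```shell
# Hypothetical dry-run helper (not part of QIIME 2): prints the command,
# and executes it only when RUN=1 is set.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

# Assumed paths inside the extracted archive (16S only, 99% OTUs).
# Verify the actual file names in your download before running.
silva=SILVA_128_QIIME_release
seqs="$silva/rep_set/rep_set_16S_only/99/99_otus_16S.fasta"
tax="$silva/taxonomy/16S_only/99/majority_taxonomy_7_levels.txt"

# Import the reference sequences as a FeatureData[Sequence] artifact.
run qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path "$seqs" \
  --output-path silva-99-seqs.qza

# Import the taxonomy (tab-separated, no header line).
# On older releases, use --source-format instead of --input-format.
run qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path "$tax" \
  --output-path silva-99-taxonomy.qza

# Extract the amplicon region with the primers from this thread,
# without --p-trunc-len because the data are joined paired-end reads.
run qiime feature-classifier extract-reads \
  --i-sequences silva-99-seqs.qza \
  --p-f-primer CCTACGGRRBGCASCAGKVRVGAAT \
  --p-r-primer GGACTACNVGGGTWTCTAATCC \
  --o-reads silva-99-ref-seqs.qza

# Train the naive Bayes classifier on the extracted reads.
run qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-99-ref-seqs.qza \
  --i-reference-taxonomy silva-99-taxonomy.qza \
  --o-classifier silva-99-classifier.qza
```

Run as written, this only prints the four commands so you can check the paths. Setting `RUN=1` near the top executes them in order.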

jairideout · October 20, 2017, 4:32pm

An off-topic reply has been split into a new topic: What is the difference between Greengenes and SILVA?

Please keep replies on-topic in the future.

jairideout · October 20, 2017, 4:35pm

I split off @Aqleem12's latest reply into a new forum topic because there's a question in there that is better suited for its own topic. Following up here to confirm that @Aqleem12 was able to successfully extract reads from the 99% Greengenes reference sequences; it took around 12 hours to complete.

system · November 20, 2017, 10:36pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.