Parameters for training a classifier

Dear developers,

I installed qiime2-2019.1 with conda.

My sequencing strategy is PE300 for the V4 region, using ArBa515F (5’-GTGCCAGCMGCCGCGGTAA-3’) and Arch806R (5’-GGACTACVSGGGTATCTAAT-3’).

As a newcomer to bioinformatics, I want to use the pre-trained classifiers (Silva 132 99% OTUs from the 515F/806R region of sequences, and Greengenes 13_8 99% OTUs from the 515F/806R region of sequences) from the QIIME 2 data resources.

Are the above pre-trained classifiers suitable for my sequencing strategy? If not, could you please provide some information about choosing the right parameters for training a classifier?

Thanks in advance!

Best regards,

Hao

No, those will probably not work because they use a different primer set from yours, so they may not cover the archaea that your primers are designed to amplify. The forward primer is identical, but the reverse primer has some different degenerate bases, so it may give slightly different results.
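To see exactly where the two reverse primers differ, a small shell sketch like the one below can help. It assumes the pre-trained 515F/806R classifiers were built with the 806R primer GGACTACHVGGGTWTCTAAT (the sequence used in the QIIME 2 classifier-training tutorial; it is not quoted in this thread). The function names expand and compare_primers are made up for illustration.

```shell
# Expand an IUPAC nucleotide code into the bases it stands for.
expand() {
  case "$1" in
    A|C|G|T) echo "$1" ;;
    M) echo AC ;; R) echo AG ;; W) echo AT ;; S) echo CG ;;
    Y) echo CT ;; K) echo GT ;; V) echo ACG ;; H) echo ACT ;;
    D) echo AGT ;; B) echo CGT ;; N) echo ACGT ;;
    *) echo "?" ;;
  esac
}

# Print every position (1-based) where two equal-length primers differ.
compare_primers() {
  a=$1; b=$2; i=1
  while [ "$i" -le "${#a}" ]; do
    x=$(printf '%s' "$a" | cut -c "$i")
    y=$(printf '%s' "$b" | cut -c "$i")
    if [ "$x" != "$y" ]; then
      echo "pos $i: $x [$(expand "$x")] vs $y [$(expand "$y")]"
    fi
    i=$((i + 1))
  done
}

ARCH806R="GGACTACVSGGGTATCTAAT"   # reverse primer from the original post
STD806R="GGACTACHVGGGTWTCTAAT"    # ASSUMED 806R behind the pre-trained classifiers

compare_primers "$ARCH806R" "$STD806R"
```

With these two sequences the comparison reports differences at positions 8, 9 and 14, all of which involve a degenerate base on at least one side.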

I recommend training your own classifier. Just use default parameter settings and use one of the databases here.
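For reference, training with default settings might look roughly like the sketch below. It follows the general shape of the QIIME 2 classifier-training tutorial: import the reference sequences and taxonomy, extract the region bounded by your primers, then fit a naive Bayes classifier. The file names (ref-otus.fasta, ref-taxonomy.txt and the .qza outputs) are placeholders, and the run and train_classifier helpers are hypothetical wrappers that only print the commands unless RUN=1 is set.

```shell
# Print each command; also execute it when RUN=1 is set in the environment.
run() {
  echo "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# $1 = forward primer, $2 = reverse primer.
# Steps: import sequences, import taxonomy, extract the amplified region,
# train the naive Bayes classifier with default parameters.
# File names are placeholders; point them at your own reference files.
train_classifier() {
  run qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path ref-otus.fasta \
    --output-path ref-seqs.qza &&
  run qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-format HeaderlessTSVTaxonomyFormat \
    --input-path ref-taxonomy.txt \
    --output-path ref-taxonomy.qza &&
  run qiime feature-classifier extract-reads \
    --i-sequences ref-seqs.qza \
    --p-f-primer "$1" \
    --p-r-primer "$2" \
    --o-reads ref-seqs-v4.qza &&
  run qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ref-seqs-v4.qza \
    --i-reference-taxonomy ref-taxonomy.qza \
    --o-classifier classifier.qza
}

# Primers from the original post.
train_classifier GTGCCAGCMGCCGCGGTAA GGACTACVSGGGTATCTAAT
```

Run as-is, this only prints the four commands so they can be checked first. Setting RUN=1 executes them in order and stops at the first failure. The printed form drops the quotes around the --type values, so quote them again if you paste the commands into a shell by hand.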

Good luck!


Thank you for your prompt answer.

Which FASTA file (from rep_set or rep_set_aligned) do you recommend for training the classifier? The tutorial seems to use the 85_otus.fasta file from ‘rep_set’.

Thank you in advance!

See the note in that tutorial. Use 99_otus.fasta.

Sorry for my unclear statement.

There are two 99_otus.fasta files, one in the rep_set folder and one in rep_set_aligned. Which is better?

Thanks in advance.

Best regards,

Do not use aligned sequences for this. Use the file from rep_set.


New_level-1.csv (259 Bytes)
Old_level-1.csv (262 Bytes)
Dear Nicholas Bokulich,

Please see the attached files.

Old_level-1 was classified with your pre-trained classifier from the data resources page (https://docs.qiime2.org/2019.1/data-resources/), and the proportion of unassigned reads was very high (36%~46%). Following your advice, I trained a new classifier using my own primers 515F and 806R, the silva_132_99_16S.fna in SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99, the taxonomy_7_levels.txt in SILVA_132_QIIME_release/taxonomy/16S_only/99, and --p-max-length 300, but the proportion of unassigned reads was still high (12%~34%).

The sequencing strategy was PE300 for the V4 region, and the raw data were denoised with DADA2:

    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs demux.qza \
      --p-trim-left-f 0 \
      --p-trim-left-r 0 \
      --p-trunc-len-f 300 \
      --p-trunc-len-r 270 \
      --o-table table.qza \
      --o-representative-sequences rep-seqs.qza \
      --o-denoising-stats denoising-stats.qza

I think there is enough overlap for merging the reads.

Do you think this result is OK, or should I change some parameters, such as using the consensus or majority taxonomy file instead?

Thanks in advance.

Hao

That is quite normal; non-target DNA can be an issue. Use the forum search bar to find other posts about this, the types of non-target DNA that others have detected, and steps to double-check these results and determine what these reads may be in your own data. 50% or more unassigned is a problem that is most likely due to using the wrong classifier or a similar technical issue. Less than 50% is probably normal and not something to worry about.
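One possible way to double-check what the unassigned reads are is to pull them out of the representative sequences and export them as FASTA, so that a few can be searched by hand against a public database. The sketch below assumes the taxonomy artifact is named taxonomy.qza and that unassigned features carry the label "Unassigned"; both names, the output names, and the run and inspect_unassigned helpers are illustrative rather than taken from this thread.

```shell
# Print each command; also execute it when RUN=1 is set in the environment.
run() {
  echo "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# $1 = representative sequences, $2 = taxonomy assignments for them.
# Keeps only the sequences whose taxonomy contains "Unassigned",
# then exports them to a directory as FASTA.
inspect_unassigned() {
  run qiime taxa filter-seqs \
    --i-sequences "$1" \
    --i-taxonomy "$2" \
    --p-include Unassigned \
    --o-filtered-sequences unassigned-seqs.qza &&
  run qiime tools export \
    --input-path unassigned-seqs.qza \
    --output-path unassigned-seqs
}

# taxonomy.qza is an assumed file name.
inspect_unassigned rep-seqs.qza taxonomy.qza
```

As written this only prints the two commands; set RUN=1 to execute them. The exported FASTA file in the unassigned-seqs directory can then be inspected, for example to see whether the sequences look like host, mitochondrial, or other non-target DNA.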
