classifier training

ranxx005 · February 10, 2020, 9:13pm

Hi,
I am using the 515F and 806R primers for my experiment, and I have 2x250 reads. However, the reverse reads have very low quality.

1) I am thinking of using only the forward reads. If so, after trimming my primers I only have 230 bp for taxonomy assignment. Do I need to train my classifier? If so, how, since "qiime feature-classifier extract-reads" requires two primers?

2) If I continue to use the paired-end reads, how can I relax the quality control in DADA2 so that I do not lose so many reads? Right now I only truncate, and in theory I have at least 48 bp of overlap, with the command below:

qiime dada2 denoise-paired --i-demultiplexed-seqs demux-primer-trimmed-end_0.1.qza --p-trunc-len-f 212 --p-trunc-len-r 106 --p-trim-left-f 0 --p-trim-left-r 0 --p-n-threads 10 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza

I am losing 35% of my reads on average with DADA2, and the major loss is at the merging step. I have also attached my qzv file for the F and R reads after removing primers.

Mehrbod_Estaki · February 10, 2020, 11:03pm

Hi @ranxx005,

Your reverse reads don't look too bad to me!

It would be better if you did train your classifier.

Use your primer set as you normally would with paired-end data, then add the --p-trunc-len parameter set to 230 (or whatever length you end up truncating your forward reads to).
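As a rough sketch of what that could look like, assuming the reference sequences and taxonomy have already been imported as QIIME 2 artifacts: the file names below are hypothetical placeholders, and the primer sequences are the commonly used 515F/806R pair, which may not match the exact variants used in this experiment. The small run helper is only there so the commands are printed first and executed only when QIIME 2 and the input files are actually present.

```shell
#!/bin/sh
# Sketch only: all file names are hypothetical placeholders.
REF_SEQS="ref-seqs.qza"          # reference sequences, already imported
REF_TAX="ref-taxonomy.qza"       # matching reference taxonomy
F_PRIMER="GTGCCAGCMGCCGCGGTAA"   # assumed 515F sequence; use your own
R_PRIMER="GGACTACHVGGGTWTCTAAT"  # assumed 806R sequence; use your own
TRUNC_LEN=230                    # forward read length kept for classification

# Print each command; run it only if QIIME 2 and the inputs are available.
run() {
    echo "+ $*"
    if command -v qiime >/dev/null 2>&1 && [ -f "$REF_SEQS" ] && [ -f "$REF_TAX" ]; then
        "$@"
    fi
}

# Extract the amplicon region with both primers, then cut it to the
# length of the forward reads that will actually be classified.
run qiime feature-classifier extract-reads \
    --i-sequences "$REF_SEQS" \
    --p-f-primer "$F_PRIMER" \
    --p-r-primer "$R_PRIMER" \
    --p-trunc-len "$TRUNC_LEN" \
    --o-reads ref-seqs-515f-806r-230.qza

# Train a naive Bayes classifier on the extracted, truncated reads.
run qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ref-seqs-515f-806r-230.qza \
    --i-reference-taxonomy "$REF_TAX" \
    --o-classifier classifier-515f-806r-230.qza
```

Both primers are still passed to extract-reads; the truncation is what makes the extracted reference reads resemble forward-only reads.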

Sounds like you just might need to truncate a bit less. I think you're truncating ~182 bp. Try reducing the total truncation a bit.
This is how I would personally do the calculation:
Total read length: 2 x 250 = 500
Your region: 806 - 515 = 291
Overlap: 500 - 291 = 209
Max truncation: 209 - 12 (min overlap required by DADA2) - 20 (random size variation in that region) = 177
Theoretically, you should be able to truncate up to 177 bp (combined between both reads) without risking merging issues. Your 182 bp is slightly over this, so it is worth truncating a bit less. If that doesn't solve your issue, it may be that you just have some shorter reads for some reason; maybe they are sequencing artifacts, chimeras, host contamination, etc. How does the length distribution look in these?
By the way, a 35% loss is not a bad DADA2 result; this is about what I would expect to see generally, give or take a bit.
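The truncation budget above can be sketched as a few lines of shell arithmetic, using only the numbers from this thread. The variable names are made up for illustration, and the 20 bp allowance for length variation is a rule of thumb rather than a DADA2 setting.

```shell
#!/bin/sh
# Numbers taken from this thread; variable names are illustrative only.
READ_LEN=250        # 2x250 sequencing
F_POS=515           # forward primer position
R_POS=806           # reverse primer position
MIN_OVERLAP=12      # minimum overlap DADA2 needs to merge a pair
LEN_VARIATION=20    # allowance for size variation in the region
TRUNC_LEN_F=212     # --p-trunc-len-f from the command above
TRUNC_LEN_R=106     # --p-trunc-len-r from the command above

total_len=$((2 * READ_LEN))                            # 500
region=$((R_POS - F_POS))                              # 291
overlap=$((total_len - region))                        # 209
max_trunc=$((overlap - MIN_OVERLAP - LEN_VARIATION))   # 177

# Bases currently removed from both reads combined.
current_trunc=$(( (READ_LEN - TRUNC_LEN_F) + (READ_LEN - TRUNC_LEN_R) ))

echo "max combined truncation: $max_trunc bp"
echo "current combined truncation: $current_trunc bp"
if [ "$current_trunc" -gt "$max_trunc" ]; then
    echo "over budget by $((current_trunc - max_trunc)) bp"
fi
```

With these inputs the current settings remove 182 bp against a budget of 177 bp, so raising the two truncation lengths by about 5 bp in total would bring them back within it.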
