When truncated reads are shorter than classifier reference seqs

biojack · October 4, 2022, 5:17pm

Hi guys

Suppose I trained a 16S V4 classifier on reference sequences of length 250 (the amplicon is 515F–806R, so after truncating to 250 nt the region should be roughly 515 to 765).

When I got the raw forward reads from the lab (amplified with the same primers), I discovered that the first 20 nucleotides are of bad quality, so I need to drop them. So in fact I will be giving the classifier reads of length 230, not 250.

How much bias will that introduce in the end? I used the standard classifier parameters, such as a k-mer length of [7, 7], for training.

Another interesting question: would it be valid to use just the reverse reads with such a classifier? For example the full read (806 to 515) or a truncated one (like 806 to 556).

SoilRotifer · October 4, 2022, 10:51pm

In my experience, there should not be any significantly noticeable difference if your reads are slightly shorter than your reference. When merging becomes problematic, many will simply map their forward, or reverse, read to their amplicon-region reference w/o issue.
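To illustrate why a slightly shorter read need not hurt much, here is a minimal sketch. It assumes the classifier treats a read as a bag of overlapping 7-mers (my reading of the [7, 7] setting mentioned above) and uses a random toy sequence rather than real 16S data. The function names are made up for the example.

```python
import random


def kmers(seq, k=7):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def n_windows(length, k=7):
    """Number of k-mer windows in a sequence of the given length."""
    return max(0, length - k + 1)


# Toy 250 nt "reference" (random, NOT real 16S sequence).
random.seed(0)
reference = "".join(random.choice("ACGT") for _ in range(250))

# Drop the first 20 nt, as with the low-quality start of the forward read.
read = reference[20:]

# Every k-mer in the trimmed read also occurs in the reference.
all_shared = kmers(read) <= kmers(reference)

print(n_windows(len(reference)), n_windows(len(read)), all_shared)
```

With these numbers the trimmed read keeps 224 of the 244 k-mer windows, and all of them still fall inside the region the classifier was trained on.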

Every data set is different. Sometimes, trimming your references down to your amplicon region helps, and other times there is no benefit. You could test this out yourself. That is, train and use the typical V4 classifier, and then train and use another V4 classifier that is trimmed, then compare them.
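One rough way to do that comparison is to measure how often the two classifiers give the same label to the same feature. This is only a sketch: the feature IDs and taxon labels below are invented for illustration, and in practice you would load them from the two exported taxonomy tables.

```python
def agreement(tax_a, tax_b):
    """Fraction of shared feature IDs that received the same label."""
    shared_ids = tax_a.keys() & tax_b.keys()
    if not shared_ids:
        return 0.0
    same = sum(tax_a[fid] == tax_b[fid] for fid in shared_ids)
    return same / len(shared_ids)


# Invented example assignments (feature ID -> taxon label).
full_length = {
    "feat1": "g__Bacillus",
    "feat2": "g__Pseudomonas",
    "feat3": "f__Lachnospiraceae",
}
trimmed = {
    "feat1": "g__Bacillus",
    "feat2": "g__Pseudomonas",
    "feat3": "g__Blautia",
}

print(f"{agreement(full_length, trimmed):.2f}")
```

Features where the two disagree are the ones worth inspecting by hand, since a different label is not automatically a better one.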

biojack · October 5, 2022, 3:58am

Thank you, Mike

But what do you think about the reverse read in my example? If the classifier was trained on the 515 to 765 region, is it correct to classify a reverse read (like 806 to 515)? What I mean is that the read contains a region, 806 to 766, that was not used in classifier training. Or is that not necessarily a problem either, and it depends on the dataset?

As I understand it, the first ~40 nucleotides of the reverse read should be trimmed off. I just wonder whether that reasoning is correct.
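The arithmetic behind that ~40 figure can be sketched as follows. It treats the positions as plain inclusive integer coordinates, which is a simplification, and the helper name is made up for the example.

```python
def reverse_read_overlap(read_start, read_len, ref_lo, ref_hi):
    """Return (covered, uncovered) position counts for a reverse read.

    The read starts at read_start and runs towards lower coordinates,
    so it spans [read_start - read_len + 1, read_start] inclusive.
    """
    read_lo = read_start - read_len + 1
    covered = max(0, min(read_start, ref_hi) - max(read_lo, ref_lo) + 1)
    return covered, read_len - covered


# Reverse read of 250 nt starting at 806; reference trained on 515..765.
covered, uncovered = reverse_read_overlap(806, 250, 515, 765)
print(covered, uncovered)
```

Under these assumptions 41 positions (766 to 806) lie outside the trained region and 209 lie inside it.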

SoilRotifer · October 5, 2022, 1:32pm

I think the issue with using reverse reads, at least with feature-classifier classify-sklearn, is that you would need to reverse complement the reference database before training the classifier. But I think you would not need to do this if using classify-consensus-vsearch.

You can reverse complement the reference database by using qiime rescript orient-seqs and leaving the --i-reference-sequences flag empty.
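For anyone unsure what reverse complementing does to each sequence, here is a minimal plain-Python sketch. It is not how RESCRIPt implements it, only an illustration of the operation, and the example sequence is arbitrary.

```python
# Complement table covering the IUPAC nucleotide codes (uppercase).
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


example = "GTGCCAGC"  # arbitrary short sequence
print(reverse_complement(example))  # GCTGGCAC
```

Applying the function twice returns the original sequence, which is a quick sanity check.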

-Mike
