Sequence length after denoise paired-end reads

kyle_paddock · September 22, 2022, 4:18pm

Hi,

I have a question regarding sequence length after denoising. I have amplified the V4 region using 515F-806R primers. I trimmed primers and adapter sequences, then denoised with truncation lengths chosen from the quality scores in the demux summary, all in QIIME 2 2022.8.

qiime cutadapt trim-paired \
	--i-demultiplexed-sequences paired-end-demux.qza \
	--p-adapter-f "ATTAGAWACCCBDGTAGTCC" \
	--p-front-f "GTGCCAGCMGCCGCGGTAA" \
	--p-adapter-r "TTACCGCGGCKGCTGGCAC" \
	--p-front-r "GGACTACHVGGGTWTCTAAT" \
	--p-no-indels \
	--p-discard-untrimmed \
	--o-trimmed-sequences demux-trim-primers.qza

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trim-primers.qza \
  --p-trunc-len-f 231 \
  --p-trunc-len-r 209 \
  --o-table feature.table.qza \
  --o-representative-sequences denoise-seqs.qza \
  --o-denoising-stats denoise-stats.qza

These parameters resulted in a fairly high retention of my sequences after filtering, denoising, and merging (~86%). I then tabulated the sequences to see what the lengths were and was surprised to see a range of 231-420 bp. I'm under the impression they should be around 253 bp. Granted, my mean length is 253. Is there a parameter I did not include to trim merged sequences to the correct length? Am I joining forward and reverse reads in incorrect places?

There was a topic similar to this but it was unclear if there should be concern with sequences that are too long.

Thanks!

lizgehret · September 27, 2022, 5:42pm

Hi @kyle_paddock,

Welcome to the QIIME 2 forum!

This is a great question - and your impression is correct, your sequence lengths should be around 253 bp. A couple of good things to point out with your data set: your mean length is 253.11, which is right where it should be! Additionally, the standard deviation is only about 3 bp, which means that most of your data is sitting right around that 253 bp mark.

However, you do seem to have some outliers on both ends (the min and max lengths) - especially that max length, which is much longer than the V4 region. The short answer is that these lengths can be related to non-target DNA, which you will most likely want to filter out unless they are of interest to you. You could try filtering out anything that's shorter than 240 or longer than 255 and see what the statistics look like after that; that should remove the outliers you're seeing.
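As a rough sanity check, the observed 231-420 bp range is about what the truncation settings allow, so the reads are probably not being joined in the wrong place. Here is a minimal sketch of the arithmetic. It assumes a minimum merge overlap of 12 bases (DADA2's default, as far as I know; it is not stated in this thread):

```shell
#!/usr/bin/env bash
# Truncation lengths from the denoise-paired command in this thread.
trunc_f=231
trunc_r=209
# Expected V4 amplicon length after primer removal (from the thread).
target=253
# ASSUMPTION: minimum overlap required to merge a read pair (DADA2 default).
min_overlap=12

# Overlap between the truncated reads for an on-target amplicon.
overlap=$(( trunc_f + trunc_r - target ))
# Longest merged sequence that still satisfies the minimum overlap.
max_merged=$(( trunc_f + trunc_r - min_overlap ))

echo "overlap at ${target} bp: ${overlap}"
echo "longest possible merge: ${max_merged}"
```

Under these assumptions an on-target pair overlaps by 187 bases, and anything up to 428 bp can still merge. At the low end, forward reads shorter than the 231 truncation length are discarded, so merged sequences shorter than about 231 bp should not appear.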

This is a great forum post that goes into more detail on this situation, for your reference.

Hope this helps! Cheers

kyle_paddock · October 13, 2022, 6:31pm

Thanks @lizgehret !

I have since filtered my sequences to include 240 < seq length < 255. However, I'm a little confused about how to regenerate a feature table with these new sequences.
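For readers following along, a length window like this can be sketched on a plain FASTA file, for example one exported from the representative sequences artifact. This is only an illustration with made-up demo records; the function name `filter_fasta_by_length` is hypothetical and not part of QIIME 2:

```shell
#!/usr/bin/env bash
# Keep FASTA records whose sequence length is strictly between min and max.
# Sequences wrapped over several lines are joined before measuring.
filter_fasta_by_length() {
  local min="$1" max="$2" fasta="$3"
  awk -v min="$min" -v max="$max" '
    function emit() {
      if (header != "" && length(seq) > min && length(seq) < max)
        print header "\n" seq
    }
    /^>/ { emit(); header = $0; seq = ""; next }
         { seq = seq $0 }
    END  { emit() }
  ' "$fasta"
}

# Helper for demo data only: print N copies of the base "A".
repeat_base() { printf "%${1}s" "" | tr ' ' 'A'; }

# Hypothetical demo input: one short, one on-target, one long record.
demo=$(mktemp)
{
  printf '>short\n%s\n'  "$(repeat_base 231)"
  printf '>target\n%s\n' "$(repeat_base 253)"
  printf '>long\n%s\n'   "$(repeat_base 420)"
} > "$demo"

kept_ids=$(filter_fasta_by_length 240 255 "$demo" | grep '^>')
echo "records kept: ${kept_ids}"
rm -f "$demo"
```

With the 240 < length < 255 window, only the 253 bp demo record is kept.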

I can find ways to filter sequences based on a feature table, but not the other way around. Do I need to rerun my new set of representative sequences back through dada2 denoise-paired to create a feature table?

Thanks for the help!

lizgehret · October 20, 2022, 5:16pm

Hi @kyle_paddock,

Apologies for the delay in response! That's a great question - you can do this using qiime feature-table filter-features and use your filtered rep-seqs.qza as metadata. Here's what that would look like as a command:

qiime feature-table filter-features \
--i-table YOUR_ORIGINAL_FEATURE_TABLE \
--m-metadata-file rep-seqs.qza \
--o-filtered-table YOUR_FILTERED_FEATURE_TABLE
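Conceptually, this keeps only the features in the table whose IDs also appear in the filtered rep-seqs. A toy sketch of that idea on plain tab-separated text follows. It is not QIIME 2's implementation, and the feature IDs, counts, and the `filter_table_by_ids` name are made up for illustration:

```shell
#!/usr/bin/env bash
# Keep table rows whose feature ID (column 1) appears in the ID list.
# Header lines starting with "#" pass through unchanged.
# Note: this sketch assumes the ID list is not empty.
filter_table_by_ids() {
  local ids="$1" table="$2"
  awk -F'\t' '
    NR == FNR { keep[$1]; next }   # first file: collect IDs to keep
    /^#/ || ($1 in keep)           # second file: header or retained ID
  ' "$ids" "$table"
}

# Hypothetical demo data: featB was dropped by the length filter.
ids=$(mktemp)
table=$(mktemp)
printf 'featA\nfeatC\n' > "$ids"
printf '#OTU ID\tsample1\tsample2\n' > "$table"
printf 'featA\t10\t0\nfeatB\t3\t7\nfeatC\t0\t5\n' >> "$table"

filtered=$(filter_table_by_ids "$ids" "$table")
echo "$filtered"
rm -f "$ids" "$table"
```

The header row plus featA and featC are printed; featB is gone because its ID is not in the list.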

Hope this helps! Cheers
