V3-V4 16s - variable size after merging

Hey everyone,

I am currently working on amplicon sequencing data from the 16s V3-V4 region, using the 341-F and 806-R primers. Some information abouth sequencing lenght:

I received my files from the sequencing facility with the barcodes and primers, that I have trimmed using the plugin cutadapt:

qiime cutadapt trim-paired \
    --i-demultiplexed-sequences 01_intermediate_files/META004_full.qza \
    --p-cores 12 \
    --p-front-f CCTAYGGGRBGCASCAG \
    --p-front-r GGACTACNNGGGTATCTAAT \
    --verbose \
    --p-discard-untrimmed \
    --o-trimmed-sequences 01_intermediate_files/META004_full_clean.qza > 01_intermediate_files/META004_cutadapt_log.txt

(saved the log file to a txt file to make it easier to go through).
META004_cutadapt_log.txt (69.1 KB)

This confirmed the presence of primers & barcodes. I then went and used the clean data:

Some of the reads still had more than 231F and 228R lengths, I decided to trim all the reads at the specific positions:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs 01_intermediate_files/META004_full_clean.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 231 \
  --p-trunc-len-r 228 \
  --verbose \
  --o-table 01_intermediate_files/META004_full_clean_table.qza \
  --o-representative-sequences 01_intermediate_files/META004_full_clean_RepSeqs.qza \
  --o-denoising-stats 01_intermediate_files/META004_full_clean_DenoisingStats.qza

After this, I went and check the length distribution of the Features:

qiime feature-table tabulate-seqs \
  --i-data 01_intermediate_files/META004_full_clean_RepSeqs.qza \
  --o-visualization 02_visualizations/META004_full_clean_RepSeqs.qzv \
  --verbose


Following that, I have some questions:
(1) Is it expected for different lengths in the ASV's? Considering the overlap may vary among reads, this might be expected, I would assume. In fact, I saw the overlapped reads had different sizes when I ran the trimmed reads through FLASH. However, length size might also be determinant for ASV's, meaning similar sequences in different ASVs because they have different sizes, which could be a relevant scenario in this context, so I don't know if this is not worrisome;

(2) The expected size of the amplicon is 466bp, 429 if we do not consider the primers. Is the fact that ASV's, on average, are 415 bp's long while the amplicon size is 429bp a reason to be concerned?

Sorry for the long post,

Thanks!

EDIT:

I decided to add reverse complements of the primers into the cutadapt step:

singularity run --bind /ifs /Users/.singularity_containers/ADA/qiime2_amplicon_2024.2.sif \
qiime cutadapt trim-paired \
    --i-demultiplexed-sequences full_clean/01_intermediate_files/META004_full.qza \
    --p-cores 12 \
    --p-front-f CCTAYGGGRBGCASCAG \
    --p-front-r GGACTACNNGGGTATCTAAT \
    --p-adapter-f ATTAGATACCCNNGTAGTCC \
    --p-adapter-r CTGSAGCVYCCCRTAGG \
    --verbose \
    --p-discard-untrimmed \
    --p-match-read-wildcards \
    --p-match-adapter-wildcards \
    --o-trimmed-sequences full_clean/01_intermediate_files/META004_full_clean_comp.qza \
> full_clean/01_intermediate_files/META004_comp_cutadapt_log.txt

Overall, I only regained ~40 sequences in total. The remaining profiles are rather similar.

1 Like

Hello @asbarros,

Seeing variation in the length of your ASVs is totally expected. Truncating your reads should only affect the overlap/merging process and not the overall ASV length, and you didn't trim, so I don't see any steps where you would have influenced the ASV length. I don't think it's necessarily concerning that your average length is less than the expected size. These will both certainly depend on community makeup. These seemed like your main two concerns, let me know if I missed anything.

1 Like

Hi @colinvwood

Thanks for your response, it is always good to feel reassured.

Meanwhile, I have realized the data was sequenced in NovaSeq, so these results might not be trustworthy.

Hello @asbarros,

I'm unsure what you mean by that--why would sequences from a NovaSeq sequencer be untrustworthy?

Sorry, I probably didn't express myself properly.

As already described in several places (such as Consequences of using dada2 on NovaSeq data · Issue #791 · benjjneb/dada2 · GitHub), DADA2 is not the ideal pipeline for NovaSeq data, so the results after dada2 here will probably need to be reviewed.

Hello @asbarros,

I see, interesting. I thought (incorrectly, apparently) that had been mostly resolved.

I am searching for references on the topic, in the forum and otherwise so, if you have any link you recommend, it would be much appreciated!

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.