Why does the same sequence appear at the start of all output sequences from DADA2?

Hi forum members,

I have been using DADA2 for sequence analysis and I have encountered something puzzling. In all of my output sequences, there are several dozen nucleotides that are identical at the start of each sequence.

Here are a couple of examples:

  1. GCAAGTTGCGCCCGAAGCCATTCGGCCGAGGGCACGTCTGCCTGGGTGTCACGCATCGTTGCCCCCCTCAAACTTCGGTTCGGGTGGGGCGGAAGTTGGCCTCCCGTGCGTGCCTGCGCGCGCGGTTAGCCCAAAAGCGAGTCCTCGGCGACGAGCGCCACGACAATCGGTGGTTTTTTTACCCTCGTTCCTTGTCGTGCGTGCCCCGTCGCCCGAACGCGCTCTTGCGACCCTCACGCGTCGCCTCGGTGGCGCTCCCAA
  2. GCAAGTTGCGCCCGAAGCCATCAGGTTGAGGGCACGTCTGCCTGGGCGTCACATATCGTTGCCCGATGCCTATTGCAATGCAATAGGAATTTCTAGGGCGAATGATGGCTTCCCGTGAGCTTTGTTGCCTCGCGGTTGGTTGAAAATTGAGTCCTTGGTAGGGTGTGCCATGATAGATGGTAGTCGAGTTAGCACAATACCGATCATGTGCATGCTCCCCAAAATATGGCCTCTATGA

As you can see, the same sequence is present at the start of both sequences. I'm unsure why this is happening, and I would appreciate any insights or advice you can provide.

Forward primer: TCGTCGGCAGCGTCAGATGTGYAYAAGAGACAGATGCGATACTTGGTGTGAAT
Reverse primer: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCGCTTAKTGATATGCTTAAA

I used the following command to run DADA2:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trunc-len-f 233 \
  --p-trunc-len-r 150 \
  --p-trim-left-f 53 \
  --p-trim-left-r 55 \
  --p-n-threads 8 \
  --o-table table.qza \
  --o-denoising-stats stats.qza \
  --p-trunc-q 8 \
  --p-n-reads-learn 1000000 \
  --p-max-ee-f 2.0 \
  --p-max-ee-r 2.0 \
  --o-representative-sequences rep-seqs.qza \
  --verbose

Thank you for your attention and assistance.

Shunsuke

Hello Shunsuke,

Welcome to the forums! :qiime2:

What region did you amplify with PCR?

Does this conserved prefix sequence appear before or after using this setting?

--p-trim-left-f 53

Hi Colin,

Thank you for your interest in my question on the QIIME 2 forum.

I amplified the ITS2 region.
These sequences were processed using DADA2 with the '--p-trim-left-f 53' setting.

Regards,

Shunsuke

Hello @Shunsuke_Ito,

As I understand, the output sequences with identical starting bases are from the representative sequences output by dada2, not the raw sequencing data. Correct me if I'm wrong.

I'm guessing the --p-trim-left-f 53 and --p-trim-left-r 55 options are there to remove your primers. It's generally a better idea to do this with qiime cutadapt beforehand, passing in the actual primer sequences.

Those two sequences are identical only for the first 21 bases, then you start to see variation. This could well be homology as @colinbrislawn pointed out. A quick blast search seems to support this hypothesis, so little reason to worry :slightly_smiling_face:.
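
For anyone who wants to check the prefix length themselves, here is a minimal Python sketch. It compares only the first 36 bases of the two example sequences from the original post, and the function name `shared_prefix_length` is just an illustrative choice, not part of QIIME 2 or DADA2.

```python
def shared_prefix_length(seq_a: str, seq_b: str) -> int:
    """Count how many leading bases are identical in both sequences."""
    length = 0
    for base_a, base_b in zip(seq_a, seq_b):
        if base_a != base_b:
            break
        length += 1
    return length


# First 36 bases of the two representative sequences from the original post.
asv_1_start = "GCAAGTTGCGCCCGAAGCCATTCGGCCGAGGGCACG"
asv_2_start = "GCAAGTTGCGCCCGAAGCCATCAGGTTGAGGGCACG"

prefix_len = shared_prefix_length(asv_1_start, asv_2_start)
print(prefix_len)                # 21
print(asv_1_start[:prefix_len])  # GCAAGTTGCGCCCGAAGCCAT
```

The same function could be run pairwise over every exported representative sequence to see whether the shared prefix has a consistent length across the whole dataset.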

Best of luck with the rest of your analysis.

Hi @colinvwood, @colinbrislawn, I really appreciate your time and your thoughts on my analysis.

Instead of cutting the primer sequences within DADA2, I ran cutadapt first. As a result, the average bit score improved (400 → 600).
Also, as you indicated, I now understand that the shared sequence at the start of each ASV reflects homology.
I'm definitely just getting my feet wet with this type of data and analysis, and I am grateful for all insight.

The commands I ran:
1.
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux.qza \
  --o-trimmed-sequences trimmed.qza \
  --p-front-f TCGTCGGCAGCGTCAGATGTGYAYAAGAGACAGATGCGATACTTGGTGTGAAT \
  --p-front-r GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCGCTTAKTGATATGCTTAAA \
  --p-error-rate 0.1 \
  --p-cores 1

2.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 140 \
  --p-n-threads 8 \
  --o-table table.qza \
  --o-denoising-stats stats.qza \
  --p-n-reads-learn 1000000 \
  --p-max-ee-f 2.0 \
  --p-max-ee-r 2.0 \
  --o-represent

Thanks again for your help!! hope all is going well for your own analysis :smile:

We are here to help!

When running cutadapt trim-paired and not using any --p-trim-left settings, do the most commonly observed amplicons change for the better? For ITS I have had trouble finding the best cutadapt settings to use, and trying different configurations has been helpful for me.

Did you include any positive controls with known composition? This is also extremely helpful for choosing the best settings.

Thank you Colin,

The sequences I used were raw data that included barcode regions as well as primers. So I think that using cutadapt trim-paired instead of --p-trim-left-f 53, which assumes a fixed primer length, gave me a better bit score because it let me control the trimming of the artificial regions precisely. Thanks to all of you for your advice.
And yes, I did not use the --p-trim-left options, because the QC looks better without them.
Please point out anything I have misunderstood!
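
To illustrate the difference between positional trimming and primer-based trimming, here is a small Python sketch. It is only a toy model under stated assumptions: `INSERT`, the two-base spacer, and the third read are hypothetical, the helper names are made up, and the exact-match search is a simplification (cutadapt itself also tolerates mismatches and understands IUPAC codes).

```python
from typing import Optional

PRIMER = "ATGCGATACTTGGTGTGAAT"        # locus-specific forward primer from this thread
INSERT = "ACGTTGCAACGTTGCAACGTTGCA"    # hypothetical biological sequence


def trim_fixed(read: str, n_bases: int) -> str:
    """Drop a fixed number of 5' bases, like a positional trim."""
    return read[n_bases:]


def trim_at_primer(read: str, primer: str) -> Optional[str]:
    """Drop everything up to and including the first exact primer match.

    Returns None when the primer is absent, which is similar in spirit
    to discarding untrimmed reads.
    """
    start = read.find(primer)
    if start == -1:
        return None
    return read[start + len(primer):]


reads = [
    PRIMER + INSERT,          # read starts directly with the primer
    "NN" + PRIMER + INSERT,   # hypothetical 2-base spacer before the primer
    "GATTACA" + INSERT,       # hypothetical read with no primer at all
]

for read in reads:
    print(repr(trim_fixed(read, len(PRIMER))), repr(trim_at_primer(read, PRIMER)))
```

In this toy case the positional trim leaves two primer bases on the second read and silently cuts into the third, while the primer-based trim gives the same start for the first two reads and flags the third as untrimmed.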

Shun

And I didn't use positive controls. I had no idea that was an option.
I don't know how to include positive control data yet, but I will try it. :smiley:

Hello @Shunsuke_Ito,

We noticed that you are providing an entire primer + adapter composite sequence to cutadapt. Linking these two together means that primers aren't removed from sequences when the adapters aren't present. Adapters will only begin to show up if an insert is shorter than the read length. Even if all of your sequences contained the adapter sequences, trimming by using only the primers would still be sufficient because they are upstream of the adapters.
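
As a rough illustration of that split, the Python sketch below separates each composite into an adapter part and a primer part. It assumes the locus-specific primers are the last 20 bases (forward) and 21 bases (reverse) of the composites; `split_composite` is a hypothetical helper, not a QIIME 2 or cutadapt function.

```python
# Composite sequences from the first post.
composite_f = "TCGTCGGCAGCGTCAGATGTGYAYAAGAGACAGATGCGATACTTGGTGTGAAT"
composite_r = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCGCTTAKTGATATGCTTAAA"

# Assumed locus-specific primer portions (the 3' ends of the composites).
primer_f = "ATGCGATACTTGGTGTGAAT"
primer_r = "CCGCTTAKTGATATGCTTAAA"


def split_composite(composite: str, primer: str) -> tuple[str, str]:
    """Split a composite into (adapter part, locus-specific primer)."""
    if not composite.endswith(primer):
        raise ValueError("primer is not the 3' end of the composite")
    return composite[: len(composite) - len(primer)], primer


adapter_f, _ = split_composite(composite_f, primer_f)
adapter_r, _ = split_composite(composite_r, primer_r)

for name, composite, adapter, primer in [
    ("forward", composite_f, adapter_f, primer_f),
    ("reverse", composite_r, adapter_r, primer_r),
]:
    print(f"{name}: composite={len(composite)} nt, "
          f"adapter={len(adapter)} nt, primer={len(primer)} nt")
```

The composite lengths come out as 53 and 55 nt, the same numbers used for --p-trim-left-f and --p-trim-left-r in the first post, which suggests those values were chosen to match the full composites rather than the primers alone.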

Thank you @colinvwood,

Your insight was invaluable in refining the process. I have modified my script as follows:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux.qza \
  --o-trimmed-sequences trimmed.qza \
  --p-front-f ATGCGATACTTGGTGTGAAT \
  --p-front-r CCGCTTAKTGATATGCTTAAA \
  --p-discard-untrimmed \
  --p-error-rate 0.1 \
  --p-cores 1

As a result of implementing your advice, the issue of the same sequence appearing at the beginning of each ASV has been resolved. Below are two examples of the final sequences. As you can see, the shared prefix is gone. It is impressive that you identified the adapter sequence at a glance!

1)TCGGTTGAGGGCACGTCTGCCTGGGCGTCACGCATCACGTCGCCCCCACCAGGCATGGTTGGCCCCACGTCTGCCTGTCTTGTGTTGGGGCGGAGATTGGTCTCCCGTGCCCATGGCGTGGTTGGCCTAAATAGGAGTCTCCTCGCGAGGGACGCACGGCTAGTGGTGGTTGATAAGACAGTCGTCTCGTGTCGTGCGTTTACTTTCTTGAGAGTAGATGCTCTTAAAGTACCCTGATGTGTTGTCTTATGACGATGCTTCGATCGCGACCCCAGGTCAGGCGGGACTACCCGCTGAG

2)TTGGCCGAGGGCACGTCTGCCTGGGCGTCACGCATCGCTGCCCCCCCACGCAACACCCACTATGGATTGTTGCGCATGAGGGAGCACATGCTGGCCTCCCGTGCGCACCGTCGCACGGATGGCTTAAATTCGAGTCCTCGGCGCCTGTCGTAGCGACACTACGGTGGTTGATCCAACCTCGGTACCGTGTCACGACCTCAGCCCGCACACCTCCTCCTTGTGAGCGAGCGAGGACTTCTATGTTGACCCTTTGAACGTTGTCCCCTAAAGATGGCGTTCTCGACGCGACCCCAGGTCAGGCGGGACTACCCGCTGAA

One last thing I'd like to confirm: am I correct in assuming that these sequences are all biological in origin, and any artificial sequences such as adapters, primers, or index sequences have been removed?

Regards,
Shunsuke

Hello @Shunsuke_Ito,

It wasn't me who noticed but another moderator, @SoilRotifer :grinning:. I believe the length of the overall sequence is what was suspicious, but maybe he has all the Illumina adapters memorized, who knows :sweat_smile:. Glad to hear it has resolved your problems.

"One last thing I'd like to confirm: am I correct in assuming that these sequences are all biological in origin, and any artificial sequences such as adapters, primers, or index sequences have been removed?"

I think your command looks reasonable and you've probably removed the vast majority of artificial sequences from your reads. It would be totally reasonable to move forward with these reads in my opinion.

If you want even more confidence in the primer/adapter removal, you could:

  • run the same command with the --p-times option set to 2 and compare the output to your previous output. This can help remove primer-dimer reads if there are any that weren't removed by size selection during library prep.
  • search for the adapter sequences by themselves in case they slipped into reads in unexpected places (make sure to do this without --p-discard-untrimmed enabled).
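
The second check can also be approximated outside QIIME 2 on exported sequences. The Python sketch below is one possible way to do it, not an official tool: the helper names are hypothetical, `contaminated` is a made-up sequence, and only the first 40 bases of the two final ASVs posted above are used. It expands IUPAC codes into regular expressions and looks for a query or its reverse complement anywhere in each sequence.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def to_regex(seq: str) -> re.Pattern:
    """Compile a (possibly degenerate) sequence into a regex."""
    return re.compile("".join(IUPAC[base] for base in seq.upper()))


def reverse_complement(seq: str) -> str:
    """Reverse complement that also handles IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def count_hits(query: str, sequences: list[str]) -> int:
    """Number of sequences containing the query or its reverse complement."""
    patterns = [to_regex(query), to_regex(reverse_complement(query))]
    return sum(any(p.search(s) for p in patterns) for s in sequences)


# Adapter part of the reverse composite from the first post.
adapter_r = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"

# First 40 bases of the two final ASVs posted above.
asv_starts = [
    "TCGGTTGAGGGCACGTCTGCCTGGGCGTCACGCATCACGT",
    "TTGGCCGAGGGCACGTCTGCCTGGGCGTCACGCATCGCTG",
]
# Hypothetical sequence with adapter read-through, for comparison.
contaminated = "ACGT" + reverse_complement(adapter_r) + "ACGT"

print(count_hits(adapter_r, asv_starts))
print(count_hits(adapter_r, asv_starts + [contaminated]))
```

Exact matching like this will miss adapters that carry sequencing errors, so a zero count is reassuring but not proof. An error-tolerant search with cutadapt remains the more thorough option.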

Hi @colinvwood and everyone,

I very much appreciate your advice. All of my questions have been resolved.
We are studying environmental DNA analysis from a variety of samples including honey, air and water.
Your advice has helped us to further our research.

I wish you all the best in your developments. Have a nice weekend. :grinning: :grinning:
