I have been using DADA2 for sequence analysis and I have encountered something puzzling. In all of my output sequences, there are several dozen nucleotides that are identical at the start of each sequence.
As you can see, the same sequence is present at the start of both sequences. I'm unsure why this is happening, and I would appreciate any insights or advice you can provide.
As I understand it, the output sequences with identical starting bases are the representative sequences output by DADA2, not the raw sequencing data. Correct me if I'm wrong.
I'm guessing the --p-trim-left-f 53 and --p-trim-left-r 55 options are to remove your primers. It's generally a better idea to do this with qiime cutadapt beforehand, passing in the actual primer sequences.
Those two sequences are identical only for the first 21 bases, then you start to see variation. This could well be homology, as @colinbrislawn pointed out. A quick BLAST search seems to support this hypothesis, so there is little reason to worry.
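To check how far a shared prefix extends across a whole set of ASVs rather than eyeballing two of them, something like the following could be used. This is only a minimal sketch: the function name and the toy sequences are made up for illustration and are not the ASVs from this thread.

```python
def shared_prefix_length(seqs):
    """Return the number of leading bases that are identical in every sequence."""
    length = 0
    # zip(*seqs) walks the sequences column by column and stops at the shortest one.
    for column in zip(*seqs):
        if len(set(column)) != 1:
            break
        length += 1
    return length


# Hypothetical toy sequences, NOT the ASVs discussed in this thread.
asvs = [
    "ACGTACGTTTGACC",
    "ACGTACGTAAGACC",
    "ACGTACGTTAGGCC",
]
print(shared_prefix_length(asvs))
```

A long prefix shared by every ASV is a hint that something non-biological (primer, adapter) may remain; a prefix shared by only some ASVs is more consistent with homology.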
We ran cutadapt before DADA2 instead of cutting the primer sequences within DADA2. As a result, the average bit score improved from 400 to 600.
Also, as you indicated, we now understand that the sequence shared at the start of each ASV reflects homology.
I'm definitely just getting my feet wet with this type of data and analysis and am grateful for all insight.
The script I ran:
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux.qza \
  --o-trimmed-sequences trimmed.qza \
  --p-front-f TCGTCGGCAGCGTCAGATGTGYAYAAGAGACAGATGCGATACTTGGTGTGAAT \
  --p-front-r GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCGCTTAKTGATATGCTTAAA \
  --p-error-rate 0.1 \
  --p-cores 1
When running cutadapt trim-paired and not using any --p-trim-left settings, do the most commonly observed amplicons change for the better? For ITS I have had trouble finding the best cutadapt settings to use, and trying different configurations has been helpful for me.
Did you include any positive controls with known composition? This is also extremely helpful for choosing the best settings.
The sequences I used were raw data that included barcoding regions as well as primers. So I think that using cutadapt trim-paired instead of --p-trim-left-f 53, which assumed a fixed primer length, gave me a better bit score because I could precisely control the trimming of the artificial regions. Thanks to all of you for your advice.
Yes, I did not use --p-trim-left, because the QC looks better without it.
I would be glad to have you point out any misunderstandings I have!
We noticed that you are providing an entire primer + adapter composite sequence to cutadapt. Linking these two together means that primers aren't removed from sequences when the adapters aren't present. Adapters will only begin to show up if an insert is shorter than the read length. Even if all of your sequences contained the adapter sequences, trimming by using only the primers would still be sufficient because they are upstream of the adapters.
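To illustrate the point, here is a rough sketch that splits each composite sequence from the command above into its adapter part and its primer part. It rests on an assumption not stated in the thread: that the adapter overhang ends with the motif AAGAGACAG, which both composite sequences contain. The function name is hypothetical, and the resulting primer strings should be checked against your own library prep protocol before use.

```python
# ASSUMPTION: the adapter overhang ends with this motif (both composites contain it).
ADAPTER_TAIL = "AAGAGACAG"


def split_composite(composite, adapter_tail=ADAPTER_TAIL):
    """Split 'adapter + primer' at the end of the last adapter_tail occurrence."""
    idx = composite.rfind(adapter_tail)
    if idx == -1:
        raise ValueError("adapter tail motif not found in composite sequence")
    cut = idx + len(adapter_tail)
    return composite[:cut], composite[cut:]


# Composite sequences copied from the cutadapt command in this thread.
front_f = "TCGTCGGCAGCGTCAGATGTGYAYAAGAGACAGATGCGATACTTGGTGTGAAT"
front_r = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCGCTTAKTGATATGCTTAAA"

for name, composite in (("--p-front-f", front_f), ("--p-front-r", front_r)):
    adapter, primer = split_composite(composite)
    print(name, "composite length:", len(composite), "primer only:", primer)
```

Under that assumption, the composite lengths come out to 53 and 55, which lines up with the trim values mentioned earlier in the thread, and the shorter primer-only strings are what would be passed to --p-front-f and --p-front-r.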
As a result of implementing your advice, the issue of the same sequence appearing at the beginning of each ASV has been resolved. Below is an example of the final sequence. As you can see, the same sequence is no longer recurring. Your expertise is impressive, to have identified the adapter sequence at a glance!
One last thing I'd like to confirm: am I correct in assuming that these sequences are all biological in origin, and any artificial sequences such as adapters, primers, or index sequences have been removed?
It wasn't me who noticed but another moderator, @SoilRotifer. I believe the length of the overall sequence is what was suspicious, but maybe he has all the Illumina adapters memorized, who knows. Glad to hear it has resolved your problems.
I think your command looks reasonable and you've probably removed the vast majority of artificial sequences from your reads. It would be totally reasonable to move forward with these reads in my opinion.
If you want even more confidence in the primer/adapter removal, you could:
run the same command with the --p-times option set to 2 and compare the output to your previous output; this can help remove primer-dimer reads, if there are any and they were not removed by size selection during library prep.
search for the adapter trimming sequences by themselves in case they slipped into reads in unexpected places (make sure to do this without --p-discard-untrimmed enabled).
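As a quick sanity check outside of QIIME 2, the second suggestion could be approximated by counting how many reads still contain an adapter or primer sequence. The sketch below is an assumption-laden illustration: it uses exact, IUPAC-aware matching (cutadapt itself tolerates mismatches, so this will undercount), it only checks the given orientation, and the reads are toy strings rather than data from this thread.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(query):
    """Compile a regex that matches every sequence encoded by an IUPAC string."""
    return re.compile("".join(IUPAC[base] for base in query.upper()))


def count_hits(reads, query):
    """Count reads containing at least one exact (IUPAC-aware) match to query."""
    pattern = iupac_to_regex(query)
    return sum(1 for read in reads if pattern.search(read.upper()))


# Hypothetical toy reads, NOT data from this thread.
reads = [
    "TTTAGATGTGTATAAGAGACAGCCGCTTATTGATATGC",  # contains an adapter-like motif
    "ACGTACGTACGTACGT",                        # contains neither
    "CCGCTTAGTGATATGCTTAAAGG",                 # contains the reverse primer (K = G)
]
print(count_hits(reads, "AGATGTGTATAAGAGACAG"))
print(count_hits(reads, "CCGCTTAKTGATATGCTTAAA"))
```

In practice the reads would come from the exported FASTQ files, and a non-zero count after trimming would be a reason to look more closely at where in the reads the matches occur.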
I very much appreciate your advice. All of my questions have been resolved.
We are studying environmental DNA analysis from a variety of samples including honey, air and water.
Your advice has helped us to further our research.
I wish you all the best in your developments. Have a nice weekend.