Dear community,
Is there any parameter in DADA2 that disables the merging of sequences with identical prefixes, i.e. the behavior where a sequence that is identical to the prefix of two or more longer sequences is clustered with the shortest of them?
Below are the details:
We are analyzing data sequenced on IonTorrent with the 16S Metagenomics Kit. We denoised the reads with DADA2 after running cutadapt. DADA2 is much faster than q-score filtering followed by vsearch dereplicate-sequences, so we chose DADA2 first.
However, we find that DADA2 merges sequences with identical prefixes: if a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. As a result, different genera receive the same taxonomy assignment instead of being resolved to the families they belong to. For example,
we have the following two sequences:
sequence1
TACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGGTGAGGTTAATAACCTCATCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATAC
sequence2
TACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGATAAGGTTAATAACCTTGTCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATAC
After DADA2, both are collapsed into the identical prefix 'TACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTT', and the differing information in the tail is lost.
As a result, both sequences are classified only as Gammaproteobacteria rather than being assigned to their respective families.
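For reference, here is a small Python check of the example above. It is only a sketch: the variable and function names are ours and not part of DADA2 or QIIME 2. It confirms that the reported sequence is a prefix of both originals and shows where the two originals first differ.

```python
# The two original sequences and the sequence reported after DADA2,
# copied from the example above.
seq1 = ("TACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGG"
        "CCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGGTGAGGTTAATAACCTCATCGATTGACGTTACCCGC"
        "AGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATAC")
seq2 = ("TACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGG"
        "CCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGATAAGGTTAATAACCTTGTCGATTGACGTTACCCGC"
        "AGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATAC")
reported = ("TACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAG"
            "AAGGCCTTCGGGTTGTAAAGCACTT")


def common_prefix_length(a, b):
    """Return the number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


lcp = common_prefix_length(seq1, seq2)
print("reported is a prefix of sequence1:", seq1.startswith(reported))
print("reported is a prefix of sequence2:", seq2.startswith(reported))
print("length of reported sequence:", len(reported))
print("first position where the originals differ (0-based):", lcp)
print("sequence1 tail lost after the reported prefix:", seq1[len(reported):])
print("sequence2 tail lost after the reported prefix:", seq2[len(reported):])
```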
In the other method, vsearch dereplicate-sequences, there is a parameter that can be adjusted:
--p-derep-prefix / --p-no-derep-prefix (Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.)
Setting this parameter to false (--p-no-derep-prefix) lets us avoid the behavior we see with DADA2. However, this route takes much longer than DADA2, because chimera filtering and other filtering steps have to follow, and those steps are time-consuming.
Can anyone help? Thank you very much in advance!
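To make the prefix-merging rule quoted above concrete, here is a toy Python sketch of how we understand it. This is not vsearch's or DADA2's actual implementation: the function name `derep_prefix` and the example reads are made up for illustration, and letting counts flow from shorter to longer sequences is our assumption.

```python
def derep_prefix(counts):
    """Toy prefix dereplication following the rule quoted above.

    counts maps each unique sequence to its abundance. A sequence that is a
    prefix of longer sequences is merged into the shortest of them; if several
    candidates have the same length, the most abundant one is chosen.
    """
    merged = dict(counts)
    # Visit short sequences first so their counts can flow into longer ones.
    for seq in sorted(counts, key=len):
        longer = [s for s in merged if len(s) > len(seq) and s.startswith(seq)]
        if not longer:
            continue  # no longer sequence shares this prefix; keep it as is
        # Shortest candidate wins; ties on length go to the most abundant.
        target = min(longer, key=lambda s: (len(s), -counts[s]))
        merged[target] += merged.pop(seq)
    return merged


# Hypothetical reads: "ACGT" is a prefix of two equally long sequences.
reads = {"ACGT": 5, "ACGTAA": 3, "ACGTCC": 7, "TTGA": 2}
print(derep_prefix(reads))  # {'ACGTAA': 3, 'ACGTCC': 12, 'TTGA': 2}
```

With prefix merging turned off, only exact full-length duplicates would be combined, so the four reads in this toy example would stay separate.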
Best,
Joy