dada2 --p-derep-prefix

Dear community,

Is there any parameter in dada2 that can avoid merging sequences with identical prefixes? By this I mean the behavior where, if a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them.

Below are the details:

We are analyzing IonTorrent data generated with the 16S Metagenomics Kit. After cutadapt, we tried denoising the reads with dada2. Dada2 is much faster than the alternative (q-score filtering followed by vsearch dereplicate-sequences), so we tried dada2 first.

However, we find that dada2 merges sequences with identical prefixes: if a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. As a result, different genera receive the same taxonomy assignment, collapsed to the family they belong to. For example:

We have the two sequences below:

sequence1
TACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGGTGAGGTTAATAACCTCATCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATAC
sequence2
TACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGATAAGGTTAATAACCTTGTCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATAC

After dada2, both are collapsed into their shared prefix 'TACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTT', and the information that distinguishes them (the tail) is lost.

As a result, both sequences are assigned only to Gammaproteobacteria, not to their respective families.
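
A quick way to see where two such sequences stop agreeing is to measure their shared prefix. The sketch below is only illustrative: `common_prefix_length` is a hypothetical helper, and the two short reads are made-up stand-ins for sequence1 and sequence2, not the real data.

```python
def common_prefix_length(a: str, b: str) -> int:
    """Return the number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical short reads standing in for sequence1 and sequence2.
read_1 = "TACGGGAGGCAGCGGTGAGG"
read_2 = "TACGGGAGGCAGCGATAAGG"

shared = common_prefix_length(read_1, read_2)

# Everything before `shared` is identical; everything after it is the
# "tail" that would be needed to tell the two reads apart.
print("shared prefix length:", shared)
print("tail of read_1:", read_1[shared:])
print("tail of read_2:", read_2[shared:])
```

Pasting the real sequence1 and sequence2 in place of the toy reads shows how many bases they share before they diverge, which can be compared against the length of the sequence dada2 reports.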

The other method, vsearch dereplicate-sequences, has a parameter that can be adjusted:
--p-derep-prefix / --p-no-derep-prefix (Merge sequences with identical prefixes. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant.)
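
To make the quoted rule concrete, here is a minimal Python sketch of exact versus prefix dereplication. It is a toy re-implementation of the rule as described in the help text above, not vsearch's actual code; the function names `derep_exact` and `derep_prefix` and the example reads are hypothetical.

```python
from collections import Counter


def derep_exact(reads):
    """Collapse only reads that are identical over their full length."""
    return dict(Counter(reads))


def derep_prefix(reads):
    """Toy prefix dereplication following the rule quoted above.

    A sequence that is a prefix of one or more longer sequences is merged
    into the shortest of them; ties in length go to the most abundant.
    """
    totals = dict(Counter(reads))
    # Visit shortest sequences first so merged counts carry forward.
    for seq in sorted(totals, key=len):
        longer = [s for s in totals if len(s) > len(seq) and s.startswith(seq)]
        if longer:
            target = min(longer, key=lambda s: (len(s), -totals[s]))
            totals[target] += totals.pop(seq)
    return totals


# Hypothetical reads: "ACGT" is a prefix of three longer sequences.
reads = ["ACGT", "ACGTAA", "ACGTAA", "ACGTCC", "ACGTCCG"]

print(derep_exact(reads))   # four unique sequences are kept
print(derep_prefix(reads))  # shorter reads are absorbed into longer ones
```

Note that under this rule the shorter read is absorbed into a longer one, so the longer sequences (and their tails) are the ones that survive.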

This lets us set the parameter to FALSE (--p-no-derep-prefix) and avoid the situation we see with dada2. However, this route takes much longer than dada2, because chimera filtering and other filtering steps have to follow, and those steps are slow.

Can anyone help? Thanks so much in advance!

Best
Joy

Hello Joy,

Can you help me find this setting? I don't see it in the dada2 denoise-pyro plugin or the full DADA2 manual (PDF).

Just to check, are you running cutadapt -> vsearch derep -> dada2?

Thanks!

Hi Colin,

  1. Can you help me find this setting? I don't see it in the dada2 denoise-pyro plugin or the full DADA2 manual (PDF).

Yes, it is the dada2 denoise-pyro plugin. It has no "--p-derep-prefix" parameter; sorry for the mistake.

  1. Just to check, are you running cutadapt -> vsearch derep -> dada2?
    No, we are running two separate pipelines to find the better one for our data:
    A. cutadapt --> dada2
    B. cutadapt --> q-score --> vsearch derep
    vsearch derep has '--p-derep-prefix', but dada2 has no equivalent option to avoid merging sequences with identical prefixes (that is, when a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them).

Thank you for telling me more! I know how to solve this problem!

This is one good way to run dada2:

A. cutadapt --> dada2

This is one good way to run vsearch:

B. cutadapt --> q-score --> vsearch derep --> vsearch cluster-features-de-novo

The dada2 plugin does dereplication by itself and expects the sequences to NOT be dereplicated. So running vsearch derep --> dada2 will break dada2.
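
As a rough illustration of why pre-dereplicated input is a problem, the toy sketch below shows what happens to abundance information when reads are dereplicated twice. This is not DADA2's algorithm; it only assumes that a denoiser counts how often each unique sequence occurs, and the reads are made up.

```python
from collections import Counter

# Hypothetical raw reads: one abundant sequence, one likely error, one other taxon.
raw_reads = ["ACGT"] * 5 + ["ACGA"] + ["TTGC"] * 3

# Dereplicating the raw reads recovers the true abundances.
abundances = Counter(raw_reads)
print(abundances)

# If an upstream step has already written one record per unique sequence,
# a second dereplication sees every sequence exactly once.
already_dereplicated = list(abundances)
abundances_after = Counter(already_dereplicated)
print(abundances_after)
```

In the second case the rare variant looks just as abundant as the sequence it probably derived from, so an abundance-based error model has nothing to work with.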

The prefix clustering happens in the vsearch derep step. If you want to avoid it, just skip that step!

I hope this helps. Let me know if you have other questions about processing pipelines.