To cluster or not to cluster?

shira · May 30, 2019, 11:56am

Hello - I would like to bring up a topic that has been discussed before (A quick clarification on clustering) and provide an example from my own analysis.

I am looking at one particular taxonomic group within environmental samples of bacterial populations. The method of denoising is dada2, so I am assuming that OTU clustering was not necessary since sequencing errors are accounted for during the denoising process. However, this analysis resulted in many OTU's being defined, some of them quite rare. When I run the same analysis again, and add a step of OTU clustering at 99%, the data collapses into only a few OTU's (see snapshots below). I am more than happy with the original data showing high diversity, and it is to be expected of environmental samples. But since much of the diversity seems to be driven by very minor modifications to the sequences, I would like to understand if I this does not suggest that relying on dada2 alone and skipping clustering may produce an artifact showing high diversity, where in fact what we have is minor sequencing errors.

On top of this, a close look at the otu's produced by dada2 shows that identical sequences are not defined as the same OTU if one runs a bit longer, but they do cluster into one if you add a clustering step. In my example here the OTU ending with cd7 shows up 89 times but the sequence is identical (but a bit shorter) to that of the most common otu ending 640, which shows up 93,000 times. After 99% clustering the short otu disappears, I am assuming it merges with the common one.

I hope I was able to explain myself - any clarification would be greatly appreciated!

lca123 · May 30, 2019, 12:17pm

I am on it to so I'll follow your thread. Had the same question to myself already.

jwdebelius · May 30, 2019, 1:05pm

Hi @shira,

In general, I'm a big proponent of leaving denoising alone and not re-clustering it.
With 99% clustering, you're still losing a lot of potential information. Recently, Ive been seeing some interesting behavior from ASVs that differ by only a few nucleotides in a 400bp sequence. If I clustered at 99%, I definitely would not be able to see these!

However, I think in your case, specifically, you're running into one of the limitations.

Because denoising works on absolute sequence variants, it treats each sequence as unique. This includes length. The recommend approach is to trim your sequences to a specified length so that you're not inflating based on a length distribution. Therefore, my recommendation for you would not be to cluster are denoising, but simply to pick length and pass that to --p-truc-len. (For some reason, I thought this was a required parameter ¯\_(ツ)_/¯).

Best,
Justine

shira · May 30, 2019, 1:23pm

Thanks @jwdebelius - what a clear explanation. I like your solution and I will give it go!

shira · May 30, 2019, 1:38pm

There is one thing I still don't understand - my samples were sequenced using the pair-end approach - how does it happen that the same sequence results in amplicons of varied length? Do truncations always occur at the 5' end of reverse reads and never at the forward? If not, how would truncating the reads solve this problem when variation in length is introduced at the 5' end of the forward read? Thanks

Mehrbod_Estaki · May 30, 2019, 8:10pm

Hey all,
I'm always intrigued by this topic so I just thought I'd throw in my 2 cents as well, if nothing at least I'll have somewhere to reference in future posts. But I look forward to reading other views on the matter.
First just to clarify terminology because sometimes using the wrong terms can add extra confusion, especially to newcomers. The product of denoising methods such as DADA2, DEBLUR, UNOISE, (and perhaps MED?) are Exact Sequence Variants (ESV). These different methods all arrive at these variants slightly differently and so they have different names. DADA2 ESVs are called Amplicon Sequence Variants (ASV), Deblur calls them sub-OTUs (sOTU), and UNOISE method calls them zero-radius OTU (zOTU). It use to bother me that we didn't call them all the same thing but the fact that by simply calling them their correct name I can figure out what method produced them has grown on me. You also hear the term feature thrown out there to refer to these also but that's a different discussion. The common idea behind all of these is that they deal with the exact sequences itself at all time and they don't worry about similarity, lineage etc. As @jwdebelius nicely described even a single nt difference between them, be it addition, deletion, or mutation will lead to calling a new ESV; since we are discussing DADA2, I'll stick with ASV.

Now, what we do with the ASVs (and everyone agrees, we should all start with ASVs) from here on is completely question/data driven. For example: collapsing the ASVs to the their closest taxonomy, functional clades, phylogenetic clades etc. and of course clustering down to OTUs at some similarity threshold. These are all just methods that we can choose to find patterns and make our data more human digestible. They all have their advantages and disadvantages. Take OTU clustering for example, a disadvantage is as @jwdebelius described, you may lose interesting trends when you collapse ASVs, but one could argue that an advantage of clustering would be you don't falsely split the same organism into multiple ones thus inflating its influence on community metrics. This can happen when you consider some organisms have multiple 16S gene copies and those copies differ from each other by 1 or more nts. Even messier is that these differences can be identified in some hypervariable regions but not others. The copy number problem is even crazier in fungi. when those differences can reach dozens or more. Your choice for all this then must be question-driven taking into account the advantages and disadvantages of each method you choose.

Coming back to some of the unanswered questions now. --p-trun-len is not necessary for dada2 but is required for Deblur and the recommendation to truncate to equal length has more to do with quality control, the denoising algorithm's specific requirements, and chimera detection than biological importance. Of note, setting truncating is not recommended for ITS data however due to the large natural variation of that region. In Qiime2 you have the option of trim/truncating at either the 5' or 3' ends on either forward, reverse, or both reads. Variable lengths can occur because of natural variation in the region, sequencing errors, and chimera formations (and probably some others). Variable length is more likely to be introduced when considering longer target regions such as those reads from paired-ends. Trimming your reads to a fixed length is one option to deal with the so called identical reads being called 2 different ASVs issue, but so is leaving them as is. For example consider these 2 ASVs :

 AATTGGCCAATT
GAATTGGCCAATT

These may be different strains of the same species with 1nt difference, different species with 1nt difference in this particular region (more differences might occur elsewhere), or the different 16S gene of the same organism which happen to differ from each other by 1nt such as S. aureus. There is really no way to be sure and trimming that first nt to get equal length doesn't resolve the ambiguity either, simply covers it up and pretends we know. This is a limitation of short amplicon sequencing in general with no real solution at the moment. However, there is some comfort to be found in the fact that if such limitations confound your data, these should occur evenly across all your samples/groups so at the end of the day even if your data is not perfect, the patterns between your experimental groups should still stand out.
I think I'll stop there.

lca123 · May 31, 2019, 7:05pm

To add something because I'm also interested on it... I haven't been clustering but keeping the ASVs even when they differ by 1,2,3.. nts, but in fact, there is no sollution still as @Mehrbod_Estaki mentioned.
However I very often work with isolated bacteria (mostly Bacillus), which are likelly an ideal positive control, and that helped me to find that these ASVs use to came from the same bacteria. I am aware that could be an artifact of my bacteria and if we sequence other genus maybe that wont happen. But it's been working.
Another one to consider is that, as you pointed out, there are always one or two ultra-frequent ASVs and other low-frequent which in general doesn't make so much in terms of abundance. I highlighted in general because we use a mock made of a few isolated species and sometimes those low-frequent ASVs doesn't show up in the mock samples but do so in other samples through the set. Those cases make me take a deeper look and blast them as a second check but still they keep classifying as the same species or a very close one.

shira · May 31, 2019, 7:55pm

Thanks @lca123 and @Mehrbod_Estaki for sharing you experience. It seems like the consensus is to leave the ASVs, and assume they come from different origins, even if very similar. It's ok that the results are not perfect but still consistent. However, in some cases ignoring the problem can really impact the interpretation of the results. Like in my case, where we focus on one taxonomic group and seem to get inflated complexity due to this issue. @Mehrbod_Estaki example really highlights this problem:

My question is - as far as sequencing artifacts go - has there beev an artifact described causing the reaction to skip a few nucleotides in the beginning? It would be helpful to have that knowledge and use it to weed your ASVs in cases where you can't afford to just let things be, like when you are looking at a single taxonomic group, or single organism.

lca123 · June 3, 2019, 12:34pm

That's something I haven't thought about and it does not look impossible. It's worth to check it in the next data we get.

cduvallet · June 3, 2019, 3:46pm

Jumping in here to give my two cents on this -- another option to reduce the total number of taxa while maintaining meaningfully different ASVs (ones which have different behavior across samples but may differ by only a few nt) without necessarily clustering at 99% is to use distribution-based clustering, described in the Preheim et al 2013 paper and available as a plugin.

Distribution-based clustering uses both genetic similarity and distribution similarity in the clustering. If you use a 99% identity threshold, any ASVs which are more than 1% different will not be clustered no matter what. However, ASVs which are 99.99% similar but which are distributed differently across your samples will also remain separate.

The paper about the python 3 implementation (which is the implementation wrapped by the plugin) has an overview of the algorithm which I find pretty clear [just replace OTU with ASV]:

Make the most abundant sequence an OTU.

For each sequence (in order of decreasing abundance), find the set of OTUs that meet “abundance” and “genetic” criteria. The abundance criterion requires that the candidate sequence be some fold less abundant than the OTU (e.g., so that it can be considered sequencing error). The genetic criterion requires that the candidate sequence be sufficiently similar to the OTU’s sequence (e.g., so that it can be considered sequencing error or part of the same population of organisms).

If no OTUs meet these two criteria, make the candidate sequence into a new OTU.

If OTUs do meet these criteria, then, starting with the most genetically-similar OTU, check if the candidate sequence is distributed differently among the samples than that OTU. If the distributions are sufficiently similar, merge the candidate sequence into that OTU. Specifically, add the candidate sequence’s counts across samples to the OTU’s counts.

If the candidate sequence does not have a distribution across sample sufficiently similar to an existing OTU, then make this sequence a new OTU.

Move on to the next candidate sequence.

I'm still working to update the plugin to qiime2-2019.1 and there aren't too many users of the plugin yet, so please let me know if you come across any issues using it!

colinbrislawn · June 3, 2019, 4:41pm

I'm really glad you brought up your specific use case:

All these denoising programs have parameters you can... uh... parameterize to best match your biological system. Settings like --p-min-fold-parent-over-abundance and --p-chimera-method in dada2 can be set differently in different systems.

So while in general I agree with Mehrbod and would keep these as different dadas...

 AATTGGCCAATT
GAATTGGCCAATT

... if I was sequencing a microbial isolate, I would probably want to put these into the same feature.

Thanks for helping us stay focused on the real biology

Colin

Nicholas_Bokulich · June 3, 2019, 4:54pm

Unless if those slightly dissimilar ASVs are true biological differences, arising from multiple copies of the 16S rRNA gene present in your species. If you see many variants associated with the same species, there is likely a problem, but 1-2 variants within the same isolate is not uncommon...

Have you sequenced the genome of your isolates to confirm that multiple (variable) gene copies are not present? You don't need to do this, I am just raising a "what if" in the event that you don't have genomes for your isolates.

shira · June 3, 2019, 5:31pm

After staring some more at my raw data, I determined that in my case there are only 6 true biological ASVs but these translate into 20 ASVs after dada2 due to variations in length. It looks like the shorter variants also lack the primer sequence, so I plan to go back to the raw data and use cutadapt to filter out reads that don't contain the primer sequence. I am hoping this approach will solve the problem. Does this make sense?

I am not certain how this occurres and why it's not a problem for everyone. It may have something to do with the way the sequencing libraries are constructed (cloning, mr. dna), but it looks like it's not a good idea to ignore this issue when it arises.

timanix · June 3, 2019, 6:24pm

Hi, all.
It is very interesting that I was looking into the same issue independently. Actually, I decided to check if different ASVs are different from each other as a biological sequences and one day later I read your question here. So I was tracking this discussion for a while.
In my case, I found a numerous ASVs that were assigned to the same taxonomy. For one of my bacteria, the amount of ASVs in feature table was outstandingly greater than in all other cases.
I was wondering, should I cluster my table with VSEARCH.
After some discussion about it with my supervisor, we decided not to use clustering.
I checked in MAFFT all sequences, assigned to the same taxonomy. Not like in your data, all my sequences were exactly the same length, but contained variants. The amount of this variants was big enough for me to keep this sequences as different ASVs.
And now I'm really curious about your case.
I'm just wondering, where is the border line between cases, in which one should merge ASVs, and in which he should not. It's sound reasonable to merge identical sequences in one ASV if they differ only in length, but which number of SNPs is appropriate for keeping them separated?

PS. Does your primer sequences contain ambiguouty symbols? I had some issues with my sequences length until I replaced this symbols with "N". I still have some SNPs in places where no variants should be. I don't know, if it is an issue with my primers set, or it's derived from the sequencing step. I also wanted to remove my primers inside DADA2 just providing corresponding number of nucleotides to cut, but surprisingly they weren't aligned nicely between different sequences when I checked it.

lca123 · June 3, 2019, 6:25pm

They are to be sequenced but not sure when

colinbrislawn · June 3, 2019, 7:58pm

Hi @shira

That sounds like a good plan to me. Once the variable primers are gone, all the reads should be the same length in the same ASVs.

Another option is to make ASVs using dada2, then cluster at 100% identity using vsearch cluster-features-de-novo. Vsearch clustering ignoes gaps at the ends, so this should ignore the primers.

Colin

shira · June 4, 2019, 7:30am

Thanks @colinbrislawn - this never occurred to me as a possibility - its a great solution to cluster at 100% - this way I don't have to lose all these primer-less reads, I think they compose about 10% of my total reads...

shira · June 4, 2019, 8:04am

In general it seems like it's almost always better to leave the ASV's as they have been defined by dada2 unless you have reason to suspect that the variants are a result of a technical error and do not represent the biology, like in my case.

If you are not sure, perhaps what @cduvallet suggested here - distribution based clustering - could help since it takes into account the biological behaviour of the ASVs, not only the sequence.

Yes, my primer contains degenerate nucleotides, but that is not the source of my problem, as the primers were removed in dada2 based on position, and were not part of the analysis. You certainly want to remove your primers early on or they will cause all sorts of problems downstream. Cutadapt is another option for doing that, and it is based on the sequence not the location, and can handle ambiguous IUPAC nucleotide code.

By the way, what do you mean by:

shira · June 4, 2019, 8:09am

What a great thread this has become! Thank you all for your contribution - I learned so much

timanix · June 4, 2019, 8:12am

Sorry for bad English.
Just searched for a partial sequence of my primers in representative sequences and noticed that they are not located in the same positions, some of them have additional extra nucleotides upstream.