Filtering for private ASVs


(Devon O'rourke) #1

There are myriad ways to filter features and samples in QIIME. I’m wondering if there is a way within QIIME to determine the amplicons with which are distinct, given some metadata parameter. The documentation regarding identifier or metatdata approaches seem like filters whereby you can select (or discard) based on one or more string types, but what I’d like to do is compare across strings in hopes of finding those ASVs which are unique to one or more of whatever identifier I select.

My project involves several sequencing runs of arthropod COI sequence data. Within these runs there were positive and negative controls, as well as samples from which the insect COI sequences are expected to be very distinct (for example, some samples are from bird guano from Hawaii, while others are from bat guano from Vermont).

The reason why I’m curious about distinct/private ASVs per group is because I think it can help assist in quality control. While there is a chance that there will be some overlap, I would expect the greater proportion of overlaps to be due to cross contamination, rather than shared consumption of prey items (though, it would be pretty neat if a Honeycreeper in Hawaii was eating the same beetle as a Big Brown in Burlington).

Does anyone know of an approach within QIIME to do this?
I’m resorting to exporting the merged feature tables and doing this in R at the moment.

Thanks for your consideration

Do you guys still remove singletons or doubletons these days
(Mehrbod Estaki) #2

Hi @devonorourke,
Just taking a crack at this, hopefully I understood your goals correctly.
So you are interested in a) determining some unique ASVs that are exclusive in one group but not in another? And b) looking for ways to track the source of potential contaminants across samples? The latter sounds like something the SourceTracker tool in Qiime1 (though admittedly I’ve never used this personally). Not sure if something comparable exists in qiime2, perhaps something does exist within the qiime2 classifiers. :man_shrugging:

As for your first goal, I was thinking perhaps one approach would be through the use of feature table core-features action.
Whereby you first merge your samples from different locations, identify features that are shared across all samples, then filter those identified features from your original ‘separated’ tables. Whatever remains is then exclusive to that subset. It’s not pretty but probably will work.
A = Hawaii
B = Burlington
Merge A+B = C
Run core-features on C to get shared features = D
Then subtract D from A/B tables. so…
A - D = Exclusive A features
B - D = Exclusive B features.


(Nicholas Bokulich) #3

That’s not really what sourcetracker is for, and it would not be clear if source features are actually specific to one group (e.g., they could be missing from some group members or just be more abundant in that group than others, I believe).

q2-sample-classifier could identify characteristic features but again it would not be clear if these are unique to your groups (the most important features probably are! But you would need to check to confirm). Still, it could be a fine place to start.

Honestly, I think that exporting and working in R or python is probably just the easiest thing to do since this really is what you might call a custom job. May be useful for others, though — feel free to share your solution or put together a method to contribute to QIIME 2 :wink:

(Justine) #4


I know I’m jumping in late, but this is very much related to the Checkerboard Score (Stone and Roberts, 1990) I mentioned in a different post.


(Nicholas Bokulich) #5

Oh cool — the checkerboard score will relate to the # of unique features in each group, though, and will not identify those unique features — is that correct?

At the end of the day, these could both be useful additions to a QIIME 2 plugin: checkerboard score and a simple method to report features that are unique to individual groups (maybe something slightly more sophisticated than that, e.g., a “locally core microbiome”) — any thoughts?

(Devon O'rourke) #6

I have a few follow ups, and am working on some visualizations to share this morning. Hopefully we can continue this conversation as the day unfolds if people have interest.

In my earlier question about read depth filtering, I had posted a per-library plot that suggested several instances in which sample read depth followed a (fairly) bimodal distribution.I was curious about what it looked like after DADA2 filtering was implemented; the following plot shows a comparison. Apologies when looking between the two figures as the earlier post didn’t contain every library, and this one contains all 12 runs:

My takeaway is that DADA2 performed as expected. About 100 of the initial 4,600 samples were dropped entirely, as these contained low numbers of total sequences per sample, and these were likely either chopped because (a) they had too few to even be considered by DADA2 and were discarded on the front end, or (b) contained sufficient read depth, but were of low quality or chimeric, and tossed on the back end of the pipeline.
What’s clear is that the bimodal distributions of some libraries remain, though perhaps it’s not as pronounced now.

So what gives!? @Nicholas_Bokulich asked earlier about whether my samples were split in terms of biomass, and the short answer is no, but the long answer is I have no idea. Why? Because I’m dealing with bat poop, so I have no really good way of determining the arthropod biomass in a given bat turd. It probably varies, but to what extent, I haven’t the slightest clue. There is decades of previous work describing using tweezers and microscopes dissecting out the arthropod parts, and folks have reported sample biomass (or volume) in those samples, but I’m not sure how variable it is. Nor do I think the reported biomass has much bearing on what gets amplified - that’s why we’re doing the molecular stuff, right?

The biomass question led me to wondering about what the read depth distributions are like when thinking about them on a sequence variant level, rather than a per-sample level.
The following plot represents the entire dataset aggregated into one pool (though remember, all samples were sequenced across 12 different runs). It’s summarizing the read depth per ASV rather than per sample. And boy, do I have a lot of infrequently observed ASVs.


So the x-axis represents the number of reads attributed to a particular ASV, summed up among all samples in the entire dataset. This is to say the vast majority of my ASVs are showing up less than 1000 times. As there are about 4,500 samples which passed DADA2 filtering, that’s pretty amazing to me.
I wonder what folks see in their microbial datasets? How rare is rare?

Okay, now last plot for this post. Same as the above per-ASV plot, but subsetted (faceted?) by sequencing run.

I like this, in that it shows me that rare ASVs aren’t creeping into my dataset as a function of some wonky MiSeq run or two. Oh, and the run for library 8.1 sucked.

So, at this point I haven’t addressed anything that @jwdebelius mentioned/suggested regarding the checkboardness. I haven’t addressed anything that @Mehrbod_Estaki wrote in terms of my original point of this thread - private ASVs.
Nevertheless I think these plots begin to get after my overall motivation - trying to figure out what to keep in a dataset. The point of the private ASVs is to also inform filtering parameters.

More to come. Excited to hear any comments you might have.

(Justine) #7

The output I’ve gotten has been a list of features that map significantly to one partition or the other. Its been a while since Ive run the QIIME 1 code, so I dont remember the format entirely. But, the idea is on a feature by feature basis, to determine which features belong to which partition.

(Justine) #8

Hey @devonorourke,

It looks like you’re losing a lot of your low abundance samples (if I’m reading this correctly.) Possibly because DADA2 has filtered out those rare reads as contaminants or chimeras. So, it potentially has to do with the interaction.

The place I’m curious (and sorry for what is potentially a new tangent) might be Chimera filtering. If you take a unimodal and bimodal plate and then do chimera filtering for the combined data outside of DADA2 and then re-filter, does that change your distribution? (Or has someone else done this and I’m just not looking hard enough?)

In terms of rare/low abundance ASVs, I recently did a similar plot on a combined run of 9 plates.


Most of my features are rare, and low abundant. This has been my experience with other datasets in other sample types. (Current experiment is human oral; Ive done a fair bit of human fecal). I think with even relatively permissive filtering criteria, I lost like 1/3 of my features.

Okay, and one other thought I had about looking on a per-run basis might be something like Gneiss and a heat map using sequence-based clustering. On a few occasions where I’ve had devastating strong technical effects, it’s been a useful visualization.


(Devon O'rourke) #9

Second bit of info now, and migrating away from histograms to scatterplots.

I was curious about how the frequency of detection (binary) of an ASV in a sample was correlated with the read depth of that ASV. In other words, for a particular ASV, let’s sum up the entirety of the reads associated for that ASV across all samples (read depth), and count how many times it occurs across all samples (the frequency). What I was hoping that would jump out would be outliers along either x or y axis. And well, not so much?

My thoughts about the outliers… really interested in other people’s interpretations:

  • x axis outliers - those to the far right and along the bottom… do these represent PCR biases? the kinds of ASVs which amplify really well, but perhaps are relatively rare as far as a dietary component (given that they are in so few samples?) How else to intepret those ASVs which are rare but dominate the read abundance?
  • y axis outliers - those to far left and along the top… do these represent contaminants? things that amplify poorly? both? in terms of “contaminants”, I think these might be potentially both sequence artifacts via barcode switching as well as potential low-level biological contaminants in our PCR or DNA procedures.

If you’re curious, the numbered data points represent an alias for each ASV which contained more than a million reads (rather than using the hash assigned to each sequence variant). All 11 of these data points are from insects I expect from my bat guano samples - all ASVs (with one exception appear to align with high confidence to one or two arthropods, so I don’t think any of these high abundant ASVs are chimeric.
That would be, perhaps super infrequently, a possibility though, right? So it’s not like this plot is foolproof at identifying outliers. I wonder how often folks see abundant chimeras creep into their dataset?

I wanted to extend this idea to a more logical point in thinking about data filtering specifically with the negative controls that I sampled. Because I had about 8 negative controls for every 96 well plate, and had something like 70 plates that got processed, there ended up being 243 samples which were included in sequencing and generated > 0 reads which passed DADA2 filtering.

So, the following plot uses the same inputs as the above plot, but is focused solely on ASVs present in negative controls:

[EDIT: apologies, I had to rework the above image; the below comment stays the same though]

What’s nice (I think?) is that you can really see a distinction between a handful of ASVs which are generating lots of reads relative to the vast majority of ASVs which are at super low read depths across a range of detection abundances. I’m not clear whether I should be discarding an ASV which is present in many samples but at very very low frequencies yet - basically all the left-sided ASVs. Curious about others thoughts.

My thinking at the moment is that I should be concerned with just the handful to the right. If you look at the ASVid’s between the two plots, the previously identified 11 ASVs which had more than 1 million reads all show up in the negative control dataset. While they’re not particularly abundant in terms of total read depth across the samples, they do vary quite a bit in terms of detection abundance (how many samples we see a particular ASV). If one would expect that barcode switching occurs at some background level, I’d expect my most abundant amplicons in my entire dataset to show up sort of like what we see here - really low read depths, but also probably fairly frequent.

If you wanted to know how this varied as a function of the library - well I have a plot for that too!

What we see pretty clearly(ish) is that the contaminant outliers are library specific. My take away from this plot is that I might try to eliminate those ASVs from the dataset, but perhaps removing them only on a per-sequencing-run basis. Curious about peoples thoughts on that strategy.


(Devon O'rourke) #10

In addition to the negative controls, I also have somewhere between 1-3 positive controls for all but one of my twelve sequencing run. This seems to me to be a another nice way to evaluate contamination.

If we examine only the mock community, the data are quite stark: you’re almost always in high abundance, or you’re super low (what I’d hope for, right?!) - note the two bands illustrate how some sequencing runs had multiple mock communities with partially overlapping (expected/known) amplicons:

This got me to wondering how often my negative control ASVs were creeping into the mock dataset. No plot here - they’re pretty much all in there. But was that because the mock samples were getting contaminated with my negatives, or were my super-high amplifying positive controls getting into the negative controls? Who’s contaminating who!?

I started by finding the ASVs these two groups have in common - finally I’m getting at the point of shared vs. private ASVs!! I then recreated the same plot as in my previous post which was looking at all the negative control samples, figuring out how frequency of sample detection was correlating with the number of reads a particular ASV generated among all negative controls. So, this is a plot exclusively of negative controls samples, and I’m highlighting any ASV shared between the mock community and negative controls. And wow, there sure are a lot fo common points!

But the real thing to highlight, in my opinion, is the mock community ASVs which are in greater abundance when we consider the mock samples on their own (as in, the ASVs in the first plot of this post). Highlighting everything in common doesn’t indicate which way the contamination is likely originating. So this next plot uses the same data as the one immediately above, maintains the same highlighting for shared ASVs (at any read depth), but then further highlights the ASVs in the mock community which had at least 10,000 reads across all mock samples - that is, the green points here are the ones in the two horizontal bands in the first plot of this post. And indeed, they don’t really contribute much to the negative control samples:


It looks to me like the highly amplifying positive controls do not contribute to overall contamination in any significant way (phew).

Thanks again for any insights/comments!