Filtering samples and seqs for downstream analysis

amorris28 · December 7, 2019, 2:23am

Question:
I have imported, demultiplexed, and denoised a set of 16S V4 amplicon sequences using 515F–806R primer. I realized at this point that half of my samples are DNA and half of them are cDNA (from RNA) and I just want to focus on the DNA samples. Reading the forums and the filtering tutorial, I figured out that I could run qiime feature-table filter-samples with a metadata file containing just the sample ids that I want to keep. So I subsetted my metadata file ('mapping_file_16S.tsv') to just the rows containing dna samples ('mapping_file_16S_dna.tsv') and produced a filtered table:

qiime feature-table filter-samples \
  --i-table table.qza \
  --m-metadata-file mapping_file_16S_dna.tsv \
  --o-filtered-table id-filtered-table.qza

Then I renamed the IDs for those samples (to drop the '.dna' suffix and just leave the sample ids):

qiime feature-table group \
  --i-table id-filtered-table.qza \
  --p-axis sample \
  --m-metadata-file mapping_file_16S_dna.tsv \
  --m-metadata-column Sample-id-new\
  --p-mode sum \
  --o-grouped-table reindexed-table.qza

Next I wanted to cluster my features by sequence similarity:

qiime vsearch cluster-features-de-novo \
  --i-table reindexed-table.qza \
  --i-sequences rep-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza

And I got this error:

Plugin error from vsearch:

  Feature ce412904babbb9125249b1622e45378c is present in sequences, but not in table. The set of features in sequences must be identical to the set of features in table.

So my rep-seqs.qza still contains all of the features from my cDNA samples (which I filtered out) and vsearch is mad that I haven't removed them. Following the filtering tutorial, I think I can just filter my seqs using the reindexed table that I just created:

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table reindexed-table.qza \
  --o-filtered-data filtered-rep-seqs.qza

Now vsearch runs without errors:

qiime vsearch cluster-features-de-novo \
  --i-table reindexed-table.qza \
  --i-sequences filtered-rep-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza

However, I'm concerned, because as I understand it, the rep-seqs object does not contain ID information, so how do I know rep-seqs got filtered to the correct samples? Plus, I just changed the IDs for all of my samples in that feature table, so it must be matching table.qza to rep-seqs.qza some other way. I don't understand how FeatureData[Sequence] and FeatureTable[Frequency] and sample-metadata.tsv all communicate about which samples and features are connected.

Have I done this process right? Can I be confident I'm now clustering the right sequences?
More generally, how do these different objects connect to each other so I can make informed choices in the future?
Is there some way I can confirm this filtering worked correctly? Like in R I would see if the sample IDs match using %in% or by printing the list of IDs, but I'm not sure how to do a similar check in QIIME2.

Thanks!

Qiime:
Running QIIME2 2019.10 in a conda environment.

jwdebelius · December 8, 2019, 10:10am

Hi @amorris28,

Your filtering is correct. Im not so sure about your process behind this (see below), but the two step filtering is right.

What's happening when you filter your feature table is that you are also, on the side filtering your metadata table. So, if i have a feature table with four samples and 3 asvs (its a toy example, bare with me)

asv-id	DNA.1	DNA.2	cDNA.1	cDNA.2
asv-1	100	50	120	40
asv-2	10	0	0	0
asv-3	0	0	10	0

and some kind of repset (which I'll represent as a tabular format as well)

asv-id	sequence
asv-1	AGTCCA
asv-2	CCTATA
asv-3	CCTATC

Then, when you apply your filter for the table you remove samples that are only DNA, you remove all thes amples, but you also remove any row that's completely 0s.

asv-id	DNA.1	DNA.2	cDNA.1	cDNA.2
asv-1	100	50	120	40
asv-2	10	0	0	0
asv-3	0	0	10	0

So, now, your filtered table looks like this:

asv-id	DNA.1	DNA.2
asv-1	100	50
asv-2	10	0

...So, you get an error. Now, when you filter your rep set, what you're actually filtering on is the ASV-id in your feature table. It has nothing to do wtih the samples, only the features present.

You could check this by summarizing the artefacts and comparing the two lists. But, its a good idea to think about in the future. Generally, having a larger rep-set doesn't matter here because in general, people use the repset for phylogeny and taxonomy neither of which care about missing features.

What you're doing is a rare (and I'll be honest somewhat objectionable to me) approach. So, here's my question there. You have ASVs, which are externally valid for the same primer pair and ASV length and give you single nucelotide resolution and are database independent. And then, you're going to cluster them so they're no longer externally valid (i.e. a denovo OTU from study 1 isn't necessarily a de novo OTU from study 2) and sacrifice your resolution. This is especially concerning because if you plan to come back to the cDNA to combine (or compare) the data, if you do independent de novo clustering, the OTUs won't be comparable, since de novo clustering is somewhat unstable and clustering can change with the introduction of new sequences/samples. I would really consider carefully whether this is the best approach for your project, data, and goals.

Best,
Justine

amorris28 · December 8, 2019, 9:03pm

Hey @jwdebelius, thank you for the response. I think I understand now. filter-seqs is simply removing the ASVs that are no longer present in the feature table, so my filtering approach is okay.

As for the the second part, thank you for your question, because I have been grappling with the best way to do this. My research question is focused on the phylogenetic conservation of traits. So my objective is to cluster reads at different phylogenetic levels and then test the relationship between community composition and other variables in my metadata. I could imagine a number of ways of doing this, but the most direct and least biased approach (in my mind) is simply grouping reads by their 16S sequence similarity. Your concerns about taking ASVs (which are externally valid, etc. ) and replacing them with something without those benefits is sort of irrelevant (or a side effect of) my broader question.

As for the cDNA/DNA problem - the cDNA data were collected to address other research questions and it's not clear to me how to use them for this particular part of my research. But it sounds like what you're suggesting is I should keep the cDNA samples in during clustering in case I'm interested in looking at them downstream of this step? In that case, I might consider keeping them.

Hopefully that clarifies the reasoning behind my approach. Happy to hear if you have any further thoughts.

Thanks,
Andrew

jwdebelius · December 9, 2019, 8:50am

Hi @amorris28,

Why not optimize for tree building (it sounds like youre working with something other than 16s) and then work with phylogenetic approaches instead? Ive really liked PhyloFactor (which is in bioconductor but not QIIME) as a way to get really interesting patterns and associations between my community structure and metadata. You retain resolution, you get the phylogeny, and you're better able ot support reproducible science in the community and for yourself.

Best,
Justine

system · January 9, 2020, 2:50pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.