Question:
I have imported, demultiplexed, and denoised a set of 16S V4 amplicon sequences generated with the 515F–806R primers. At this point I realized that half of my samples are DNA and half are cDNA (from RNA), and I want to focus on just the DNA samples. Reading the forums and the filtering tutorial, I figured out that I could run qiime feature-table filter-samples with a metadata file containing just the sample IDs that I want to keep. So I subsetted my metadata file (‘mapping_file_16S.tsv’) to just the rows containing DNA samples (‘mapping_file_16S_dna.tsv’) and produced a filtered table:
qiime feature-table filter-samples \
--i-table table.qza \
--m-metadata-file mapping_file_16S_dna.tsv \
--o-filtered-table id-filtered-table.qza
Then I renamed the IDs for those samples (to drop the ‘.dna’ suffix and just leave the sample ids):
qiime feature-table group \
--i-table id-filtered-table.qza \
--p-axis sample \
--m-metadata-file mapping_file_16S_dna.tsv \
--m-metadata-column Sample-id-new \
--p-mode sum \
--o-grouped-table reindexed-table.qza
Next I wanted to cluster my features by sequence similarity:
qiime vsearch cluster-features-de-novo \
--i-table reindexed-table.qza \
--i-sequences rep-seqs.qza \
--p-perc-identity 0.99 \
--o-clustered-table table-dn-99.qza \
--o-clustered-sequences rep-seqs-dn-99.qza
And I got this error:
Plugin error from vsearch:
Feature ce412904babbb9125249b1622e45378c is present in sequences, but not in table. The set of features in sequences must be identical to the set of features in table.
So my rep-seqs.qza still contains all of the features from my cDNA samples (which I filtered out) and vsearch is mad that I haven’t removed them. Following the filtering tutorial, I think I can just filter my seqs using the reindexed table that I just created:
qiime feature-table filter-seqs \
--i-data rep-seqs.qza \
--i-table reindexed-table.qza \
--o-filtered-data filtered-rep-seqs.qza
Now vsearch runs without errors:
qiime vsearch cluster-features-de-novo \
--i-table reindexed-table.qza \
--i-sequences filtered-rep-seqs.qza \
--p-perc-identity 0.99 \
--o-clustered-table table-dn-99.qza \
--o-clustered-sequences rep-seqs-dn-99.qza
However, I’m concerned, because as I understand it, the rep-seqs object does not contain sample ID information, so how do I know rep-seqs got filtered to the correct samples? Plus, I just changed the IDs for all of my samples in that feature table, so it must be matching table.qza to rep-seqs.qza some other way. I don’t understand how FeatureData[Sequence], FeatureTable[Frequency], and sample-metadata.tsv all communicate about which samples and features are connected.
- Have I done this process right? Can I be confident I’m now clustering the right sequences?
- More generally, how do these different objects connect to each other so I can make informed choices in the future?
- Is there some way I can confirm this filtering worked correctly? In R, I would check whether the sample IDs match using %in% or by printing the list of IDs, but I’m not sure how to do a similar check in QIIME2.
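To make that last question concrete, here is a rough sketch of the kind of check I have in mind. It is not something I have confirmed is the recommended approach. It assumes the two artifacts have already been exported to a FASTA file and a tab-separated table (the export commands and output paths in the comments are my assumption), and compare_feature_ids is just a helper name I made up:

```shell
# Assumed pre-step (not run here; output paths are my guess):
#   qiime tools export --input-path filtered-rep-seqs.qza --output-path exported-seqs
#   qiime tools export --input-path reindexed-table.qza --output-path exported-table
#   biom convert -i exported-table/feature-table.biom \
#       -o exported-table/feature-table.tsv --to-tsv

# compare_feature_ids FASTA TSV  (hypothetical helper)
# Prints every feature ID that appears in only one of the two files.
# Prints nothing and returns 0 when both files hold the same set of IDs.
compare_feature_ids() {
    fasta="$1"
    tsv="$2"
    seq_ids=$(mktemp)
    tab_ids=$(mktemp)

    # FASTA headers look like ">featureid"; keep the ID up to the first space.
    grep '^>' "$fasta" | sed 's/^>//' | cut -d ' ' -f 1 | sort -u > "$seq_ids"

    # Skip the '#' header lines of the TSV and keep the first column.
    grep -v '^#' "$tsv" | cut -f 1 | sort -u > "$tab_ids"

    # comm -23: lines only in the first file; comm -13: only in the second.
    report=$(
        comm -23 "$seq_ids" "$tab_ids" | sed 's/^/only-in-sequences: /'
        comm -13 "$seq_ids" "$tab_ids" | sed 's/^/only-in-table: /'
    )
    rm -f "$seq_ids" "$tab_ids"

    if [ -n "$report" ]; then
        printf '%s\n' "$report"
        return 1
    fi
    return 0
}

# Example call, using the assumed export paths from the comments above:
#   compare_feature_ids exported-seqs/dna-sequences.fasta exported-table/feature-table.tsv
```

If this prints nothing, I would take that to mean the table and the sequences share exactly the same feature IDs, which seems to be what the vsearch error was complaining about.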
Thanks!
Qiime:
Running QIIME2 2019.10 in a conda environment.