I have imported, demultiplexed, and denoised a set of 16S V4 amplicon sequences generated with the 515F–806R primers. At this point I realized that half of my samples are DNA and half are cDNA (from RNA), and I just want to focus on the DNA samples. Reading the forums and the filtering tutorial, I figured out that I could run qiime feature-table filter-samples with a metadata file containing just the sample IDs that I want to keep. So I subsetted my metadata file (‘mapping_file_16S.tsv’) to just the rows containing DNA samples (‘mapping_file_16S_dna.tsv’) and produced a filtered table:
qiime feature-table filter-samples \
  --i-table table.qza \
  --m-metadata-file mapping_file_16S_dna.tsv \
  --o-filtered-table id-filtered-table.qza
Then I renamed the IDs for those samples (to drop the ‘.dna’ suffix and just leave the sample ids):
qiime feature-table group \
  --i-table id-filtered-table.qza \
  --p-axis sample \
  --m-metadata-file mapping_file_16S_dna.tsv \
  --m-metadata-column Sample-id-new \
  --p-mode sum \
  --o-grouped-table reindexed-table.qza
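For anyone wanting to reproduce the metadata step, here is a minimal sketch of how the DNA-only mapping file with its Sample-id-new column could be built with awk. This is not necessarily how I did it; the toy sample IDs, the ‘sample-id’ header, and the ‘type’ column are made up for illustration, and the sketch assumes the sample ID sits in the first column and that DNA sample IDs end in ‘.dna’.

```shell
#!/bin/sh
# Toy stand-in for mapping_file_16S.tsv. The IDs, the 'sample-id' header,
# and the 'type' column are hypothetical, for illustration only.
workdir=$(mktemp -d)
printf 'sample-id\ttype\nS1.dna\tDNA\nS1.cdna\tcDNA\nS2.dna\tDNA\n' \
    > "$workdir/mapping_all.tsv"

# Keep the header plus every row whose ID ends in '.dna', and append a
# Sample-id-new column holding the ID with that suffix removed.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { print $0, "Sample-id-new"; next }
     $1 ~ /\.dna$/ { new = $1; sub(/\.dna$/, "", new); print $0, new }' \
    "$workdir/mapping_all.tsv" > "$workdir/mapping_dna.tsv"

# Summarize the result: number of lines kept and the new IDs (third column).
n_rows=$(wc -l < "$workdir/mapping_dna.tsv" | tr -d ' ')
new_ids=$(tail -n +2 "$workdir/mapping_dna.tsv" | cut -f3 | paste -sd, -)
echo "lines kept (including header): $n_rows"
echo "new IDs: $new_ids"
rm -rf "$workdir"
```

Because every new ID is unique here, grouping with --p-mode sum just renames the samples; if two rows shared the same new ID, their counts would be added together instead.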
Next I wanted to cluster my features by sequence similarity:
qiime vsearch cluster-features-de-novo \
  --i-table reindexed-table.qza \
  --i-sequences rep-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza
And I got this error:
Plugin error from vsearch: Feature ce412904babbb9125249b1622e45378c is present in sequences, but not in table. The set of features in sequences must be identical to the set of features in table.
rep-seqs.qza still contains all of the features from my cDNA samples (which I filtered out) and vsearch is mad that I haven’t removed them. Following the filtering tutorial, I think I can just filter my seqs using the reindexed table that I just created:
qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table reindexed-table.qza \
  --o-filtered-data filtered-rep-seqs.qza
vsearch runs without errors:
qiime vsearch cluster-features-de-novo \
  --i-table reindexed-table.qza \
  --i-sequences filtered-rep-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza
However, I’m concerned because, as I understand it, the rep-seqs object does not contain sample ID information, so how do I know rep-seqs got filtered to the correct samples? Plus, I just changed the IDs for all of my samples in that feature table, so it must be matching rep-seqs.qza some other way. I don’t understand how table.qza, rep-seqs.qza, and sample-metadata.tsv all communicate about which samples and features are connected.
- Have I done this process right? Can I be confident I’m now clustering the right sequences?
- More generally, how do these different objects connect to each other so I can make informed choices in the future?
- Is there some way I can confirm this filtering worked correctly? Like in R I would see if the sample IDs match using %in% or by printing the list of IDs, but I’m not sure how to do a similar check in QIIME2.
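To be concrete about the kind of check I mean, here is a minimal sketch in plain shell that compares two lists of IDs. The files and IDs are made up for illustration; I am assuming that in a real run one list would come from the FASTA headers of the exported filtered sequences and the other from the IDs in the table, and I have not confirmed the best way to get those out of QIIME2.

```shell
#!/bin/sh
# Hypothetical stand-ins: a tiny FASTA file for the filtered sequences and
# a plain-text list of the IDs in the table (one per line).
workdir=$(mktemp -d)
printf '>featA\nACGT\n>featB\nGGCC\n' > "$workdir/seqs.fasta"
printf 'featB\nfeatA\n' > "$workdir/table_ids.txt"

# Pull the IDs out of the FASTA headers (text after '>' up to the first space).
grep '^>' "$workdir/seqs.fasta" | sed 's/^>//; s/ .*//' | sort \
    > "$workdir/seq_ids.sorted"
sort "$workdir/table_ids.txt" > "$workdir/table_ids.sorted"

# comm -3 prints only the IDs that appear in one list but not the other,
# so empty output means the two sets of IDs are identical.
mismatches=$(comm -3 "$workdir/seq_ids.sorted" "$workdir/table_ids.sorted")
if [ -z "$mismatches" ]; then
    result="identical"
else
    result="different"
fi
echo "ID sets are $result"
rm -rf "$workdir"
```

This is roughly the shell equivalent of all(a %in% b) && all(b %in% a) in R.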
Running QIIME2 2019.10 in a conda environment.