Is there a simple command that returns read counts to keep track of what's happening while going through the QIIME2 pipeline for 16s rRNA data?

Hey guys,

I am currently going through the QIIME2 pipeline for 16s rRNA sequence analysis. So far so good… However I would like to keep track of what is happening at each step with a simple overview of the read counts (overall and per sample). I would like to do this for both my samples and the SILVA classifier that I am training (I am testing out various identity and read length cutoffs for the classifier).

I guess I can use the qiime metadata tabulate or qiime feature-table tabulate-seqs commands but I am finding this to be a bit tedious and computationally expensive each time (particularly for the classifier as it is a large file).

I was wondering if there are any simpler commands that I can use that would return a total read count value right there on the terminal window or something like that? I am fairly new to this and would really appreciate your advice. :grinning: :grinning:

Hi @Anahita_Bharadwaj,
The best way to get a summary of read counts (total and per sample) in a feature table is using qiime feature-table summarize.

You can't inspect the classifier itself but you can inspect the classification results. qiime metadata tabulate is the best way to directly inspect the results, but a better way to evaluate results (if you have a known composition or even if you don't know the true composition but want to compare to a "gold standard" composition, e.g., the results using the default settings or your favorite method) is to use some of the actions in q2-quality-control.

Let me know if those are what you are looking for!

Hi @Nicholas_Bokulich,
Thanks for the response! That’s what I figured but wanted check just in case since generating the feature-table for the classifier reference reads was taking a (little) bit of time (each time I do it) and all I really needed was the read count and sequence lengths information.

Interesting note about q2-quality-control. I will check it out. It looks like there are mock communities that I can test against. Awesome!

So far I have,
qiime feature-classifier extract-reads
–i-sequences SILVA-138-SSURef-Full-Seqs.qza
–o-reads ref_seqs_SILVA_138

I was playing around with the idea of adding --p-identity 0.9 (due to this) although I don’t see any reason to.

I would really appreciate your general thoughts about it. Thanks again for all the great help and support!

1 Like

Those can be very time-consuming tasks, when summarizing information about a large table or large number of sequences. But you are right, generating those numbers without the visualizations would be faster.

There is a way to do this if you are comfortable working with QIIME 2's python API. Feature tables and sequence artifacts can be viewed as pandas DataFrames or Series, allowing easy summation of numbers of rows and/or samples. For example:

>>> import qiime2 as q2
>>> import pandas as pd
>>> tab = q2.Artifact.load('table.qza')
>>> seqs = q2.Artifact.load('rep-seqs.qza')
>>> tab.view(pd.DataFrame).shape
(34, 770)
>>> seqs.view(pd.Series).size
>>> print('my table has {0} samples and {1} features'.format(*tab.view(pd.DataFrame).shape))
my table has 34 samples and 770 features
>>> print('my representative sequences file contains {0} features'.format(seqs.view(pd.Series).size))
my representative sequences file contains 770 features

The last few lines are just an example of how you could put this code in a python script that you can call from the terminal and print out some nice words describing these tables.

Yeah no need — the issue described in that topic was really specific and we've only seen a few times, and fixed with the min and max length parameters that we've since added to extract-reads.

I hope that helps!


This is great, I haven’t ventured into using Python for R yet but this might a reason to start! Thanks for your help.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.