I’ve noticed that a lot of my reads are still unmatched (as can be guessed from the size of the resulting unmatched-cr-99.qza file).
I’d like to know if there is an easy way to figure out exactly what percentage of reads in each sample remains unclustered.
Also, while making a barplot, I expected an ‘Others’ segment to appear in each bar, representing the unclassified reads, but this wasn’t the case. Is there an option I’m missing? The plot was made from chimera-filtered data.
You can run qiime feature-table summarize to get a count of sequences per sample, and compare this to the pre-clustering count per sample from qiime demux summarize.
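Once you have the two sets of per-sample counts, the percentage is simple arithmetic. Here is a minimal Python sketch, assuming you have copied the pre-clustering and post-clustering counts into two dictionaries keyed by sample ID. The function name, sample IDs, and numbers are all hypothetical, made up for illustration; they do not come from QIIME 2 or from your data.

```python
def percent_unclustered(before, after):
    """Return {sample_id: percent of reads that did not survive clustering}.

    `before` and `after` map sample IDs to read counts (hypothetical inputs,
    e.g. typed in from the two summaries).
    """
    result = {}
    for sample_id, n_before in before.items():
        # A sample missing from `after` kept no reads at all.
        n_after = after.get(sample_id, 0)
        if n_before == 0:
            # Avoid division by zero for empty samples.
            result[sample_id] = 0.0
            continue
        result[sample_id] = 100.0 * (n_before - n_after) / n_before
    return result


# Hypothetical counts, for illustration only.
before = {"sampleA": 10000, "sampleB": 8000}
after = {"sampleA": 7500, "sampleB": 2000}

for sample_id, pct in percent_unclustered(before, after).items():
    print(f"{sample_id}: {pct:.1f}% unclustered")
```

With these made-up numbers, sampleA loses 25.0% of its reads and sampleB loses 75.0%.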
No “others” should appear, for two reasons. First, you are effectively already filtering out any reads that do not match the reference (which is what would be classified as “other”). Second, you are not actually performing taxonomy classification: with closed-reference OTU clustering you find the closest reference sequence to each query and then adopt the taxonomy of that reference sequence (all clusters are now labeled with the reference sequence ID, and you are using the reference taxonomy from Greengenes).
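To make the idea concrete, here is a toy Python sketch of why every cluster ends up with a label. It is only a conceptual illustration, not how QIIME 2 implements anything; the function name, reference IDs, and taxonomy strings are hypothetical.

```python
def label_clusters(cluster_ids, reference_taxonomy):
    """Look up the taxonomy of each cluster by its reference sequence ID.

    After closed-reference clustering, every cluster ID is a reference ID,
    so every lookup succeeds and nothing is left over as "other".
    """
    return {ref_id: reference_taxonomy[ref_id] for ref_id in cluster_ids}


# Hypothetical reference taxonomy (stand-in for the reference database).
reference_taxonomy = {
    "ref_001": "k__Bacteria; p__Firmicutes",
    "ref_002": "k__Bacteria; p__Proteobacteria",
    "ref_003": "k__Archaea; p__Euryarchaeota",
}

# Reads that matched no reference never became clusters, so they are
# simply absent here rather than showing up as an unclassified group.
cluster_ids = ["ref_001", "ref_003"]

for ref_id, taxon in label_clusters(cluster_ids, reference_taxonomy).items():
    print(ref_id, taxon)
```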
The first suggestion worked, though I ran it on the FeatureTable produced by clustering (table-cr-99_new.qza) and on the dereplicated table created earlier, which was the input to clustering (dereplicated_table.qza). In both cases, I used qiime feature-table summarize.
Is that right? Why start from the demultiplexed reads, as you suggested, which is a step before the required dereplication?
It does; I hadn’t considered that it depended on the type of clustering performed… and I’m going to try this now, thanks!
Not needed; I was just thinking of before/after the clustering pipeline (and dereplication is part of clustering in my mind), but your way is more clever, since I don’t think any reads should be lost at the dereplication step.