Unmatched sequences percentage from vsearch

Sparkle · October 22, 2019, 1:18pm

Hello,
I'm running 16S amplicon closed-reference clustering on previously denoised (DADA2) and dereplicated reads against the Greengenes database (99% identity), as follows:

qiime vsearch cluster-features-closed-reference \
  --i-table denoise/dereplicated_table.qza \
  --i-sequences denoise/dereplicated_sequences.qza \
  --i-reference-sequences gg_13_8/gg_13_8_otus/99_otus.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-cr-99_new.qza \
  --o-clustered-sequences rep-seqs-cr-99_new.qza \
  --o-unmatched-sequences unmatched-cr-99_new.qza \
  --p-strand both

I've noticed that many of my reads are still unmatched (easy to guess from the size of the resulting unmatched-cr-99.qza file).

Is there an easy way to find out exactly what percentage of reads in each sample remains unclustered?

Also, when making a barplot, I expected an 'Others' segment to appear in each bar, representing the unclassified reads, but none did. Is there an option I'm missing? The plot was made from chimera-filtered data.

qiime taxa barplot \
  --i-table uchime-dn-out_dn_new/table-nonchimeric-wo-borderline.qza \
  --i-taxonomy gg_13_8/gg_13_8_otus/taxonomy.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization taxa-bar-plots_dn_concloroplasto.qzv \
  --verbose

I'm using qiime2-2019.7

Thanks in advance!

Nicholas_Bokulich · October 22, 2019, 2:28pm

You can run qiime feature-table summarize to get a count of sequences per sample, and compare this to the pre-clustered count per sample in qiime demux summarize.
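
As a rough sketch of that comparison, assuming the per-sample counts from the two summaries have been copied into plain Python dictionaries. The function name, sample IDs, and numbers below are made up for illustration and are not QIIME 2 output:

```python
def percent_unmatched(before, after):
    """Percentage of reads lost per sample between two count summaries.

    `before` and `after` map sample ID -> total read count, taken from
    the summary before and after closed-reference clustering.
    """
    result = {}
    for sample_id, n_before in before.items():
        # A sample missing from `after` lost all of its reads.
        n_after = after.get(sample_id, 0)
        if n_before == 0:
            result[sample_id] = 0.0  # nothing to lose, avoid dividing by zero
            continue
        result[sample_id] = 100.0 * (n_before - n_after) / n_before
    return result


# Hypothetical counts, for illustration only.
before = {"sample-1": 10000, "sample-2": 8000}
after = {"sample-1": 7500, "sample-2": 2000}

for sample_id, pct in percent_unmatched(before, after).items():
    print(f"{sample_id}: {pct:.1f}% unmatched")
```

Whichever table is used as the "before" count, it should be the one that was actually fed into the clustering step, so that the difference reflects only reads that failed to match the reference.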

No "Others" should appear, for two reasons. First, you are effectively already filtering out any reads that do not match the reference (those are what would be classified as "other"). Second, you are not actually performing taxonomy classification: with closed-reference OTU clustering you find the closest reference sequence to each query and then adopt the taxonomy of that reference sequence, since all clusters are now labeled with the reference sequence ID and you are using the reference taxonomy from Greengenes.
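
A toy sketch of that logic, not of what vsearch does internally: it assumes the best reference hit for each query has already been computed (the hypothetical `best_hit` mapping), and every ID and taxonomy string below is invented for illustration.

```python
from collections import defaultdict


def closed_reference_collapse(query_counts, best_hit, reference_taxonomy):
    """Collapse query features onto reference IDs.

    query_counts: query feature ID -> read count
    best_hit: query feature ID -> reference ID (queries with no hit are absent)
    reference_taxonomy: reference ID -> taxonomy string
    """
    clustered = defaultdict(int)
    unmatched = {}
    for query_id, count in query_counts.items():
        ref_id = best_hit.get(query_id)
        if ref_id is None:
            # No reference within the identity threshold: the query is
            # set aside and never reaches the clustered table.
            unmatched[query_id] = count
        else:
            clustered[ref_id] += count
    # Taxonomy is looked up by reference ID, not predicted for the query.
    taxonomy = {ref_id: reference_taxonomy[ref_id] for ref_id in clustered}
    return dict(clustered), unmatched, taxonomy


# Hypothetical inputs, for illustration only.
query_counts = {"q1": 5, "q2": 3, "q3": 7}
best_hit = {"q1": "ref_A", "q2": "ref_A"}  # q3 has no hit
reference_taxonomy = {
    "ref_A": "k__Bacteria; p__Firmicutes",
    "ref_B": "k__Bacteria; p__Proteobacteria",
}

clustered, unmatched, taxonomy = closed_reference_collapse(
    query_counts, best_hit, reference_taxonomy
)
print(clustered)   # counts keyed by reference ID
print(unmatched)   # queries that would end up in the unmatched output
print(taxonomy)    # every clustered feature has a reference taxonomy
```

Because every feature left in the clustered table is a reference ID with a taxonomy entry, nothing remains for a barplot to show as unassigned.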

Let me know if that makes any sense!

Sparkle · October 22, 2019, 3:13pm

The first suggestion worked, though I ran it on the FeatureTable produced by the clustering (table-cr-99_new.qza) and on the dereplicated table that was the input to the clustering (dereplicated_table.qza). In both cases, I used qiime feature-table summarize.
Is that right? Why start from the demultiplexed reads, as you suggested, when that step comes before the required dereplication?

It does make sense. I hadn't considered that it depended on the type of clustering performed... I'm going to try this now, thanks!

Nicholas_Bokulich · October 22, 2019, 3:42pm

Oh, that works too.

Not needed; I was just thinking of before/after the clustering pipeline (dereplication is part of clustering in my mind), but your way is more clever, since I don't think any reads should be lost at the dereplication step.
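
A minimal illustration of why exact-match dereplication should not lose reads, using made-up sequences. This only shows the counting idea and is not how vsearch or DADA2 implement it:

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into unique sequences with abundances."""
    return Counter(reads)


# Hypothetical reads, for illustration only.
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]
table = dereplicate(reads)

print(len(table), "unique sequences")
print(sum(table.values()), "reads in total")
```

The number of unique sequences drops, but the abundances still sum to the original number of reads, so the totals before and after dereplication should agree.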
