How to retain 'Unmatched' features in barplots (as 'Unassigned')?

Sparkle · October 23, 2019, 1:15pm

Hello,
I've tried to make a barplot (with qiime taxa barplot) from the results of a clustering analysis (performed with vsearch and dada2; I tried both, with different strategies and thresholds), but the plot seems to include only the clustered sequences.
I'd also like to see the unassigned sequences there.

I've noticed that qiime vsearch cluster-features-closed-reference in particular generates a file containing such sequences (via the --o-unmatched-sequences option), but I don't know how to use it for further analysis, such as creating a barplot that includes them.

For instance, in this picture found on the QIIME 2 website, the part I'd like to retain is shown in grey and labeled 'Unassigned;;'.

https:/uploads/qiime21/original/2X/c/cfe346adde1ac9b7292daa8bd85cfab5b844b3a2.gif

How do I make the unmatched sequences appear in the plot?
Also, how do I take them into account if I use the output of the --o-clustered-sequences option for the next chimera-filtering step, which would then be performed only on the clustered sequences?
Do I have to merge the files somehow?
Thanks in advance!

I'm using qiime2-2019.7

colinbrislawn · October 23, 2019, 2:04pm

Good morning!!

(Nice screen capture, btw!)

The best way to do this is to pick a denoising or OTU clustering method that never removes them in the first place.

For example, dada2 will remove reads that are errors or noise, but it will keep every other read, even if it cannot be assigned a taxonomy.

cluster-features-closed-reference is designed to remove reads that don't closely match the database. If you want to keep all your reads and still make OTUs, try
https://docs.qiime2.org/2019.7/plugins/available/vsearch/cluster-features-open-reference/
or even better! Try this:
https://docs.qiime2.org/2019.7/plugins/available/vsearch/cluster-features-de-novo/

I'm not sure. I think when people build barplots of their data, they usually choose not to show low quality reads, reads that do not pair, and reads that are chimeric. It's an interesting idea to show all your data (good data and errors, unpaired, and chimeras) in a single graph.

Colin

Sparkle · October 23, 2019, 2:27pm

I didn't make it, it appeared on Google while looking for examples of barplots including them...

For example, dada2 will remove reads that are errors or noise, but it will keep every other read, even if it cannot be assigned a taxonomy.

I tried that strategy: I used dada2, then ran classify-sklearn with a classifier trained on the 99% OTUs from Greengenes, and then made the plot with the resulting taxonomy file. But as you can see, no part of the plot is labeled 'Unassigned'. (Yes, those samples have heavy chloroplast DNA contamination, and I'm just using them for training for now.)

qiime dada2 denoise-pyro --i-demultiplexed-seqs sequenze/single_end2.qza --p-trunc-len 0 --output-dir denoise --verbose --o-denoising-stats denoise/denoising_stats.qza --o-representative-sequences denoise/representative_sequences.qza --o-table denoise/table.qza

qiime feature-classifier classify-sklearn --i-classifier gg_13_8/gg_13_8_otus/classifier.qza --i-reads denoise/representative_sequences.qza --o-classification taxonomy_dada.qza

qiime taxa barplot --i-table denoise/table.qza --i-taxonomy taxonomy_dada.qza --m-metadata-file metadata.tsv --o-visualization taxa-bar-plots_dn_99_dada2_OK.qzv

If you want to keep all your reads and still make OTUs, try
cluster-features-open-reference: Open-reference clustering of features. — QIIME 2 2019.7.0 documentation
or even better! Try this:
cluster-features-de-novo: De novo clustering of features. — QIIME 2 2019.7.0 documentation

I think I'm missing something, because I tried vsearch too, with all three possible approaches, but no 'Unassigned' appears in those plots either. Some assignments may be VERY generic (like 'Bacteria'), but none are Unassigned.

It’s an interesting idea to show all your data (good data and errors, unpaired, and chimeras) in a single graph.

Exactly, I'm particularly interested in this because I'd like to see (graphically) how many of my sequences remained unpaired, though I could do that 'manually' by following these suggestions.

colinbrislawn · October 23, 2019, 2:44pm

Ah OK.

I think you are going to have to do this manually. Qiime 2 sort of expects all the low quality, unpaired, and chimeric reads to be removed before graphing, so I think this graph will have to be made outside of Qiime. You could still use the Qiime 2 API in Python or the phyloseq package in R to make a graph like this.

While we are discussing this, I guess I wanted to talk about the terminology a bit.
Denoising is a separate step from taxonomy assignment (sequence classification).
When a sequence cannot be classified into a taxonomy, it might be called Unclassified. But it's still in the data set and will appear on graphs.
When a sequence is removed by dada2, it's labeled according to why it was removed ('unpaired, low quality, and chimeric reads'). These generally don't appear in graphs or stat tests at all, as they are considered non-informative noise.

So while I would never put chimeras on a barplot, I do think it's interesting to see 'where did my reads end up?' Perhaps something like a Sankey diagram could show where reads are 'lost' during quality control.
https://www.ifu.com/fileadmin/user_upload/esankey/content/screenshots/Sankey-diagram-food-supply-chain.png
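As a rough sketch of the bookkeeping behind such a diagram, the Python below computes how many reads are lost between consecutive processing steps. The stage names and counts are made up for illustration; in practice they would come from your own outputs (for example the dada2 denoising stats). The helper read_losses is hypothetical and not part of Qiime 2.

```python
def read_losses(stage_counts):
    """Turn ordered (stage, reads_remaining) pairs into (from, to, reads_lost) flows."""
    flows = []
    for (prev_name, prev_n), (name, n) in zip(stage_counts, stage_counts[1:]):
        if n > prev_n:
            raise ValueError(f"{name} has more reads than {prev_name}")
        flows.append((prev_name, name, prev_n - n))
    return flows


# Hypothetical read counts for one sample after each quality-control step.
stages = [
    ("input", 10000),
    ("filtered", 9200),
    ("denoised", 8700),
    ("non-chimeric", 8100),
]

for source, target, lost in read_losses(stages):
    print(f"{source} -> {target}: {lost} reads lost")
```

Each (from, to, reads_lost) tuple is the kind of link a Sankey plotting library would need, alongside the reads that carry on to the next step.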

Colin

Sparkle · October 23, 2019, 3:03pm

Yes! Once I get the number of clustered reads (after running vsearch, for example) with the summarize table plugin, as well as the number of reads before running it, I can subtract the former from the latter and easily get the information I need.
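A minimal sketch of that subtraction in Python, assuming the per-sample read totals before clustering and the per-taxon counts after clustering have been copied out by hand. All sample names and numbers, and the add_unmatched helper itself, are hypothetical.

```python
def add_unmatched(reads_before, counts_after):
    """Add an 'Unmatched' entry per sample: reads before clustering minus clustered reads.

    reads_before: {sample: total reads before clustering}
    counts_after: {sample: {taxon: clustered read count}}
    """
    result = {}
    for sample, total in reads_before.items():
        counts = dict(counts_after.get(sample, {}))
        matched = sum(counts.values())
        if matched > total:
            raise ValueError(f"{sample}: more clustered reads than input reads")
        counts["Unmatched"] = total - matched
        result[sample] = counts
    return result


# Hypothetical numbers for two samples.
reads_before = {"S1": 1000, "S2": 500}
counts_after = {
    "S1": {"Bacteria": 700, "Chloroplast": 200},
    "S2": {"Bacteria": 450},
}

with_unmatched = add_unmatched(reads_before, counts_after)
print(with_unmatched)
```

The resulting per-sample dictionaries could then be plotted as stacked bars with any plotting tool outside Qiime 2.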

Qiime 2 sort of expects all the low quality, unpaired, and chimeric reads to be removed before graphing,

Indeed, so I guess this is also true for anything that can't be clustered at the chosen identity percentage: it is called 'Unmatched' and written out separately, for instance by vsearch.

When a sequence is removed by dada2, it’s labeled according to why it was removed (‘unpaired, low quality, and chimeric reads’). These generally don’t appear in graphs or stat tests at all, as they are considered non-informative noise.

Indeed, I agree with you!

When a sequence cannot be classified into a taxonomy, it might be called Unclassified. But it’s still in the data set and will appear on graphs.

But if this is the case... why can't I see mine after performing a closed-reference clustering with vsearch? I mean, the 'Unmatched' ones.

colinbrislawn · October 23, 2019, 4:37pm

So cluster-features-closed-reference is sort of a strange method when you compare it to modern de novo denoising methods like dada2.

It's a one-step-process in which you take each one of your reads and align it to a database.

Your OTU table is just a list of how many times your reads matched the database.
Reads that don't match are ignored.
Taxonomy assignment isn't needed, as you are not making new features, just counting ones in the database, which all have taxonomy already.

Closed-ref clustering doesn't do taxonomy assignment. It's just counting what's already in the database.
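To make the counting idea concrete, here is a toy Python sketch. It is not how vsearch works internally: real closed-reference clustering aligns reads at a chosen percent identity, whereas this example assumes exact string matching as a stand-in. The function name, sequences, and reference IDs are all invented for illustration.

```python
from collections import Counter


def closed_reference_counts(reads, reference):
    """Count reads per reference ID; reads with no match are set aside.

    reference: {ref_id: sequence}. Matching here is exact string equality,
    a simplification of identity-threshold alignment.
    """
    seq_to_id = {seq: ref_id for ref_id, seq in reference.items()}
    table = Counter()
    unmatched = []
    for read in reads:
        ref_id = seq_to_id.get(read)
        if ref_id is None:
            unmatched.append(read)  # never enters the OTU table
        else:
            table[ref_id] += 1
    return table, unmatched


# Invented reference database and reads.
reference = {"ref_1": "ACGTACGT", "ref_2": "TTGGCCAA"}
reads = ["ACGTACGT", "ACGTACGT", "TTGGCCAA", "GGGGGGGG"]

otu_table, unmatched_reads = closed_reference_counts(reads, reference)
print(dict(otu_table))
print(unmatched_reads)
```

The table only ever contains IDs that already exist in the reference, which is why no 'Unassigned' category can show up downstream: the unmatched reads were set aside before any table was built.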

Colin
