Output in vsearch cluster-features-closed-reference differs depending on --p-strand option


I’m working with the qiime vsearch cluster-features-closed-reference command. It has an optional parameter, --p-strand, which can be set to 'plus' or 'both'. As I understand it, the first is used when all your sequences are in a single direction (forward or reverse), and the second when you have a mix of both directions.

As I am working with the Ion Torrent 16S Metagenomic Kit, which I think mixes both directions, I tried both options out of curiosity, to see if there were any differences.

This was what I got:

  1. Using ‘plus’: I got many unmatched sequences. There were 1407 representative sequences and 2696 unmatched ones. (Total features = 4103)
  2. Using ‘both’: The unmatched sequences diminished. I got just 997 unmatched features, but, the rep-seqs number also decreased, obtaining just 1754 features. (Total features = 2751)

Why is this happening? Why am I getting different number of features? Shouldn’t the total number of features be the same for both options?

And… in my demux summarize table I get 4.814.360 total sequences… I thought that each feature corresponded to one sequence, so why am I getting such a small number of features?

I’m quite new to qiime2 and maybe the answer is too obvious, but it’s getting me confused… :frowning:

Thank you in advance :smiley:

Hi @MiriamGorostidi,

Let me take a shot at answering this…

You are correct to use the --p-strand both setting.

Ahh… interesting. I think I may be able to explain this counter-intuitive result. For #1, it is entirely possible to obtain false hits. That is, there is always a possibility that a reverse-oriented read will match, in part, a sequence fragment in the reference database that is oriented in the opposite direction. We see this happen quite often when users try to assign taxonomy to reads that are not oriented the same way as the reference database. This can create the illusion that you are observing more features than there actually should be. This will not always be the case, of course.

For #2, when your reads are correctly oriented with respect to the reference database, you are in effect increasing the similarity of your reads to one another. Let’s say you have two identical sequences in your data set, but one is in the opposite orientation with respect to your reference database. The closed-reference algorithm (set to plus) will correctly identify one of those reads, and will either be unable to identify the other read or will mis-identify it as something entirely different. The latter will create the illusion of more features. When the reads are correctly oriented, those identical reads will correctly map to the same reference, thus reducing the number of observed features in your data. Does this make sense?
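To make the orientation effect concrete, here is a minimal toy sketch in Python. It is not how vsearch works internally: exact string matching stands in for the identity search, it does not model the partial false hits mentioned above, and the function names, reference IDs, and sequences are all made up for illustration.

```python
# Toy model only: exact string matching stands in for vsearch's identity search.
# All names and sequences here are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def closed_reference(reads, reference, strand="plus"):
    """Return (matched reference IDs, unmatched reads) for a toy exact-match search."""
    lookup = {seq: ref_id for ref_id, seq in reference.items()}
    matched, unmatched = set(), set()
    for read in reads:
        candidates = [read]
        if strand == "both":
            # Also try the read in the opposite orientation.
            candidates.append(reverse_complement(read))
        hit = next((lookup[c] for c in candidates if c in lookup), None)
        if hit is None:
            unmatched.add(read)
        else:
            matched.add(hit)
    return matched, unmatched


reference = {"ref1": "ACGTTGCAAG", "ref2": "GGATCCTTAA"}
forward = "ACGTTGCAAG"
# The same organism read in both orientations, plus one read with no reference match.
reads = [forward, reverse_complement(forward), "TTTTTTTTTT"]

for strand in ("plus", "both"):
    matched, unmatched = closed_reference(reads, reference, strand)
    print(strand, len(matched), len(unmatched), len(matched) + len(unmatched))
```

With 'plus', the reverse-oriented copy ends up as its own unmatched feature; with 'both', it collapses onto the same reference as the forward copy, so the total number of features goes down.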

Features, in this case, are merely the unique representative sequences in your data, i.e. ESVs or OTUs. For more information check out parts of this overview. A single feature can occur in your sample 1,000 times: you then have 1 feature, present in your data with a total frequency of 1,000 individual reads or sequences. The number of features and the number of sequences in a given feature table can be seen in a visualization. Check out the example on the qiime2 view page, entitled "Feature Table Summary".
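The difference between the number of features and the number of sequences can be sketched in a few lines of Python. This is only an illustration with made-up reads: it treats each unique sequence as a feature, whereas real OTU clustering also groups sequences that are similar but not identical.

```python
from collections import Counter

# Hypothetical reads; in a real run these would come from the demultiplexed data.
reads = ["ACGT"] * 1000 + ["GGCC"] * 250 + ["TTAA"] * 3

# Toy feature table: feature (unique sequence) -> frequency (number of reads).
feature_table = Counter(reads)

num_features = len(feature_table)               # how many distinct features
total_frequency = sum(feature_table.values())   # how many reads in total

print(num_features, total_frequency)
```

Here 1,253 reads collapse into just 3 features, which is why millions of input sequences can yield only a few thousand features.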

I hope this helps!


OMG @SoilRotifer !! This explanation has been really clear! Now everything makes sense :smiley:
Thank you so much for that and for answering so fast!!

