Percentage of sequences that survive clustering

Hi,
Often when I do OTU picking, I notice that a large number of sequences fail to cluster. What percentage of sequences should cluster successfully, relative to those that fail, for further analysis to be valid?

Hi @Dzana_Basic,
Could you please provide the example commands that you are using to give us a better sense of your process?

It sounds like you are probably using a closed-reference approach — you may want to try de novo or open-reference instead, or adjust your % identity cutoff value.

I am not sure that there are simple guidelines here — sequences failing to cluster may be from (and this is not an exhaustive list, just some thoughts):

  1. poor choice of reference database or clustering parameters (need to adjust to improve clustering)
  2. a high number of sequences that do not match the reference sequences closely (e.g., novel species especially if you are working with poorly characterized sample types, and you will probably not want to discard this information! hence open-ref or de novo clustering would be better)
  3. noisy sequences, contamination, chimeras, etc. (stuff you do not want!). It may be worth BLASTing some of your seqs that fail to cluster to determine what they may be. Other methods supported in QIIME 2, e.g., DADA2 and Deblur, will remove noisy sequences and detect actual sequence variants, so you may also want to give these methods a try instead.
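
For the BLAST suggestion in point 3, here is a minimal shell sketch for pulling a handful of the failed sequences into their own FASTA file. It assumes you have a text file of failed sequence IDs (one per line) and the FASTA file that was clustered, with each header starting with the sequence ID. The function name `sample_failures` and the file names in the usage line are hypothetical, so adapt them to your own output.

```shell
# Usage: sample_failures IDS_FILE FASTA_FILE [N]
# Prints the FASTA records for the first N IDs listed in IDS_FILE (default 10).
# Assumes IDS_FILE is non-empty and each FASTA header starts with ">ID".
sample_failures() {
    ids_file=$1
    fasta_file=$2
    n=${3:-10}
    head -n "$n" "$ids_file" | awk '
        NR == FNR { wanted[$1] = 1; next }   # first input (stdin): IDs to keep
        /^>/ { id = substr($1, 2); keep = (id in wanted) }   # header line: decide whether to keep this record
        keep { print }                        # print header and sequence lines of kept records
    ' - "$fasta_file"
}
```

For example, `sample_failures failed_ids.txt seqs.fna 20 > failed_subset.fna` would give you 20 sequences to paste into BLAST.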

I hope that helps!

Hi Nicholas, these are the commands that I use in my runs (REF_SEQS_PATH and TAX_PATH are taken from the Greengenes database):

split_libraries_fastq.py -i sample.fastq --sample_ids 'sampleID1' -o sl --barcode_type not-barcoded
cd sl

# trivially parallelized, but still a slow step

parallel_pick_otus_uclust_ref.py -i seqs.fna -o po_closed/uclust_ref_picked_otus -r $REF_SEQS_PATH

make_otu_table.py -i po_closed/uclust_ref_picked_otus/seqs_otus.txt -t $TAX_PATH -o po_closed/otu_table.biom

And results of parallel_pick_otus_uclust_ref.py:

seqs_failures.txt - 589.4kB
seqs_otus.txt - 321.5kB

The log file also contains, among other things:

Num OTUs:561
Num new OTUs:0
Num failures:37369

Maybe you can spot what I'm doing wrong. This sample is not very big, but with bigger samples I get even worse results.
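
To put a number on the share of sequences that failed, the two output files can be counted directly. This is a minimal sketch, assuming `seqs_otus.txt` is a tab-separated OTU map (OTU ID followed by its member sequence IDs) and `seqs_failures.txt` lists one sequence ID per line. The function name `cluster_summary` is hypothetical, and the location of `seqs_failures.txt` in the usage line is assumed to be the same output directory.

```shell
# Usage: cluster_summary OTU_MAP FAILURES
# Prints how many sequences clustered, how many failed, and the failed percentage.
cluster_summary() {
    otu_map=$1
    failures=$2
    # Each OTU map line: OTU ID, then one tab-separated sequence ID per member.
    clustered=$(awk -F '\t' 'NF > 1 { n += NF - 1 } END { print n + 0 }' "$otu_map")
    # Each non-empty line of the failures file is one sequence ID.
    failed=$(awk 'NF { n++ } END { print n + 0 }' "$failures")
    awk -v c="$clustered" -v f="$failed" 'BEGIN {
        total = c + f
        pct = (total > 0) ? 100 * f / total : 0
        printf "clustered=%d failed=%d failed_pct=%.1f\n", c, f, pct
    }'
}
```

It could be run as `cluster_summary po_closed/uclust_ref_picked_otus/seqs_otus.txt po_closed/uclust_ref_picked_otus/seqs_failures.txt`. The percentage is only a rough indicator; whether it is acceptable still depends on why the sequences failed.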

I just noticed that this is the QIIME 2 forum :slight_smile: (I posted the question on the QIIME 1 forum as well).
The commands that I posted previously are for QIIME 1, sorry.
But anyhow, if someone is still using QIIME 1, it would be helpful if they could comment.

Hi @Dzana_Basic, thanks for clarifying. We are unable to provide support for QIIME 1 here (please note, there are no longer any official support channels for QIIME 1 - we highly recommend that you transition to QIIME 2). Thanks!

Hi @Dzana_Basic,
You can read this tutorial to learn how to perform OTU picking in QIIME 2 (but I would encourage checking out denoising methods, as demonstrated here).

Thank you, Nicholas. I will definitely check it out. :slight_smile:
