acceptable amount of unassigned OTUs

Hi all!

I hope everything is going well.
I am writing to ask about the fraction of OTUs for which the taxonomic assignment fails.
What is an acceptable amount of unassigned reads in an experiment?
I am dealing with both high and low biomass samples; however, even in the high biomass samples I find more than 20% unassigned when I consider the OTUs grouped at the phylum level.
I am currently working on OTU tables produced by QIIME 1.

Thanks a lot,


Hi @MichelaRiba,

What type of data is this? 16S, ITS, COI…?
The answer here really depends on a) your sample types and b) how well that sample type is represented by the reference database you are using. If I saw 20% unassigned with Greengenes/Silva at the phylum level in human/mouse/rat fecal samples, I would consider that too high and would think something has gone wrong in my pipeline. If I were looking at some uncharacterized ecosystem, that 20% may be more acceptable because the reference database may simply not have those rare organisms. That being said, very rarely have I seen a pipeline lead to 20% unassigned after proper handling. By proper handling I mean making sure you have gotten rid of chimeras, removed primers/non-biological sequences from your reads, removed untargeted (host) contamination, etc. All of these issues are also more pronounced with low biomass samples.

If you are able to re-start with QIIME 2, I would strongly recommend that. The denoisers in QIIME 2 (dada2/deblur) do a much better job of quality control than what was default in QIIME 1.


I thank you a lot for this discussion!

I am currently referring to human gut 16S microbiome data, although I also have low biomass samples (urine).

Regarding contaminating sequences, I checked the input files (joined FASTQ pairs) with FastQC and did not find adapter contamination in the samples considered.

Regarding QIIME 2:
At the moment I cannot switch to QIIME 2. I have already tried setting up the pipeline on other samples using vsearch instead of usearch (but not DADA2/Deblur), since my first goal is to reproduce the QIIME 1 pipeline, even with some differences.

So maybe the unassigned fraction is related to how well the sequences can be matched against the database:

  • contamination from human sequences (?)
  • something related to the OTU clustering and representative sequence picking

I report the parameters I used for OTU clustering
uclust --input slout_single_sample_q20/otus/rep_set.fna --id 0.9 --rev --maxaccepts 3 --allhits --libonly --lib /lustre1/ctgb-usr/local/miniconda3/envs/qiime1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta --uc /lustre2/scratch/tmp/UclustConsensusTaxonAssigner_GCtxUz.uc

I am also copying part of the parameters used. I do not want to be annoying, but maybe you can spot something important in the settings.

qiime_config values:
pick_otus_reference_seqs_fp	qiime1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
sc_queue	all.q

torque_queue friendlyq
jobs_to_start 1
denoiser_min_per_core 50
temp_dir /lustre2/scratch/tmp/
blastall_fp blastall
seconds_to_sleep 1
parameter file values:
parallel:jobs_to_start 1
pick_otus:max_rejects 8
pick_otus:word_length 8
pick_otus:max_accepts 1
pick_otus:stepwords 8
pick_otus:enable_rev_strand_match True

Regarding the laboratory preparation, could you suggest guidelines to avoid contamination?
For example, is it important to excise the PCR band from the gel to improve specificity?

Is it possible to exclude, for example, human sequences (mitochondria-derived?) before doing the clustering and taxonomy assignment?

Thank you very much,


I am sorry for writing again, but I have to say that the 20% I reported comes from the output of the function otuReport in the R library otuSummary.
If I instead consider the complete table, which has a per-sample percentage, the values I see are really different.
Here is the summary of the distribution of unassigned_NA (in percent):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.2941 1.0219 2.1235 3.0430 22.6126

Maybe there is something I need to check about how the otuSummary function builds the overall report.

I also checked directly in the taxonomy tables produced by the QIIME 1 analysis. Note that this is a larger dataset of nearly 200 samples, while the previous results refer to a subset of 40.
The result is pretty much the same, even if I have some noisy samples.
Note that here the results are the original values, expressed as fractions rather than percentages:

Min.  1st Qu.   **Median**     Mean  3rd Qu.     Max. 
0.001726 0.015072 **0.020682** 0.021222 0.026037 0.072456 
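
For anyone who wants to cross-check per-sample values like the ones above without going through an R summary function, here is a minimal Python sketch. It assumes a tab-separated phylum-level table with taxa as rows and samples as columns, a header line starting with `#OTU ID`, and unassigned rows whose label starts with `Unassigned`. The function name and the example table are hypothetical and only meant for illustration; the real file layout may differ.

```python
import csv
import io
import statistics


def unassigned_fraction_per_sample(table_text, label="Unassigned"):
    """Return {sample_id: fraction of the sample total in rows starting with `label`}."""
    # Skip empty lines and free-text comment lines ("# ..."), keep the "#OTU ID" header.
    rows = [r for r in csv.reader(io.StringIO(table_text), delimiter="\t")
            if r and not r[0].startswith("# ")]
    header, body = rows[0], rows[1:]
    samples = header[1:]
    totals = dict.fromkeys(samples, 0.0)
    unassigned = dict.fromkeys(samples, 0.0)
    for row in body:
        taxon, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            v = float(value)
            totals[sample] += v
            if taxon.startswith(label):
                unassigned[sample] += v
    # Guard against empty samples to avoid dividing by zero.
    return {s: (unassigned[s] / totals[s] if totals[s] else 0.0) for s in samples}


# Hypothetical two-sample table (counts), for illustration only.
example = (
    "#OTU ID\tS1\tS2\n"
    "k__Bacteria;p__Firmicutes\t60\t45\n"
    "k__Bacteria;p__Bacteroidetes\t38\t50\n"
    "Unassigned;Other\t2\t5\n"
)
fractions = unassigned_fraction_per_sample(example)
print(fractions)
print("median:", statistics.median(fractions.values()))
```

Because each sample is divided by its own total, the function works on either raw counts or relative abundances. The result is a fraction of reads per sample, which is not the same quantity as the fraction of OTUs that lack an assignment.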

In the end, can I conclude that for now it is acceptable to have 2% (instead of 20%) unclassified sequences at the L2 level in a gut microbiome experiment?

Thanks a lot!!!


Hi @MichelaRiba ,
Just to comment — often unclassified sequences are not necessarily novel uncharacterized biodiversity. Often it is non-target DNA or other “garbage” that should be removed. In my experience, if a sequence does not classify to at least phylum level it is often non-target, e.g., host DNA. I recommend spot-checking some unassigned sequences, e.g., using NCBI BLAST, to see if they classify as host, PhiX, or other unwanted sequences…
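
To make the spot-check easier, here is a minimal Python sketch that pulls a handful of unassigned representative sequences into FASTA text that can be pasted into NCBI BLAST. It assumes a tab-separated taxonomy assignment file with the OTU ID in the first column and the taxonomy string in the second, with unassigned entries labelled `Unassigned`, plus a representative-set FASTA whose headers start with the OTU ID. The function names and the toy data are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unassigned_ids(tax_lines, label="Unassigned"):
    """Collect OTU IDs whose taxonomy column starts with `label`."""
    ids = set()
    for line in tax_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip().startswith(label):
            ids.add(fields[0])
    return ids


def select_unassigned(fasta_lines, tax_lines, limit=20):
    """Return FASTA text with up to `limit` unassigned representative sequences."""
    wanted = unassigned_ids(tax_lines)
    picked = []
    for header, seq in read_fasta(fasta_lines):
        # The OTU ID is taken to be the first whitespace-delimited token of the header.
        otu_id = header.split(None, 1)[0] if header.strip() else ""
        if otu_id in wanted:
            picked.append(">%s\n%s" % (header, seq))
            if len(picked) >= limit:
                break
    return "\n".join(picked)


# Toy data for illustration only.
fasta = [">denovo0 S1_1\n", "ACGTACGT\n", "ACGT\n", ">denovo1 S2_7\n", "TTGGCCAA\n"]
tax = ["denovo0\tk__Bacteria; p__Firmicutes\t1.00\t3\n",
       "denovo1\tUnassigned\t1.00\t1\n"]
subset = select_unassigned(fasta, tax)
print(subset)
```

With real data, the two arguments can be open file handles for the representative-set FASTA and the taxonomy assignment file. Sorting by abundance first would let you check the most common unassigned OTUs, but that step is left out here.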

Good luck!


Thanks a lot for your kind follow-up.

I will check some sequences.

Thanks a lot again