The high ratio of unassigned sequences

5_withoutbg_gg_taxa-bar-plots.qzv (500.6 KB) 5_withoutbg_taxa_bar_plots.qzv (788.2 KB)

Hello,

Please check the two .qzv files. I used two classifiers to assign taxonomy to the representative sequences from DADA2, but the proportion of "Unassigned" sequences, which cannot be assigned even at the kingdom level, is still very high.

All samples were sequenced with PE300, targeting the V4 region. The classifiers were trained on the Greengenes database and on the newest SILVA database, using my own primer sequences.

I changed the DADA2 parameters to retain longer reads, but this did not reduce the proportion of "Unassigned" sequences.

It is worth noting that the "Unassigned" sequences only exist in the isotope-labelled samples.

I want to know how to deal with these "Unassigned" sequences. For example, if I run a LEfSe analysis to look for significant biomarkers, should I take these unassigned sequences into account?

Thanks in advance!

Morton

Hi @nmgduan,
Sounds like you are suffering from a non-target DNA problem.

Retaining longer reads will not fix it, because the issue is not with read length or with the classifier itself. Your other classifications are actually very good! The fraction of unclassified and kingdom-only classifications points to a non-target DNA issue.

Yes, the fact that these sequences occur only in the isotope-labelled samples may be the issue. You should examine your protocol to see if there is any reason why these samples may contain non-biological DNA.

Do two things in this order:

  1. Spot-check a few of these reads with NCBI BLAST to see what kind of non-target DNA this is (e.g., is it host DNA? Is it actually bacterial but contains some kind of adapter that is causing classification to fail?). If these are non-target, proceed to step 2.
  2. Use qiime feature-table filter-features to remove any features that do not have at least phylum-level classification. The online tutorials give an example of how to do this.
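
Before filtering the real table, the rule in step 2 can be prototyped on the taxonomy strings alone. This is a minimal Python sketch, not a QIIME 2 command: it assumes the taxonomy is available as a mapping from feature ID to taxon string in Greengenes style (`k__...; p__...`) or SILVA style (`D_0__...;D_1__...`). The function names and the feature IDs in the usage note are hypothetical.

```python
def has_phylum(taxon: str) -> bool:
    """Return True if a taxon string carries a named second rank (phylum)."""
    ranks = [rank.strip() for rank in taxon.split(";")]
    if len(ranks) < 2:
        # "Unassigned" or a kingdom-only label such as "k__Bacteria"
        return False
    # Drop the rank prefix ("p__" or "D_1__") and check that a name remains,
    # so an empty placeholder such as "p__" does not count as classified.
    name = ranks[1].split("__", 1)[-1]
    return bool(name)


def features_with_phylum(taxonomy: dict) -> list:
    """Return the sorted feature IDs that have at least a phylum-level label."""
    return sorted(fid for fid, taxon in taxonomy.items() if has_phylum(taxon))
```

For example, `features_with_phylum({"f1": "Unassigned", "f2": "k__Bacteria; p__Firmicutes"})` keeps only `f2`. The resulting ID list could then serve as the basis for the actual filtering step in QIIME 2.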

Good luck!

Thank you for your prompt reply.

I will try as you suggest, and post the results here.

Hi @Nicholas_Bokulich,

For a more accurate assessment, I checked all of the “Unassigned” sequences with NCBI BLAST.

The results show that all of these reads are from prokaryotes, with hits such as “uncultured prokaryote clone”, “uncultured bacterium clone”, “uncultured archaeon clone” and so on.

Some reads match hits with genus-level classification (e.g., “uncultured Wolinella sp.”, “uncultured Sedimentibacter sp.”) or family-level classification (e.g., “Bradyrhizobiaceae”).

The samples were collected from an anaerobic digester, so I do not think host DNA is present.

Do you think these “Unassigned” sequences should be retained as “Unassigned” for downstream analyses such as LEfSe? What do you suggest?

Thanks in advance!

Morton

Use the "exclude uncultured" option. Look at the quality of alignment.

I still think this is probably junk DNA. Since it is found only in the isotope-labeled samples, it sounds like a methodological artifact (unless you expect that isotope labelling would pick up species that standard sequencing would not).

Are the primers removed from the sequences?

16S_Database_NCBI.txt (194.2 KB)
Hi @Nicholas_Bokulich,

The primers have been removed from the sequences, and the DADA2 parameters --p-trim-left-f 13, --p-trim-left-r 13, --p-trunc-len-f 220 and --p-trunc-len-r 200 should ensure that all non-biological sequences are removed at the DADA2 step.

I tried lowering --p-confidence in qiime feature-classifier classify-sklearn from 0.7 to 0.5, but it did not help.

As you suggested, I checked all of the "Unassigned" sequences against the 16S ribosomal RNA (Bacteria and Archaea) database in NCBI BLAST instead of the nr/nt database. Please check the attached file, which shows the quality of the alignments.

I suspect this situation is caused by incomplete annotations in consensus_taxonomy_7_levels.txt in SILVA_132_QIIME_release.

For example, the best NCBI alignment for the query sequence f7f26f1612625f7470717e7039902136 is the Oligotropha carboxidovorans OM5 16S ribosomal RNA, but there is no annotation for carboxidovorans in the .txt file.

The high proportion of "Unassigned" sequences will not affect the calculation of α- and β-diversity, but it can affect the description of community composition and the detection of taxonomic differences between groups (LEfSe) :worried:.

I think I should keep them as a separate category named "Unassigned", so that there are three kingdom-level groups: Bacteria, Archaea and Unassigned. After all, they can be aligned to specific organisms. What do you think? :star_struck:

Morton

Thanks for sharing those results.

I see the % identity is high for many of these, but I also see a few other things:

  1. I do not see the % coverage. The % identity only describes the aligned region, so some of these may be partial alignments. Could you have chimeras that DADA2 is not catching, or some other artifact?
  2. The % identity is low (≤ 95%) for maybe around 50% of these, which of course could mean novel organisms or, e.g., chimeras.
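
Both checks can be run over the whole hit table rather than by eye. This is a minimal Python sketch under stated assumptions: the table is comma-separated with the usual 12 tabular BLAST columns (query, subject, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, e-value, bit score), and query lengths are supplied separately because that layout has no coverage column. The function names and the 90% coverage cutoff are illustrative, not taken from this thread.

```python
import csv
import io


def best_hits(hit_table_text: str) -> dict:
    """Keep the highest-bit-score hit per query from a 12-column hit table."""
    best = {}
    for row in csv.reader(io.StringIO(hit_table_text)):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = {
            "subject": row[1],
            "pident": float(row[2]),
            "qstart": int(row[6]),
            "qend": int(row[7]),
            "bitscore": float(row[11]),
        }
        query = row[0]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


def flag_suspect(best: dict, query_lengths: dict,
                 max_low_identity: float = 95.0,
                 min_coverage: float = 90.0) -> dict:
    """Flag queries whose best hit has low identity or covers only part of the query."""
    flagged = {}
    for query, hit in best.items():
        aligned = abs(hit["qend"] - hit["qstart"]) + 1
        coverage = 100.0 * aligned / query_lengths[query]
        if hit["pident"] <= max_low_identity or coverage < min_coverage:
            flagged[query] = (hit["pident"], round(coverage, 1))
    return flagged
```

A query flagged for low coverage but high identity is the pattern one would expect from a chimera or an adapter remnant, whereas low identity at full coverage is more consistent with a genuinely divergent organism.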

Incomplete species-level annotation should not be an issue: the classifier should classify to the nearest common ancestor, e.g., the family or genus.

These sequences could be a problem for alpha and beta diversity if they represent noise. E.g., if they are chimeras, you do not want to count them when estimating alpha diversity.

Keeping them as a separate "Unassigned" category works.

All in all, these do appear bacterial enough that I would not discard them — but I still do not know why the classifier is not classifying these. I suppose the good news is that they are just a fraction of your overall sequences, which otherwise appear to be classified well.

Hello @Nicholas_Bokulich,

Please check the attached .tsv file. The query coverage is 100% for almost all sequences. I then used qiime vsearch uchime-denovo to detect chimeric feature sequences in the DADA2 output. Only 7 of 2517 sequences were identified as chimeras, and their abundance was negligible (6-42 reads across all 36 samples). So I think the impact of chimeras is negligible.
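
The chimera impact can be expressed as two proportions: the share of features flagged (7 of 2517 is about 0.28%) and the share of reads those features carry. This is a minimal Python sketch of that arithmetic, assuming per-feature read totals summed across all samples are available as a dictionary. The function name is hypothetical and no real counts from this dataset are used.

```python
def chimera_share(feature_totals: dict, chimera_ids) -> tuple:
    """Return (fraction of features, fraction of reads) attributed to chimeras.

    feature_totals maps feature ID -> total reads summed over all samples.
    """
    total_reads = sum(feature_totals.values())
    if not feature_totals or total_reads == 0:
        raise ValueError("feature table is empty")
    # Only count flagged IDs that are actually present in the table
    present = [fid for fid in chimera_ids if fid in feature_totals]
    chimera_reads = sum(feature_totals[fid] for fid in present)
    return len(present) / len(feature_totals), chimera_reads / total_reads
```

If both fractions are well below one percent, removing or keeping the flagged features should make little difference to downstream diversity estimates.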

I think I should do downstream analysis without discarding them. :muscle:

RC24NNZT014-Alignment-HitTable.csv (207.4 KB)

Makes sense! Thanks for sharing the full blast report.
