DADA2 giving fewer observed ASVs and taxonomy classifier

Hi,
I am new to microbial analysis and got stuck with my analysis. I have installed qiime2-2022.2 in my conda env. I have RNAseq data obtained using the Shoreline Biome complete kit (strain ID), which is a PacBio product. I followed their procedure for advanced ASV analysis, which runs the DADA2 pipeline.R script in the terminal. The ASVs were classified using the Athena database provided in the SBanalyzer software. The generated files were then imported into QIIME 2 for further analysis using the following commands:

-Import the ASV .tax feature taxonomy into Qiime2
qiime tools import --input-path <path to .tax file> --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --output-path taxonomy.qza
where
<path to .tax file> is the absolute path to the ASV .tax file generated by sbsearch.exe

-Import the .biom feature table into Qiime2
qiime tools import --input-path <path to .biom table> --type 'FeatureTable[Frequency]' --input-format BIOMV100Format --output-path table.qza
where
<path to .biom table> is the absolute path to the .biom ASV frequency table file generated by DADA2

-Import the ASV .fasta file into Qiime2
qiime tools import --input-path <path to ASV .fasta> --type 'FeatureData[Sequence]' --output-path rep_seqs.qza
where
<path to ASV .fasta> is the absolute path to the ASV .fasta file generated by DADA2

After this I followed the "Moving Pictures" tutorial.
I made the feature table using the files above, then followed the tutorial to generate a tree for phylogenetic diversity and to compute alpha and beta diversity. My questions are:

  1. The number of observed ASVs in my samples is too low (e.g., the lowest sample has only 8 ASVs).

  2. I plotted the taxonomy barplot directly from the taxonomy file that was imported into QIIME 2 (taxonomy.qza). I am not sure if I need to make a taxonomy classifier. As the ASVs were classified using the Athena database, can I use a SILVA-trained classifier? How do we make 99% or 97% ASV sequences and 99% or 97% ASV taxonomy?
    I read through "Training feature classifiers with q2-feature-classifier" but did not see the percentage mentioned anywhere.

  3. How do I find out how many ASV reads there were in each sample?

Thank you!

@Vetshweta,
Welcome to the community! Let's see if we can get you unstuck.

Good work getting your analysis off the ground so far. I am going to suggest that you try to stay inside one set of tools, as the analysis process is complicated enough without jumping between software packages. In this case I would advocate for sticking with QIIME 2 as much as possible, as we aim to deliver a complete set of tools, laid out in a consistent and well-documented manner.

First, if you have not seen them already, I want to make you aware of two resources: plugin-workflows and QIIME 2 for Experienced Microbiome Researchers. These pages can help clarify the overall process and help you plan out the roadmap for your analysis.

Following this line of reasoning: this year, support for DADA2 denoising of PacBio CCS data was added to the q2-dada2 plugin, so your initial denoising and ASV generation should be able to happen inside of QIIME 2 now! Check out the docs for this functionality here. Additionally, this functionality was largely developed by the DADA2 development team, and its settings might be better tuned to getting the most out of CCS data than the instructions provided by your sequencing center (no promises there).

Getting optimal denoising results can take a bit of tweaking. You will often lose usable data if DADA2 detects too low a quality score at any point in a read, so it is often better to trim your data to where the quality remains high, even if you end up losing some base pairs (not really an issue with PacBio sequencing, there are lots!). This keeps entire samples from being thrown out as quickly. Here are some videos from one of our workshops that provide a bit more detail. A quality drop near the end of the read is in some ways less of a problem with long-read technologies, but the principles still apply, and getting these settings correct could save a lot of data. You will have to use a 'Manifest' import (docs) and cutadapt demux-single (docs) to demultiplex.
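To make the trimming point a bit more concrete, here is a minimal Python sketch of an expected-errors filter, which is one of the quality filters DADA2 applies (it is related to, but not the same as, the per-base quality cutoff mentioned above). The read, the threshold value, and the function names are all made up for illustration; this is not DADA2's actual code.

```python
# Toy illustration (hypothetical read and threshold) of why truncating a
# low-quality tail can rescue a read from an expected-errors filter.
# Expected errors (EE) = sum over bases of 10 ** (-Q / 10), Q = Phred score.

def expected_errors(quals):
    """Return the expected number of errors for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_filter(quals, max_ee, trunc_len=None):
    """Truncate to trunc_len (if given), then keep the read only if EE <= max_ee."""
    if trunc_len is not None:
        if len(quals) < trunc_len:
            return False  # reads shorter than the truncation length are dropped
        quals = quals[:trunc_len]
    return expected_errors(quals) <= max_ee

# Hypothetical read: 100 high-quality bases followed by a 20-base poor tail.
read_quals = [40] * 100 + [10] * 20

MAX_EE = 2.0  # illustrative threshold, an assumption for this sketch

print("EE of full read:", round(expected_errors(read_quals), 3))
print("full read kept:", passes_filter(read_quals, MAX_EE))
print("truncated read kept:", passes_filter(read_quals, MAX_EE, trunc_len=100))
```

In this toy case the 20 low-quality bases contribute almost all of the expected errors, so the full read is discarded while the truncated read is kept.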

It is likely still worth creating taxonomically annotated data, as opposed to performing only strictly distance-based analyses (such as the diversity methods), where the distances are computed directly from the sequences themselves.

You can still use a SILVA classifier :slightly_smiling_face: Matching percentages only really apply to OTUs, not ASVs (which are generally a more accurate, modern approach; see this paper for more). But you can still use a classifier trained on OTUs! If you are interested in training your own classifier, I would check out RESCRIPt, as it can make this process a lot easier. It would probably be worth doing the classification with the generic classifier first, just to get the process down, then going back and training your own if it still feels necessary later. It can be a slow process :sweat_smile:.
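On the third question (reads and ASVs per sample): inside QIIME 2, `qiime feature-table summarize` run on table.qza should report per-sample frequencies. The sketch below only illustrates what those numbers mean, using a tiny made-up feature table; the sample names, ASV IDs, and counts are all hypothetical.

```python
# Toy feature table (hypothetical values): ASV -> {sample -> read count}.
table = {
    "ASV_1": {"sampleA": 120, "sampleB": 0, "sampleC": 15},
    "ASV_2": {"sampleA": 30, "sampleB": 45, "sampleC": 0},
    "ASV_3": {"sampleA": 0, "sampleB": 5, "sampleC": 60},
}

def per_sample_summary(table):
    """Return {sample: (total_reads, observed_asvs)} for a nested-dict table."""
    summary = {}
    for counts in table.values():
        for sample, n in counts.items():
            reads, asvs = summary.get(sample, (0, 0))
            # An ASV counts as "observed" in a sample only if it has reads there.
            summary[sample] = (reads + n, asvs + (1 if n > 0 else 0))
    return summary

for sample, (reads, asvs) in sorted(per_sample_summary(table).items()):
    print(f"{sample}: {reads} reads across {asvs} ASVs")
```

Total reads per sample is the column sum of the table, and observed ASVs is the number of non-zero entries in that column.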

Hope this helps and if I have missed anything or you have other questions, let me know!


Thank you for your response Keegan.
I am using the Silva 138 99% OTUs full-length sequences classifier to assign taxonomy. But some of the ASVs do not get a species-level assignment. Is there a classifier that is species-specific?

I read somewhere that we should only consider the Shannon/Simpson indices and not the observed ASVs obtained from DADA2, as DADA2 does not retain singletons. I could not find that post though. Is it okay if I report indices other than the observed ASVs obtained from the DADA2 pipeline?


@Vetshweta,

With regard to not obtaining species-level classification, this is pretty common and results from a combination of the length of the input reads and the classifier. The machine learning model used to generate the taxonomy has to hit a certain confidence level before it will make an assignment at a particular taxonomic level, and often with short-read sequencing it is difficult to reach this level of confidence.
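As a rough illustration of the confidence idea, here is a simplified Python sketch. It is not the classifier's actual implementation; the lineage, the per-rank confidence values, and the function name are hypothetical, and 0.7 is used as the threshold because, as far as I know, that is the default confidence setting of classify-sklearn.

```python
# Simplified sketch: keep taxonomic ranks from the top down and stop at the
# first rank whose confidence falls below the threshold.

def truncate_by_confidence(lineage, confidences, threshold=0.7):
    """Return the leading ranks whose confidence meets the threshold."""
    kept = []
    for name, conf in zip(lineage, confidences):
        if conf < threshold:
            break
        kept.append(name)
    return kept

# Hypothetical lineage, domain through species.
lineage = [
    "Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
    "Lactobacillaceae", "Lactobacillus", "Lactobacillus gasseri",
]

# Made-up confidences: a short read loses confidence at the species rank,
# while a longer read carries enough signal to stay above the threshold.
short_read_conf = [1.0, 1.0, 0.99, 0.98, 0.97, 0.92, 0.45]
long_read_conf = [1.0, 1.0, 1.0, 0.99, 0.99, 0.98, 0.85]

print("; ".join(truncate_by_confidence(lineage, short_read_conf)))
print("; ".join(truncate_by_confidence(lineage, long_read_conf)))
```

With these made-up numbers the short read is reported only to genus, while the longer read keeps its species label.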

The commonly used V3V4 region in particular is simply not quite long enough to consistently produce confident species-level identification. This is a function of the statistical power you can generate with a limited number of base pairs in a read, rather than an issue with the database itself.

That is not to say that you absolutely will not be able to produce more species-level matches using a different classifier. However, rather than a "better" general classifier, you may be able to produce somewhat better results by training a classifier tailored to your environment or particular experiment; see the link to RESCRIPt.

DADA2 drops singletons, because they are far more likely to indicate a sequencing error than a relevant biological distinction. Thus, the ASVs produced by DADA2 should have shannon/simpson indexes that are essentially the same as those produced by clustering methods. In fact, in the "Moving Pictures" tutorial, the ASV output from DADA2 is used to calculate all of the diversity indexes, so you should be good to go :slightly_smiling_face:
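To show why dropping singletons matters much more for observed richness than for Shannon or Simpson, here is a small Python sketch on a made-up count vector. The counts are hypothetical, Shannon is computed in base 2, and Simpson is written as 1 - sum(p_i^2); both are assumptions of this sketch, so check the definitions your tool uses.

```python
import math

def observed_features(counts):
    """Number of features with at least one read."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon entropy in bits (log base 2)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def simpson(counts):
    """Simpson's index expressed as 1 - sum(p_i ** 2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Hypothetical sample: four abundant features plus twenty singletons.
with_singletons = [5000, 3000, 1500, 500] + [1] * 20
without_singletons = [c for c in with_singletons if c > 1]

for label, counts in [("with", with_singletons), ("without", without_singletons)]:
    print(
        f"{label} singletons: observed={observed_features(counts)}, "
        f"shannon={shannon(counts):.3f}, simpson={simpson(counts):.4f}"
    )
```

In this toy sample the observed count drops from 24 to 4 when singletons are removed, while Shannon and Simpson barely move, because both indices weight features by their relative abundance.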


Also, re: not achieving species-level classification, you might want to check out this great post by @jwdebelius, which gives a more in-depth description of the various factors at play here.


Thank you for your detailed answer. I appreciate it.
