Weird result from analyzing a public rumen microbiome dataset by Qiime2

buritta · August 9, 2022, 10:03pm

Hi everyone in the Qiime2 community,

I'm a Qiime2 user who encountered some weird problems when trying to reuse a rumen microbiome public dataset. I would really appreciate if I can discuss what I'm facing now with you.

Here's the background, the dataset I'm using is from "A heritable subset of the core rumen microbiome dictates dairy cow productivity and emissions (https://www.science.org/doi/10.1126/sciadv.aav8391) and I'm trying to reuse their paired-end sequenced16s rRNA sequences to perform the taxonomic classification using Qiime2.

I downloaded the SRA files and each of them is for a specific domain, i.e., bacteria, archaea. After running FastQC for each FASTQ file I converted from the SRA files, the "Per base sequence quality" plot looks super weird, which has an arch shape.

In the paper, the authors describe how the amplicon sequences were initially processed with OBITools. With the above picture and some other clues, I infer that each FASTQ file should contains the joined paired-end reads produced by the function "obijoinpairedend" or "illuminapairedend" from OBOTools. So I decided to use DADA2 single-end sequence denoising, instead of paired-end though I'm not sure if I can just treat the files as single-end sequence files.

The denoising stats look fine to me. Here is the qiime visualization file I got from the bacterial sequence files.

bacteria_denoising_stats.qzv (1.2 MB)

For the taxonomic assignment, I used the pretrained gg-13-8-99-515-806-nb-classifier.qza from Qiime2. In the end, I got 30924 entries for all the baterial sequence files (from over 200 samples), as you can see from the figure as well as the file. However, if you look into the file, only less than 100 entries have genus-level resolution and all of them are even from the same genus.

bateria_taxonomy.qzv (3.9 MB)

So now I'm really confused. Is the dataset simply not usable? Or did I miss out something? What do you think?

Kind regards,
Buritta

llenzi · August 11, 2022, 8:55am

Hi @buritta.
welcome in the forum!

I agree, the quality profile strongly suggest these sequences are post processed and the quality values were changed during the process. The dada2 denoiser do not like altered qc scores!
Altough it may not change the final outcome, I suggest to use deblur in your case.
The second deblur benefit is that applys a prefilter to exclude any sequences not rougthly similar to any in the 16S database, which may clean your data by excluding contamination, but form its stat you may have an idea on how many things are not basacteria in there.

What did you use to perform taxonomy classification (sklearn vsearch or blast)?
You seem to have used a classifier obtained by extracting reads form the basis 515 to 806, in the 16S gene . Do you know if this is the same region amplified in the experiment? If not, it may explain the odd assignments. If the amplified region is not clear from the paper, the easier way it to use a classifier obtained form the whole gene!

Best wishes,
Luca

buritta · August 25, 2022, 1:11pm

Hi @llenzi,

Thanks for your insights on the benefits of using deblur.

For the classifier, I used the sklearn one pretrained form the basis 515 to 806 indeed and that was also the reason of the problem. The original paper doesn't clarify which region of 16S rRNA was used but now I have switched to the classifier trained from the whole gene as you suggested and the number of ASVs I got now seems normal. So your guesses are all correct and are very helpful. Thanks again!

Kind regards,
Buritta