I am using QIIME 2 to generate taxa barplots. Very few features are identified in my barplots, although my table.qza contains 425 ASVs. It looks like this:
At first I thought there was an issue with my samples. But I had already used half of the samples from this table.qza, together with some other samples, in another analysis, and that barplot looked fine.
I am using an NCBI classifier for this. The barplot looks fine when I use the SILVA or GTDB classifier. Please suggest what could be going wrong.
Interesting! I'm not sure what's going on here...
Can you post the QIIME 2 commands you used to classify your reads with the NCBI, SILVA, and GTDB databases? What do the barplots look like when using SILVA or GTDB? I know SILVA is built to classify 16S / 18S data, so it's possible it performs much better than the general-purpose NCBI database.
While I'm asking you a bunch of questions, what kind of data are you using? 16S V4 or something else?
I am working on gut microbiota and using 16S V3-V4 reads. These are the commands I used to classify the reads with the NCBI, GTDB and SILVA databases, respectively:
qiime feature-classifier classify-sklearn --i-reads rep-seqs.qza --i-classifier ncbi_341F805R_classifier.qza --p-n-jobs 6 --o-classification taxonomy.qza
qiime feature-classifier classify-sklearn --i-reads rep-seqs.qza --i-classifier GTDBclassifierV3V4.qza --p-n-jobs 6 --o-classification taxonomy.qza
qiime feature-classifier classify-sklearn --i-reads rep-seqs.qza --i-classifier silva_99_341F805R_classifier.qza --p-n-jobs 6 --o-classification taxonomy.qza
barplots obtained using GTDB:
barplots obtained using SILVA132:
barplots obtained using NCBI:
Thanks! Yeah, it looks like GTDB and SILVA are working a lot better than NCBI...
Did you build ncbi_341F805R_classifier.qza with a tool like RESCRIPt or get it from a colleague? It's not on the data resources page, so I'm curious about how it was built and why it may be worse than the other two.
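One small thing that might be worth ruling out: the three classify-sklearn commands above all write to taxonomy.qza, so if they were run in the same directory, each run would overwrite the previous result and it becomes easy to plot the wrong taxonomy. Here is a minimal sketch that gives each classifier its own output file. The classify_all helper, the taxonomy_ prefix and the QIIME override variable are made up for illustration; the flags are the ones from your commands.

```shell
# Hypothetical helper: classify rep-seqs.qza once per classifier,
# writing each result to its own taxonomy_<classifier>.qza file.
# Set QIIME=echo to preview the commands without running them.
classify_all() {
  for clf in "$@"; do
    name=$(basename "$clf" .qza)
    ${QIIME:-qiime} feature-classifier classify-sklearn \
      --i-reads rep-seqs.qza \
      --i-classifier "$clf" \
      --p-n-jobs 6 \
      --o-classification "taxonomy_${name}.qza"
  done
}
```

You would call it as classify_all ncbi_341F805R_classifier.qza GTDBclassifierV3V4.qza silva_99_341F805R_classifier.qza and then build each barplot from the matching taxonomy_*.qza file.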
I built the ncbi_341F805R_classifier.qza using RESCRIPt.
I used the same NCBI classifier for another set of data and it worked fine.
I thought maybe the samples I am using have an issue. I am attaching another barplot where sample codes H16-H28 from the above results were used with other samples, and the NCBI classifier was used:
As we can see, we are getting more taxa information here.
ncbi_341F805R_classifier.qza is working much better in that last example.
I'm not sure why it would fail on that first run... Could you post the command you used to train the classifier too? Any other details and clues you can provide are much appreciated!
Let's see if @SoilRotifer (one of the RESCRIPt devs) has seen this issue before.
The following commands were used to build and train the NCBI classifier:
qiime rescript get-ncbi-data --p-query '33175[BioProject] OR 33317[BioProject]' --o-sequences ncbi-refseqs-unfiltered.qza --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza
qiime rescript filter-seqs-length-by-taxon --i-sequences ncbi-refseqs-unfiltered.qza --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza --p-labels Archaea Bacteria --p-min-lens 900 1200 --o-filtered-seqs ncbi-refseqs.qza --o-discarded-seqs ncbi-refseqs-tooshort.qza
qiime rescript filter-taxa --i-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza --m-ids-to-keep-file ncbi-refseqs.qza --o-filtered-taxonomy ncbi-refseqs-taxonomy.qza
qiime rescript evaluate-fit-classifier --i-sequences ncbi-refseqs.qza --i-taxonomy ncbi-refseqs-taxonomy.qza --o-classifier ncbi-refseqs-classifier.qza --o-evaluation ncbi-refseqs-classifier-evaluation.qzv --o-observed-taxonomy ncbi-refseqs-predicted-taxonomy.qza
qiime feature-classifier extract-reads --i-sequences ncbi-refseqs.qza --p-f-primer CCTACGGGNGGCWGCAG --p-r-primer GACTACHVGGGTATCTAATCC --o-reads ncbi_341F805R.seqs.qza
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads ncbi_341F805R.seqs.qza --i-reference-taxonomy ncbi-refseqs-taxonomy.qza --o-classifier ncbi_341F805R_classifier.qza --verbose &> ncbi_341F805R_classifier.log
If there were a problem with the classifier, I think it would not have worked for the other datasets. If there were a problem with the samples, then the samples with codes H16-H28 in the 3rd and 4th figures should have shown the same taxa information. I thought maybe I was missing some steps, so I redid the entire analysis from the beginning, but still got the same results.
I am more interested in using the NCBI database for all my datasets, as it gives more species-level information.
I have been stuck on this step for 3 days. Please suggest.
I am attaching another barplot where I analyzed the samples with codes starting with A, N and H from fig 4 (i.e. the ones with very little taxa information) together with some other samples (i.e. with codes X):
As we can see, there is more taxa information here. But I am not able to understand why there is an issue when I analyze these samples alone.
Wait, this is a key detail that I missed before!
Is that last plot using the exact same pipeline as the one that is all s__batumici, just with extra samples added before processing? If so, that's a very strange result that we should look into further...
Could you post those two .qzv files so we could take a look at the provenance and logs inside?
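While you dig those up, a quick way to compare the two runs is to export each taxonomy and count how many ASVs received each label. This is a rough sketch, assuming the usual exported taxonomy.tsv layout (tab-separated, one header line, columns Feature ID, Taxon, Confidence); tally_taxa is just a name I made up.

```shell
# Hypothetical helper: count how many ASVs were assigned to each taxon label.
# Assumes a tab-separated taxonomy.tsv with a single header line and the
# taxon string in the second column (Feature ID, Taxon, Confidence).
tally_taxa() {
  awk -F'\t' 'NR > 1 { n[$2]++ } END { for (t in n) printf "%d\t%s\n", n[t], t }' "$1" |
    sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

After running qiime tools export --input-path taxonomy.qza --output-path exported, calling tally_taxa exported/taxonomy.tsv lists the most common labels first. If nearly all 425 ASVs share a single label in one run but not in the other, that would point at the classification step rather than at the barplot.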
The last plot was prepared using the exact same pipeline as the one with all s__batumici, with extra samples added before processing. For my analysis, I am currently doing ID-based filtering from the combined dataset and using the filtered table.qza for further analysis.
I deleted the earlier files by mistake. I will redo the analysis separately and post the .qzv files.
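A minimal sketch of such an ID-based filtering step, assuming a metadata file that lists the sample IDs to keep. All file names here (combined-table.qza, combined-rep-seqs.qza, subset-*.qza, ids-to-keep.tsv) and the filter_subset helper are hypothetical, and the QIIME variable is only there so the commands can be previewed with QIIME=echo.

```shell
# Sketch of ID-based filtering from a combined dataset (file names are hypothetical).
# Set QIIME=echo to preview the commands without running them.
filter_subset() {
  ids="$1"   # metadata file whose first column lists the sample IDs to keep

  # Keep only the listed samples in the feature table.
  ${QIIME:-qiime} feature-table filter-samples \
    --i-table combined-table.qza \
    --m-metadata-file "$ids" \
    --o-filtered-table subset-table.qza

  # Keep only the representative sequences still present in the filtered table.
  ${QIIME:-qiime} feature-table filter-seqs \
    --i-data combined-rep-seqs.qza \
    --i-table subset-table.qza \
    --o-filtered-data subset-rep-seqs.qza
}
```

Calling filter_subset ids-to-keep.tsv would produce a table and a matching set of representative sequences for just those samples.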