Problem getting taxonomic analysis with consensus BLAST to work on large FASTQ files


I have been using QIIME 2 for a long time, but I still consider my skills close to beginner level.

I am using QIIME 2 2024.2 on an Apple M1 Max with 64 GB of memory for 16S rRNA gene amplicon sequencing (V3-V4) of mouse fecal samples.
Each file, which I obtained from a collaborator, is extremely large: 70 to 90 megabytes. We are used to 2 MB files.

The script is one we have used for some time with human oral samples with no problems. Shown below is the script that worked, more or less, with 6 of the mouse fecal samples.

The first problem is that I could not get DADA2 denoising to work unless I set trim-left to 35 and trunc-len to 238. Larger fragments did not work.
I then assigned taxonomy on 6 samples using classify-consensus-blast with the SILVA database, and it worked after several hours. Some samples had as many as 300,000 reads.
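
To see how many reads each file holds before denoising, one option is to count the lines of the decompressed FASTQ and divide by four. This is only a sketch, assuming standard 4-line FASTQ records and gzip on the PATH; the function name count_reads is illustrative and not part of QIIME 2.

```shell
# Print the number of reads in a gzipped FASTQ file.
# Assumes each read occupies exactly 4 lines (header, sequence, '+', qualities).
count_reads() {
    echo "$(( $(gzip -cd "$1" | wc -l) / 4 ))"
}

# Example (run inside the directory holding the FASTQ files):
# for f in *.fastq.gz; do printf '%s\t%s\n' "$f" "$(count_reads "$f")"; done
```

Comparing these counts with the usual 2 MB files gives a rough idea of how much subsampling might be needed.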

I was never able to get taxonomy assignment to finish on the full data set of 35 samples, even after waiting 48 hours.

Out of desperation I tried shortening the FASTQ files with the sed command "sed -i.backup '30001,$ d' C57F4_S4_L001_R1_001.fastq.gz", but then got the error below due to loss of important text:
An unexpected error has occurred:
Compressed file ended before the end-of-stream marker was reached
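
For what it's worth, this error is what one would expect when sed edits the compressed bytes of a .gz file directly rather than the FASTQ text inside it: the gzip stream gets cut off partway. A minimal sketch of truncating a gzipped FASTQ safely, assuming gzip and head are available, is to decompress first, keep whole 4-line records, and recompress. The function name head_fastq_gz and the output path are made up for illustration.

```shell
# Keep only the first N reads of a gzipped FASTQ file.
# Arguments: input .fastq.gz, output .fastq.gz, number of reads to keep.
# The text is truncated after decompression, so the output is a valid gzip file.
head_fastq_gz() {
    gzip -cd "$1" | head -n "$(( $3 * 4 ))" | gzip -c > "$2"
}

# Example: keep 7500 reads (30000 lines), as the sed command above intended.
# The output directory "subset" is a placeholder and must already exist.
# head_fastq_gz C57F4_S4_L001_R1_001.fastq.gz subset/C57F4_S4_L001_R1_001.fastq.gz 7500
```

Note that this keeps the first reads in the file rather than a random sample, so it may not be representative of the whole run.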

I attach the taxa bar plot from the 6-sample run. I also attach supporting files from the run of all 35 samples, which did not complete the consensus BLAST step. Would you be able to tell me a way to get taxonomy assignment to work with the full dataset?

There are no error messages; it just never finishes. Would I be better off trying another method for the taxonomy or denoising, or switching to a smaller 16S rRNA reference database?



Guy Adami, PhD
University of Illinois Chicago


qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path 16S-ExperimentF \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-single-end-35-238234.qza

qiime demux summarize \
  --i-data demux-single-end-35-238234.qza \
  --o-visualization demux-single-end-35-238234.qzv

qiime tools view demux-single-end-35-238234.qzv

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-single-endF.qza \
  --p-trim-left 35 \
  --p-trunc-len 238 \
  --o-representative-sequences rep-seqs-dada2-35_238.qza \
  --o-table table-dada2-35_238.qza \
  --o-denoising-stats stats-dada2-35_238.qza

qiime feature-classifier classify-consensus-blast \
  --i-query rep-seqs-dada2-35_238.qza \
  --i-reference-reads /Volumes/LaCie/Silva_XXReferences/silva-138-99-seqs.qza \
  --i-reference-taxonomy /Volumes/LaCie/Silva_XXReferences/silva-138-99-tax.qza \
  --p-perc-identity 0.98 \
  --o-classification classification-Silva-35-238.qza \
  --o-search-results search-FORWARD-35-238.qza

qiime metadata tabulate \
  --m-input-file classification-Silva-35-238.qza \
  --o-visualization classification-Silva-35-238.qzv

qiime feature-table summarize \
  --i-table table-dada2-35_238.qza \
  --o-visualization table_35-238.qzv \
  --m-sample-metadata-file MD-PaolaStest.txt

qiime taxa barplot \
  --i-table table-dada2-35_238.qza \
  --i-taxonomy classification-Silva-35-238.qza \
  --m-metadata-file MD-PaolaStest.txt \
  --o-visualization taxa-bar-plots-35-238.qzv

taxa-bar-plots-35-238._6samplerun.qzv (395.4 KB)
table_35-238.qzv (420.1 KB)
table-dada2-35_238.qza (41.5 KB)
classification-Silva-35-238.qza (96.6 KB)
search-FORWARD-35-238.qza (127.0 KB)


Recently we got a dataset with more than 1,000,000 reads per sample. Before proceeding with downstream analyses, we subsampled the reads to a fraction of 0.3 after importing them into QIIME 2 (choose the fraction based on your data). Check the documentation to find the right plugin.
After subsampling, remove primers with cutadapt and proceed with DADA2 using parameters of your choice.
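
The order of steps described above could be sketched roughly as follows for single-end data. This is a sketch under assumptions, not a tested pipeline: the function name, the intermediate file names, and the choice of --p-front for the primer are illustrative, and the fraction, primer sequence, and truncation length must come from your own data and protocol.

```shell
# Sketch: subsample, remove the forward primer, then denoise single-end reads.
# Arguments: demux artifact, fraction to keep, forward primer sequence,
# DADA2 truncation length. All output file names are placeholders.
subsample_trim_denoise() {
    qiime demux subsample-single \
        --i-sequences "$1" \
        --p-fraction "$2" \
        --o-subsampled-sequences demux-subsampled.qza &&
    qiime cutadapt trim-single \
        --i-demultiplexed-sequences demux-subsampled.qza \
        --p-front "$3" \
        --o-trimmed-sequences demux-trimmed.qza &&
    qiime dada2 denoise-single \
        --i-demultiplexed-seqs demux-trimmed.qza \
        --p-trunc-len "$4" \
        --o-representative-sequences rep-seqs.qza \
        --o-table table.qza \
        --o-denoising-stats stats.qza
}

# Example call; YOUR_FORWARD_PRIMER is a placeholder, not a real sequence:
# subsample_trim_denoise demux.qza 0.3 YOUR_FORWARD_PRIMER 238
```

Each step only runs if the previous one succeeded, so a failure in subsampling or trimming stops the chain before DADA2 starts.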


Hi Timur,
Thanks for your advice about using demux subsample to randomly remove a large percentage of the reads after importing into QIIME 2. The subsequent steps then worked well, including classification against the SILVA reference data.

qiime demux subsample-single \
  --i-sequences demux-single-endO2.qza \
  --p-fraction 0.1 \
  --o-subsampled-sequences demuxsubsample16one.qza

If I may ask a final question: is it okay to use consensus BLAST with the Greengenes2 reference database?

Thanks again for your great advice.



Glad that it worked!

It should work fine with the GG2 database.
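
In practice, only the two reference inputs should need to change relative to the SILVA command earlier in the thread. A minimal sketch, assuming the Greengenes2 reference sequences and taxonomy have already been obtained as QIIME 2 artifacts; the function name and all file names in the example call are placeholders.

```shell
# Sketch: run classify-consensus-blast against an arbitrary reference.
# Arguments: query rep-seqs, reference reads, reference taxonomy,
# output classification, output search results.
# The 0.98 identity threshold is carried over from the SILVA run above.
classify_blast() {
    qiime feature-classifier classify-consensus-blast \
        --i-query "$1" \
        --i-reference-reads "$2" \
        --i-reference-taxonomy "$3" \
        --p-perc-identity 0.98 \
        --o-classification "$4" \
        --o-search-results "$5"
}

# Example with placeholder Greengenes2 artifact names:
# classify_blast rep-seqs.qza gg2-seqs.qza gg2-tax.qza classification-gg2.qza search-gg2.qza
```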