Long time running the qiime feature-classifier classify-consensus-vsearch command

Hello everyone!

I am completely new to QIIME 2, but so far I'm loving it. I've been running some commands on it, and it seems to be pretty smooth. My data is divided into three cohorts: C. difficile-positive patients (n=176), C. difficile-negative patients with diarrhea (n=176), and 176 healthy controls taken from SRA (a Japanese cohort of healthy individuals). In total I have around 14 GB of data. I am currently doing a microbiome analysis of the 16S rRNA V3-V4 region, sequenced on an Illumina MiSeq with 300 bp paired-end reads. My data was demultiplexed by the sequencing centre. So far I've done the following:

  1. Created a metadata file for the fastq names and individual information.

  2. Created a fastq_manifest.csv file for my fastq reads, given that the format is a bit different from EMP and Casava 1.8 (a sketch of the manifest layout follows the step list); command:
    qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path fastq_manifest.csv --output-path pe_demux.qza --input-format PairedEndFastqManifestPhred33

  3. Joined/merged read pairs with vsearch; command:
    qiime vsearch join-pairs --i-demultiplexed-seqs pe_demux.qza --o-joined-sequences joined_reads

  4. Read visualization; command:
    qiime demux summarize --i-data joined_reads.qza --o-visualization joined_reads.qzv

  5. Data denoising; command:
    qiime deblur denoise-16S --i-demultiplexed-seqs joined_reads.qza --p-trim-length 265 --p-sample-stats --o-representative-sequences reps_seqs.qza --o-table table.qza --o-stats deblur_stats.qza

5.1 Dereplication (I decided to try both denoising and dereplication in parallel); command:
qiime vsearch dereplicate-sequences joined_reads.qza --o-dereplicated-table dereplicated_vsearch_table.qza --o-dereplicated-sequences dereplicated_sequences_vsearch.qza

6. Denoising visualization; command:
qiime feature-table tabulate-seqs --i-data reps_seqs.qza --o-visualization reps_seqs.qzv

7. Clustering; command:
qiime vsearch cluster-features-de-novo --i-sequences dereplicated_sequences_vsearch.qza --i-table dereplicated_vsearch_table.qza --p-perc-identity 0.99 --o-clustered-table clustered_table_denovo_otu_99.qza --o-clustered-sequences clustered_sequences_denovo_otu_99.qza

and finally

  8. Taxonomy classification; command:
    qiime feature-classifier classify-consensus-vsearch --i-query non_chimeras.qza --i-reference-reads SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_99_16S.qza --i-reference-taxonomy SILVA_132_QIIME_release/taxonomy/16S_only/99/16S_all_levels.qza --o-classification majority_taxonomy_99_98.qza --p-perc-identity 0.98 --verbose
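
For reference, here is a minimal sketch of the fastq_manifest.csv layout from step 2 (the PairedEndFastqManifestPhred33 format expects a header row plus one forward and one reverse row per sample; the sample ID and paths below are placeholders, not my real files):

    sample-id,absolute-filepath,direction
    sample-1,/absolute/path/to/sample-1_R1.fastq.gz,forward
    sample-1,/absolute/path/to/sample-1_R2.fastq.gz,reverse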

The problem is that in step 8, during the taxonomy classification, the program has been running for about 36 hours and has only reached 2%. It shows this:

Reading file /var/folders/cw/y8fqs65s2cn_hgmgkk574ryr0000gn/T/qiime2-archive-1zgtaz6z/ab21fea4-97a2-4b3f-b8a1-e95ce90b64bf/data/dna-sequences.fasta 100%  

521145303 nt in 369953 seqs, min 900, max 2961, avg 1409

Masking 100%

Counting k-mers 100%  

Creating k-mer index 100%  

Searching 2%

Am I doing something wrong here?

Thanks a lot! :smiley:


Thanks for all the diagnostic info @daniel.castanedamogo!

Here's your problem:

    521145303 nt in 369953 seqs, min 900, max 2961, avg 1409

That's a lot of seqs, and very long seqs at that. These are all around full-length 16S... doing a global alignment of 369953 full-length 16S query seqs against SILVA's full-length seqs (probably a similar number) will take quite a bit of time.
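
To put a rough number on it: at 2% searched in 36 hours, a straight extrapolation gives 36 / 0.02 = 1800 hours, or roughly 75 days for the full search at this rate.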

But you can do it faster. First off, I see an issue here:

I am guessing this is why your query seqs appear to be full-length 16S (I don't actually see where non_chimeras.qza came from in your workflow, so I'm not sure what you are really classifying, but that's my guess). A couple of issues:

  1. open-ref OTU clustering performs closed-ref OTU clustering as a first step, clustering your query seqs into full-length 16S reference seqs, which then become the representative seqs for those OTUs. This is why you have full-length query seqs: they are actually the SILVA reference sequences that your reads clustered into.
  2. You do not need to do OTU clustering of any type after denoising... deblur already denoises and dereplicates these reads, so OTU clustering (while optional) will just reduce the amount of information you have, i.e., by collapsing ASVs (amplicon sequence variants) into OTUs. You probably know this, but I just want to make sure it is intentional.

Not clustering would mean you have shorter (V3-V4) ASVs, which would classify much faster.

Second, you could use the --p-threads parameter with classify-consensus-vsearch to parallelize this job and speed things up N-fold.
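
For example, here is a sketch reusing the file names from your steps above (the thread count of 8 is just a placeholder; set it to however many cores you have available):

    qiime feature-classifier classify-consensus-vsearch --i-query reps_seqs.qza --i-reference-reads SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_99_16S.qza --i-reference-taxonomy SILVA_132_QIIME_release/taxonomy/16S_only/99/16S_all_levels.qza --p-perc-identity 0.98 --p-threads 8 --o-classification majority_taxonomy.qza --verbose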

It sounds like you have a large number of very long sequences, so the long runtime is not unexpected, but the steps advised above will reduce the wait.

Good luck!


Oh god! Ok, I see that I was running an extra clustering step that was redundant. Also, yes, I forgot to list a step I ran before the taxonomy classifier:

7.1 qiime vsearch uchime-denovo --i-table clustered_table_denovo_otu_99.qza --i-sequences clustered_sequences_denovo_otu_99.qza --o-chimeras chimeras --o-nonchimeras non_chimeras --o-stats stats_chimeras

7.2 qiime metadata tabulate --m-input-file stats_chimeras.qza --o-visualization stats_chimeras.qzv

I will rerun this without the clustering step (step 7), jump straight to the taxonomy classifier, and see if my reads have a more expected size (~450 bp). Also, I read somewhere that chimera removal already occurs in Deblur; should I still run steps 7.1 and 7.2, or would that also be redundant for my taxonomy classification?

Thanks in advance,

Daniel

Sounds good! You will have more reads (since they won't be clustered at 99%) but they will be shorter, so that will help.

That's right, deblur already does chimera filtering, so 7.1 is unnecessary even if you do want to cluster your ASVs.

Taxonomy classification will still take some time (you have a lot of reads from a bunch of studies!) but hopefully a lot less, especially if you use --p-threads.

Hi again! It's me... Thank you for your fast and helpful responses! I keep having the same issue when I run the taxonomy classification... I have decided to run everything again from the start. These are my commands so far:

Joining reads

  1. qiime vsearch join-pairs --i-demultiplexed-seqs ../all_together/pe_demux.qza --o-joined-sequences joined_reads

Visualizing joined reads
2. qiime demux summarize --i-data joined_reads.qza --o-visualization joined_reads.qzv

Apply quality score filter
3. qiime quality-filter q-score-joined --i-demux joined_reads.qza --o-filtered-sequences joined_qfilt_seqs --o-filter-stats joined_qfilt_stats --verbose

Deblur denoising
4. qiime deblur denoise-16S --i-demultiplexed-seqs joined_qfilt_seqs.qza --p-trim-length 270 --p-sample-stats --o-representative-sequences representative_seqs --o-table representative_table --o-stats deblur_stats --verbose

Visualize representative sequences
5. qiime feature-table tabulate-seqs --i-data representative_seqs.qza --o-visualization representative_seqs.qzv

Run taxonomy classifier
6. qiime feature-classifier classify-consensus-vsearch --i-query representative_seqs.qza --i-reference-reads ../all_together/SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_99_16S.qza --i-reference-taxonomy ../all_together/SILVA_132_QIIME_release/taxonomy/16S_only/99/16S_all_levels.qza --o-classification majority_taxonomy_99_97.qza --p-perc-identity 0.97 --verbose --p-threads 8

The output from the last step was the same as in my first description:
521145303 nt in 369953 seqs, min 900, max 2961, avg 1409

As can be seen, I have removed the vsearch chimera-removal step as well as the clustering step I had before, but the average read length still seems to be 1409 bp, when I am expecting something like 450 bp. I have uploaded the visualization files in case they clarify things... Thank you so much for your patience!

fastq_manifest.csv (135.7 KB) joined_reads.qzv (301.5 KB) representative_seqs.qzv (913.5 KB)

According to your representative_seqs.qzv, the length of all seqs is 270 nt (as you would expect after trimming with deblur), so either

  1. you have mixed up your files and representative_seqs.qza is the old clustered file; check the provenance in q2view to be sure (see the note after this list),
    OR
  2. this line does not really mean min/max/avg seq length but something else:

    521145303 nt in 369953 seqs, min 900, max 2961, avg 1409

You could check with the VSEARCH developers to confirm the latter; if you do, please report back here!
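
For the first check, one quick sanity test is qiime tools peek, which prints an artifact's UUID, type, and format so you can confirm the file is the one you think it is (the full provenance still needs q2view):

    qiime tools peek representative_seqs.qza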

Yes, I am trying to figure it out too; my file seems to be the correct one… I noticed someone else had exactly the same numbers as mine in this thread:

cluster-features-open-reference Memory / disk space error

so either we have the same sequences with the same descriptive statistics, or perhaps it is describing the statistics of the SILVA file… which is weird, because when I computed the average read length it came out as 1448 bp, not 1409. I'll keep looking and get back to you.

Okay, thanks for the link: the full log there makes it clear that those counts are for the reference DB, not the query seqs.

My advice above still stands, though:

  1. if you use closed-ref clustering, your rep seqs become those full-length reference seqs, so dropping open/closed-ref clustering from your workflow will leave you with more seqs, but shorter, quicker-to-classify ASVs.
  2. using multiple threads (if you can) will cut down time substantially.

Looks like you have 6280 ASVs… I suggest plugging those in and seeing where it takes you.
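
If you want to double-check that feature count against your table, something like this should work (sample-metadata.tsv here is a placeholder for whatever you named your metadata file):

    qiime feature-table summarize --i-table representative_table.qza --m-sample-metadata-file sample-metadata.tsv --o-visualization representative_table.qzv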

Good luck!

Yes! That helped a lot! Turning on the threads reduced my running time significantly! I also ran it with BLAST, without setting any threads, and I am assuming it was much faster because, unlike vsearch, BLAST performs a local alignment. Thanks a lot! :smiley:
