I am completely new to QIIME 2, but so far I’m loving it. I’ve been running some commands on it and it has been pretty smooth. My data is divided into three cohorts: C. difficile-positive patients (n=176), C. difficile-negative patients with diarrhea (n=176), and 176 healthy controls taken from the SRA, from a Japanese cohort of healthy individuals. In total I have around 14 GB of data. I am doing a microbiome analysis of the 16S rRNA V3-V4 region, sequenced on an Illumina MiSeq in 2x300 bp paired-end mode. My data was demultiplexed by the sequencing centre. So far I’ve done the following:
Created a metadata file for the fastq names and individual information.
Created a fastq-manifest.csv file for my fastq reads, since the format is a bit different from the EMP and Casava 1.8 formats; command: qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path fastq_manifest.csv --output-path pe_demux.qza --input-format PairedEndFastqManifestPhred33
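For reference, the manifest for PairedEndFastqManifestPhred33 is a plain CSV with one row per read file; the sample IDs and paths below are placeholders, not my real ones:

```
sample-id,absolute-filepath,direction
sample-1,/data/fastq/sample-1_R1.fastq.gz,forward
sample-1,/data/fastq/sample-1_R2.fastq.gz,reverse
```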
Joined and merged the read pairs with vsearch; command: qiime vsearch join-pairs --i-demultiplexed-seqs pe_demux.qza --o-joined-sequences joined_reads.qza
5.1 Dereplication (I decided to try both denoising and dereplication, in parallel); command: qiime vsearch dereplicate-sequences --i-sequences joined_reads.qza --o-dereplicated-table dereplicated_vsearch_table.qza --o-dereplicated-sequences dereplicated_sequences_vsearch.qza
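5.2 Denoising via Deblur (the parallel path mentioned above). A sketch of the command for that path; the trim length and file names here are illustrative placeholders, to be adjusted after checking the quality plots, and Deblur expects quality-filtered joined reads as input:

```
qiime deblur denoise-16S \
  --i-demultiplexed-seqs joined_reads.qza \
  --p-trim-length 440 \
  --p-sample-stats \
  --o-representative-sequences deblur_rep_seqs.qza \
  --o-table deblur_table.qza \
  --o-stats deblur_stats.qza
```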
That's a lot of seqs, and very long seqs at that. These are all around full-length 16S... doing global alignment of 369,953 full-length 16S seqs against the SILVA full-length seqs (probably a similar number) will take quite a bit of time.
But you can do it faster. First off, I see an issue here:
I am guessing this is why your query seqs appear to be full-length 16S (I don't actually see where non_chimeras.qza came from in your workflow, so I'm not sure what you are really classifying, but that's my guess). A couple of issues:
open-reference OTU clustering performs closed-reference OTU clustering as a first step, clustering your query seqs into full-length 16S reference seqs, which then become the representative seqs for those OTUs. This is why you have full-length query seqs: they are actually the SILVA reference sequences that your reads clustered into.
You do not need to do OTU clustering of any type after denoising... Deblur already denoises and dereplicates these reads, so OTU clustering (while optional) will just reduce the amount of information that you have, i.e., by collapsing ASVs (amplicon sequence variants) into OTUs. You probably know this, but I just want to make sure it is intentional.
Not clustering would mean you have shorter (V3-V4) ASVs, which would classify much faster.
Second, you could use the --p-n-threads parameter with classify-consensus-vsearch to parallelize this job and speed things up N-fold.
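For example, a threaded classification call might look something like this; the input and reference file names are placeholders, and the thread count should match the cores you have available:

```
qiime feature-classifier classify-consensus-vsearch \
  --i-query deblur_rep_seqs.qza \
  --i-reference-reads silva_seqs.qza \
  --i-reference-taxonomy silva_taxonomy.qza \
  --p-n-threads 8 \
  --o-classification taxonomy.qza
```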
It sounds like you have a large number of very long sequences, so the long runtime is not unexpected. The steps I've advised above will reduce the wait.
Oh god! Ok, I see that I was running an extra clustering step that was redundant with what I already had. Also, yes, I forgot to add a step before the taxonomy classifier:
I will rerun everything without the clustering step I ran (step 7) and jump straight to the taxonomy classifier to see if my reads come out at a more expected size (~450 bp). Also, I encountered in some of my readings that chimera removal already occurs in Deblur; should I still run it like I did in steps 7.1 and 7.2, or would that also be redundant for my taxonomy classification?
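For what it's worth, the ~450 bp expectation checks out with some back-of-the-envelope arithmetic. This assumes the common 341F/805R V3-V4 primer pair (my assumption; substitute your actual primer positions):

```python
# Expected merged read length for a V3-V4 amplicon sequenced 2x300 PE.
# Assumes the widely used 341F/805R primer pair (E. coli positions).
amplicon_len = 805 - 341               # ~464 bp amplicon
read_len = 300                         # MiSeq 2x300 paired-end
overlap = 2 * read_len - amplicon_len  # forward/reverse overlap in bp
print(amplicon_len, overlap)           # 464 136
```

So merged reads around 450-470 bp are what joining should produce, with well over 100 bp of overlap for vsearch to work with.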
Sounds good! You will have more reads (since they won't be clustered at 99%) but they will be shorter, so that will help.
That's right, Deblur already does chimera filtering, so 7.1 is unnecessary even if you do want to cluster your ASVs.
Taxonomy classification will still take some time (you have a lot of reads from a bunch of studies!) but hopefully a lot less, especially if you use --p-n-threads.
Hi again! It's me... Thank you for your fast and helpful responses! I keep running into the same issue with the taxonomy classification... I have decided to rerun everything from the start. These are my commands so far:
The output from the last step (from the first lines of the log) was: 521145303 nt in 369953 seqs, min 900, max 2961, avg 1409
As you can see, I have removed the vsearch chimera removal as well as the clustering step I had added, but the average read length still seems to be 1409 bp, when I am expecting something like 450 bp. I have uploaded the visualization files in case they clarify things... Thank you so much for your patience!
Yes, I am trying to figure it out too, my file seems to be the correct one… I noticed there’s someone else who had the exact same numbers as me from this thread:
so either we have the same sequences with the same descriptive statistics, or perhaps the line is describing the statistics of the SILVA file… which is weird, because when I computed the average read length it came out to 1448 bp, not 1409. I’ll keep looking and get back to you.
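As a quick sanity check (my own arithmetic, not from the log), the numbers in that log line are internally consistent with each other, which supports the idea that it describes a different file than the one averaging 1448 bp:

```python
# vsearch log line: "521145303 nt in 369953 seqs, min 900, max 2961, avg 1409"
total_nt = 521_145_303
n_seqs = 369_953
avg = total_nt / n_seqs
print(round(avg))  # 1409, matching the log's rounded average (not 1448)
```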
Okay, thanks for the link: the full log there makes it clear that those counts are for the reference DB, not your query seqs.
My advice above still stands, though:
if you use closed-reference clustering, your rep seqs become those full-length reference seqs. Dropping open/closed-reference clustering from your workflow will leave you with more, but shorter, quicker-to-classify ASVs.
using multiple threads (if you can) will cut down time substantially.
Looks like you have 6280 ASVs… I suggest plugging those in and seeing where they take you.
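If you want to double-check the rep-seq lengths before kicking off classification, tabulating them gives you a quick length summary; the file names below are placeholders for your own artifacts:

```
qiime feature-table tabulate-seqs \
  --i-data deblur_rep_seqs.qza \
  --o-visualization deblur_rep_seqs.qzv
```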
Yes! That helped a lot! Enabling multiple threads reduced my running time significantly! I also ran BLAST, though without setting any threads; I assume it was much faster because, unlike vsearch, BLAST performs a local alignment. Thanks a lot!