Hi all
I’ve been trying to work out a smooth method to analyze a large data set of sequences. In short, it contains 30 samples with 389,340 features and a total frequency of 2,877,374.
I expect this data set to take a while to run through the QIIME 2 workflow, especially the steps needed to generate the tree, but another person was able to run the data through QIIME 1 within a day. The mask step alone has been running for 4 days from the command line.
qiime alignment mafft \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --output-dir alignment \
  --p-n-threads 0 \
  --p-parttree \
  --verbose
That's a lot of features! Perhaps you should review your upstream steps: are you doing OTU clustering or denoising? If denoising, make sure you trimmed off adapters first.
If you feel pretty good about the upstream processing, I recommend at least filtering out low-abundance features, e.g., singletons; use qiime feature-table filter-features to remove these from your feature table, and then filter-seqs to remove those same features from your sequence file. You could also use something like qiime quality-control exclude-seqs to filter out any sequences that do not resemble a set of reference sequences (use a small set of reference sequences for this).
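A minimal sketch of the singleton filter described above, assuming the feature table is called table.qza and the sequences rep-seqs.qza (both file names, and the output names, are placeholders). The small `run` helper is not part of QIIME 2; it only prints each command and executes it when qiime is actually installed, so the sketch can be dry-run anywhere.

```shell
# Placeholder artifact names; replace with your own files.
TABLE=table.qza
SEQS=rep-seqs.qza

# Helper (not part of QIIME 2): print the command, and run it
# only if the qiime executable is available on the PATH.
run() {
  echo "+ $*"
  if command -v qiime >/dev/null 2>&1; then "$@"; fi
}

# Keep features with a total frequency of at least 2 (drops singletons).
run qiime feature-table filter-features \
  --i-table "$TABLE" \
  --p-min-frequency 2 \
  --o-filtered-table table-no-singletons.qza

# Keep only the sequences whose features are still in the filtered table.
run qiime feature-table filter-seqs \
  --i-data "$SEQS" \
  --i-table table-no-singletons.qza \
  --o-filtered-data rep-seqs-no-singletons.qza
```

The threshold of 2 is only the singleton case; a higher --p-min-frequency would remove more low-abundance features.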
This is not really a comparison of QIIME 1 vs. QIIME 2, because you are comparing 2 different datasets and this other person may just have a dataset with fewer unique sequences.
But even so, QIIME 1 and 2 have different alignment methods so differences in runtime would not surprise me...
All in all, this is because you have so many sequences. And I am worried that you may have done something wrong in the upstream steps, because 389,340 features and a total frequency of 2,877,374 is a bit suspicious.
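As a rough back-of-the-envelope check on those numbers (a sketch using only the counts quoted in this thread), the average number of reads per feature and per sample can be computed like this:

```shell
# Counts quoted earlier in this thread.
samples=30
features=389340
total_frequency=2877374

# Average number of reads assigned to each feature.
mean_per_feature=$(awk -v f="$features" -v t="$total_frequency" \
  'BEGIN { printf "%.2f", t / f }')

# Average number of reads per sample.
mean_per_sample=$(awk -v s="$samples" -v t="$total_frequency" \
  'BEGIN { printf "%.2f", t / s }')

echo "mean reads per feature: $mean_per_feature"   # 7.39
echo "mean reads per sample:  $mean_per_sample"    # 95912.47
```

An average of about 7 reads per feature does not prove anything on its own, but it is consistent with a table dominated by low-abundance features.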
Thank you so much for responding. I've had the same concerns.
Based on other forum posts I did think these values were off by quite a bit.
I'll try to be brief in describing the method(s) I've tried.
The sequences I receive are already pre-processed by our Illumina sequencer, so I import them into QIIME 2 using qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-format CasavaOneEightSingleLanePerSampleDirFmt.
qiime demux summarize shows the data is high quality, and I was encouraged to go directly to read joining using qiime vsearch join-pairs, followed by qiime vsearch dereplicate-sequences. Clustering was run using qiime vsearch uchime-denovo, followed by qiime metadata tabulate to prepare for the chimera check. Chimera removal on the uchime results was performed using qiime feature-table filter-features and qiime feature-table filter-seqs.
I have not tried this yet. Would you suggest running this on my data after chimera filtering?
Since you are using an OTU clustering protocol, you should first use qiime quality-filter q-score, and then use qiime feature-table filter-features to filter out low-abundance sequences after chimera checking; see here for a description of both methods and their rationale for use in OTU clustering protocols.
This is not necessary, but it will really help if you suspect you have large amounts of non-target DNA (e.g., host or plant DNA) in your samples, or noisy sequences left over post-OTU clustering. Yes, use it after chimera checking if you plan to do this. Note that it will be another time-consuming step, though not as time-consuming as multiple sequence alignment of superfluous sequences.
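A rough sketch of what that exclude-seqs step could look like, assuming chimera-filtered artifacts named table-nonchimeric.qza and rep-seqs-nonchimeric.qza and a small reference set named ref-seqs.qza (all file names are placeholders). The two thresholds are illustrative values only, not recommendations, and the `run` helper is not part of QIIME 2; it just prints each command and executes it when qiime is installed.

```shell
# Placeholder artifact names; replace with your own files.
QUERY=rep-seqs-nonchimeric.qza
TABLE=table-nonchimeric.qza
REFERENCE=ref-seqs.qza

# Illustrative thresholds only; tune them for your marker gene.
PERC_IDENTITY=0.65
PERC_QUERY_ALIGNED=0.5

# Helper (not part of QIIME 2): print the command, and run it
# only if the qiime executable is available on the PATH.
run() {
  echo "+ $*"
  if command -v qiime >/dev/null 2>&1; then "$@"; fi
}

# Split the query sequences into those that resemble the reference
# (hits) and those that do not (misses).
run qiime quality-control exclude-seqs \
  --i-query-sequences "$QUERY" \
  --i-reference-sequences "$REFERENCE" \
  --p-method vsearch \
  --p-perc-identity "$PERC_IDENTITY" \
  --p-perc-query-aligned "$PERC_QUERY_ALIGNED" \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza

# Keep only the features whose sequences were hits.
run qiime feature-table filter-features \
  --i-table "$TABLE" \
  --m-metadata-file hits.qza \
  --o-filtered-table table-hits.qza
```

hits.qza can then replace the representative sequences in the downstream alignment and tree steps.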
Hi Nicholas, I hope you're doing well!
I wanted to report back after I followed your suggestions, well except for the qiime quality-control exclude-seqs.
I see you are using filter-seqs, but only to remove chimeras; you should also add an abundance filter as described in that article.
Why not use a denoising method instead of OTU clustering? (You don't need to answer that; it's just a rhetorical question implying that denoising should reduce the large number of noisy OTUs instead of relying on an abundance filter to do the trick.)
Honestly, it is because the group I work with has been using OTU clustering with vsearch in the past (specifically with QIIME 1) and requested that I work out a method that relies on vsearch with QIIME 2.
I don't want to get involved in inter-group politics, but let me at least clarify this:
QIIME 1 did not use vsearch; it used usearch (or uclust? I cannot remember). That should be "close enough" methodologically speaking, but it is not "the same thing" if your group is actually trying to replicate the pipeline.