QIIME 1 tree building versus QIIME 2 tree building

(Trenton Kaine O'Neal) #1

Hi all
I’ve been trying to work out a smooth method to analyze a large data set of sequences. In short, it contains 30 samples with 389,340 features and a total frequency of 2,877,374.

I expect this data set to take a while to run through the QIIME 2 workflow, especially the steps needed to generate the tree, but another person was able to run the data through QIIME 1 within a day. The mask step alone has been running for 4 days from the command line.
qiime alignment mafft \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --output-dir alignment \
  --p-n-threads 0

qiime alignment mask \
  --i-alignment aligned-rep-seqs.qza \
  --o-masked-alignment alignment/masked-aligned-rep-seqs.qza

qiime phylogeny fasttree \
  --i-alignment alignment/masked-aligned-rep-seqs.qza \
  --o-tree alignment/unrooted-tree.qza \
  --p-n-threads 0

I’ve also attempted to use qiime fragment-insertion sepp with default parameters, to no avail.

Can anyone suggest what I might be doing incorrectly in comparison to their QIIME 1 run?

Thank you for any suggestions or comments.

(Nicholas Bokulich) #2

Welcome @TKOneal!

That’s a lot of features! Perhaps you should review your upstream steps: are you doing OTU clustering or denoising? If denoising, make sure you trimmed off adapters first.

If you feel pretty good about the upstream processing, I recommend at least filtering out low-abundance features, e.g., singletons; use qiime feature-table filter-features to remove these from your feature table, and then filter-seqs to remove those same features from your sequence file. You could also use something like qiime quality-control exclude-seqs to filter out any sequences that do not resemble a set of reference sequences (use a small set of reference sequences for this).
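As a rough sketch of the filtering steps above: the two shell functions below wrap the commands so that the inputs and outputs are explicit. The artifact names are placeholders, and reading "singletons" as `--p-min-frequency 2` (features with a total count below 2 are dropped) is an assumption; adjust the threshold to your data. The exclude-seqs call uses the plugin defaults, which may be stricter than you want for a small reference set.

```shell
# Sketch only: artifact names are placeholders, thresholds are assumptions.

# filter_low_abundance TABLE SEQS OUT_TABLE OUT_SEQS
filter_low_abundance() {
  # Drop features whose total frequency across all samples is below 2.
  qiime feature-table filter-features \
    --i-table "$1" \
    --p-min-frequency 2 \
    --o-filtered-table "$3" || return 1

  # Keep only the representative sequences still present in the filtered table.
  qiime feature-table filter-seqs \
    --i-data "$2" \
    --i-table "$3" \
    --o-filtered-data "$4"
}

# exclude_non_target SEQS REFERENCE_SEQS OUT_HITS OUT_MISSES
exclude_non_target() {
  # Hits resemble the reference set; misses are candidates for removal.
  qiime quality-control exclude-seqs \
    --i-query-sequences "$1" \
    --i-reference-sequences "$2" \
    --o-sequence-hits "$3" \
    --o-sequence-misses "$4"
}
```

For example, `filter_low_abundance table.qza rep-seqs.qza table-filtered.qza rep-seqs-filtered.qza` would produce the reduced table and sequence file to feed into alignment.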

This is not really a comparison of QIIME 1 vs. QIIME 2, because you are comparing 2 different datasets and this other person may just have a dataset with fewer unique sequences.

But even so, QIIME 1 and 2 have different alignment methods so differences in runtime would not surprise me…

All in all, this is because you have so many sequences. And I am worried that you may have done something wrong in the upstream steps, because 389,340 features and a total frequency of 2,877,374 is a bit suspicious.
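To put a number on why those totals look odd, here is a quick calculation of the mean reads per feature from the figures quoted in this thread. The function name is just for illustration. A low mean does not prove anything by itself, but it is consistent with a table dominated by very rare features.

```shell
# mean_reads_per_feature N_FEATURES TOTAL_FREQUENCY
# Prints the average number of reads per feature, to two decimals.
mean_reads_per_feature() {
  awk -v f="$1" -v t="$2" 'BEGIN { printf "%.2f\n", t / f }'
}

# Numbers from the original post: 389,340 features, total frequency 2,877,374.
mean_reads_per_feature 389340 2877374   # prints 7.39
```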

Let us know what you find!

(Trenton Kaine O'Neal) #3

Thank you so much for responding. I’ve had the same concerns.

Based on other forum posts, I did think these values were off by quite a bit.

I’ll try to be brief in describing the method(s) I’ve tried.
The sequences I receive are already pre-processed by our Illumina sequencer, so I import them into QIIME 2 using qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-format CasavaOneEightSingleLanePerSampleDirFmt.

qiime demux summarize shows the data is high quality, and I was encouraged to go directly to read joining using qiime vsearch join-pairs and qiime vsearch dereplicate-sequences. Clustering was run using qiime vsearch uchime-denovo, followed by qiime metadata tabulate to prepare for the chimera check. Chimera removal on the uchime results was performed using qiime feature-table filter-features and qiime feature-table filter-seqs.

I have not tried the filtering steps you suggested yet. Would you suggest running them on my data after chimera filtering?

Thanks again for any help!

(Nicholas Bokulich) #4

Since you are using an OTU clustering protocol, you should first use qiime quality-filter q-score, and then use qiime feature-table filter-features to filter out low-abundance sequences after chimera checking; see here for a description of both methods and their rationale for use in OTU clustering protocols.

Running qiime quality-control exclude-seqs is not necessary, but it will really help if you suspect you have large amounts of non-target DNA (e.g., host or plant DNA) in your samples, or noisy sequences left over post-OTU clustering. Yes, use it after chimera checking if you plan to do this (note: it will be another time-consuming step, though not as time-consuming as multiple sequence alignment of superfluous sequences).
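The order of steps could look roughly like the sketch below, starting from already-joined reads. The function name, output layout, and file names are assumptions for illustration, and action names vary between releases (older releases used a separate q-score-joined action for joined reads), so check `--help` in your version before running anything. The low-abundance filter with qiime feature-table filter-features would follow on the non-chimeric outputs.

```shell
# Sketch only: paths and names are placeholders; check action names with --help.

# otu_prefilter JOINED_DEMUX OUT_DIR
otu_prefilter() {
  mkdir -p "$2" || return 1

  # 1. Quality-filter the joined reads before dereplication.
  qiime quality-filter q-score \
    --i-demux "$1" \
    --o-filtered-sequences "$2/demux-joined-filtered.qza" \
    --o-filter-stats "$2/demux-joined-filter-stats.qza" || return 1

  # 2. Dereplicate to get a feature table and representative sequences.
  qiime vsearch dereplicate-sequences \
    --i-sequences "$2/demux-joined-filtered.qza" \
    --o-dereplicated-table "$2/table.qza" \
    --o-dereplicated-sequences "$2/rep-seqs.qza" || return 1

  # 3. De novo chimera check; writes chimeras, nonchimeras, and stats artifacts.
  qiime vsearch uchime-denovo \
    --i-table "$2/table.qza" \
    --i-sequences "$2/rep-seqs.qza" \
    --output-dir "$2/uchime" || return 1

  # 4. Keep only the features listed as non-chimeric, in table and sequences.
  qiime feature-table filter-features \
    --i-table "$2/table.qza" \
    --m-metadata-file "$2/uchime/nonchimeras.qza" \
    --o-filtered-table "$2/table-nonchimeric.qza" || return 1

  qiime feature-table filter-seqs \
    --i-data "$2/rep-seqs.qza" \
    --m-metadata-file "$2/uchime/nonchimeras.qza" \
    --o-filtered-data "$2/rep-seqs-nonchimeric.qza"
}
```

Calling `otu_prefilter demux-joined.qza otu-run` would leave the non-chimeric table and sequences in the otu-run directory.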

(Trenton Kaine O'Neal) #5

Thanks for the suggestion, I will give this a go and report back.