QIIME 1 tree building versus QIIME 2 tree building

TKOneal · May 17, 2019, 10:32pm

Hi all
I've been trying to work out a smooth method to analyze a large data set of sequences. In short, it contains 30 samples with 389,340 features and a total frequency of 2,877,374.

I expect this data set to take a while to run through the QIIME 2 workflow, especially the steps needed to generate the tree, but another person was able to run the data through QIIME 1 within a day. The mask step alone has been running for 4 days on the command line.
qiime alignment mafft \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --output-dir alignment \
  --p-n-threads 0 \
  --p-parttree \
  --verbose

qiime alignment mask \
  --i-alignment aligned-rep-seqs.qza \
  --o-masked-alignment alignment/masked-aligned-rep-seqs.qza

qiime phylogeny fasttree \
  --i-alignment alignment/masked-aligned-rep-seqs.qza \
  --o-tree alignment/unrooted-tree.qza \
  --p-n-threads 0

I've also attempted to use qiime fragment-insertion sepp with default parameters, to no avail.

Can anyone suggest what I might be doing incorrectly in comparison to their QIIME 1 run?

Thank you for any suggestions or comments.

Nicholas_Bokulich · May 20, 2019, 12:13pm

Welcome @TKOneal!

That's a lot of features! Perhaps you should review your upstream steps: are you doing OTU clustering or denoising? If denoising, make sure you trimmed off adapters first.

If you feel pretty good about the upstream processing, I recommend at least filtering out low-abundance features, e.g., singletons; use qiime feature-table filter-features to remove these from your feature table, and then filter-seqs to remove those same features from your sequence file. You could also use something like qiime quality-control exclude-seqs to filter out any sequences that do not resemble a set of reference sequences (use a small set of reference sequences for this).
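A minimal sketch of the singleton filter described above. The wrapper function filter_low_abundance is hypothetical and only bundles the two commands; the file names in the usage note are placeholders, not files from this thread:

```shell
# Hypothetical helper: drop features whose total frequency is below a cutoff,
# then keep only the sequences for features that remain in the table.
# Arguments: <table.qza> <rep-seqs.qza> <min total frequency> <output prefix>
filter_low_abundance() {
  table="$1"; seqs="$2"; min_freq="$3"; prefix="$4"

  # Remove features observed fewer than min_freq times across all samples.
  qiime feature-table filter-features \
    --i-table "$table" \
    --p-min-frequency "$min_freq" \
    --o-filtered-table "${prefix}-table.qza" || return 1

  # Keep only the sequences whose feature IDs are still in the filtered table.
  qiime feature-table filter-seqs \
    --i-data "$seqs" \
    --i-table "${prefix}-table.qza" \
    --o-filtered-data "${prefix}-rep-seqs.qza"
}
```

For example, `filter_low_abundance table.qza rep-seqs.qza 2 no-singletons` would remove features seen only once and write no-singletons-table.qza and no-singletons-rep-seqs.qza.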

This is not really a comparison of QIIME 1 vs. QIIME 2, because you are comparing 2 different datasets and this other person may just have a dataset with fewer unique sequences.

But even so, QIIME 1 and 2 have different alignment methods so differences in runtime would not surprise me...

All in all, this is because you have so many sequences. And I am worried that you may have done something wrong in the upstream steps, because 389,340 features and a total frequency of 2,877,374 is a bit suspicious.
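A rough back-of-the-envelope check (not stated in the thread, just one way to read these numbers) is to divide the total frequency by the number of features, which gives the average number of reads per feature:

```shell
# Counts quoted in this thread.
FEATURES=389340
TOTAL_FREQ=2877374

# Mean number of reads per feature, to two decimal places.
MEAN_PER_FEATURE=$(awk -v f="$FEATURES" -v t="$TOTAL_FREQ" 'BEGIN { printf "%.2f", t / f }')
echo "mean reads per feature: $MEAN_PER_FEATURE"
```

An average of roughly 7 reads per feature suggests that most features are very rare, which fits the advice to filter out low-abundance features.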

Let us know what you find!

TKOneal · May 20, 2019, 3:54pm

Thank you so much for responding. I've had the same concerns.

Based on other forum posts, I did think these values were off by quite a bit.

I'll try to be brief in describing the method(s) I've tried.
The sequences I receive are already pre-processed by our Illumina sequencer, so I import them into QIIME 2 using qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-format CasavaOneEightSingleLanePerSampleDirFmt.

qiime demux summarize shows the data is high quality, so I was encouraged to go directly to read joining using qiime vsearch join-pairs and qiime vsearch dereplicate-sequences. Clustering was run using qiime vsearch uchime-denovo, followed by qiime metadata tabulate to prepare for the chimera check. Chimera removal on the uchime results was performed using qiime feature-table filter-features and qiime feature-table filter-seqs.

I have not tried qiime quality-control exclude-seqs yet. Would you suggest running it on my data after chimera filtering?

Thanks again for any help!

Nicholas_Bokulich · May 20, 2019, 4:01pm

Since you are using an OTU clustering protocol, you should first use qiime quality-filter q-score, and then use qiime feature-table filter-features to filter out low-abundance sequences after chimera checking; see here for a description of both methods and their rationale for use in OTU clustering protocols.

qiime quality-control exclude-seqs is not necessary, but it will really help if you suspect you have large amounts of non-target DNA (e.g., host or plant DNA) in your samples, or noisy sequences left over post-OTU clustering. Yes, use it after chimera checking if you plan to do this (note: it will be another time-consuming step, though not as time-consuming as multiple sequence alignment of superfluous sequences).

TKOneal · May 20, 2019, 4:51pm

Thanks for the suggestion, I will give this a go and report back.

TKOneal · May 30, 2019, 3:52pm

Hi Nicholas, I hope you're doing well!
I wanted to report back after I followed your suggestions, well except for the qiime quality-control exclude-seqs.

qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux-filtered.qza \
  --o-joined-sequences joined-seqs.qza

qiime quality-filter q-score-joined \
    --p-min-quality 3 \
    --p-quality-window 3 \
    --p-min-length-fraction 0.75 \
    --p-max-ambiguous 0 \
    --i-demux joined-seqs.qza \
    --o-filtered-sequences filtered-joined-seqs.qza \
    --o-filter-stats filtered-joined-seqs-stats.qza \
    --verbose

qiime vsearch dereplicate-sequences \
  --i-sequences filtered-joined-seqs.qza \
  --o-dereplicated-table vsearch-table.qza \
  --o-dereplicated-sequences vsearch-rep-seqs.qza \
  --verbose

qiime vsearch uchime-denovo \
	--i-table vsearch-table.qza \
	--i-sequences vsearch-rep-seqs.qza \
	--output-dir uchime-dn-out

qiime feature-table filter-features \
	--i-table vsearch-table.qza \
	--m-metadata-file uchime-dn-out/nonchimeras.qza \
	--o-filtered-table uchime-dn-out/filtered-table.qza

qiime feature-table filter-seqs \
	--i-data rep-seqs.qza \
	--m-metadata-file uchime-dn-out/nonchimeras.qza \
	--o-filtered-data uchime-dn-out/rep-seqs-nonchimeric-wo-borderline.qza

qiime feature-table summarize \
	--i-table uchime-dn-out/filtered-table.qza \
	--o-visualization uchime-dn-out/table-nonchimeric-wo-borderline.qzv

I then renamed the filtered-table and nonchimeras.qza to table.qza and rep-seqs.qza.

A check of the two artifacts shows I now have 388,811 features with a total frequency of 2,876,568. This is across 30 of the 32 samples we sequenced.

I'm going to run the quality-control exclude-seqs against the GreenGenes 99% database and see what happens.

Nicholas_Bokulich · May 30, 2019, 4:30pm

no improvement?

you are missing another critical step:

I see you are using filter-seqs but only to remove chimeras; you should also add an abundance filter as described in that article.

why not use a denoising method instead of OTU clustering? (you don't need to answer that; it's just a rhetorical question implying that denoising should reduce the large number of noisy OTUs instead of relying on an abundance filter to do the trick).

TKOneal · May 30, 2019, 7:32pm

Honestly it is because the group I work with has been using OTU clustering with vsearch in the past (specifically with qiime 1) and requested that I work out a method that relies on vsearch with qiime 2.

Nicholas_Bokulich · May 30, 2019, 7:46pm

I don't want to get involved in inter-group politics, but let me at least clarify this:

QIIME 1 did not use vsearch, it used usearch (or uclust? cannot remember). Should be "close enough" methodologically speaking, but not "the same thing" if your group is actually trying to replicate the pipeline.

TKOneal · May 30, 2019, 7:57pm

Right, sorry, they used Uclust. The previous script record I've been given is a mixture of steps used for 16S and ITS plus random notes.

The main qiime 1 workflow seemed to be the following:

multiple_join_paired_ends.py -i R1R2/ -o Joined/

multiple_split_libraries_fastq.py -i Joined/ -o MSLout --include_input_dir_path

pick_open_reference_otus.py -i MSLout/seqs.fna -o uclust_otus/ --suppress_step4

biom summarize-table -i uclust_otus/otu_table_mc2_w_tax_no_pynast_failures.biom

This is what I am attempting to recreate using QIIME 2, as well as updating the method where I can.

TKOneal · May 31, 2019, 9:22pm

Hi Nicholas,

So it turns out I messed up here:

In the qiime feature-table filter-seqs step, my --i-data should have been my vsearch-rep-seqs.qza file instead of rep-seqs.qza. The vsearch-table has 30 samples, 444,775 features, and a total frequency of 3,045,846.

From where I stopped in my previous list of commands, I ran

qiime feature-table filter-seqs \
  --i-data vsearch-rep-seqs.qza \
  --i-table filtered-table.qza \
  --m-metadata-file uchime-dn-out/nonchimeras.qza \
  --o-filtered-data rep-seqs-nonchimeric.qza

qiime feature-table filter-seqs \
  --i-data rep-seqs-nonchimeric.qza \
  --i-table filtered-table.qza \
  --o-filtered-data filtered-rep-seqs.qza

qiime quality-control exclude-seqs \
  --i-query-sequences filtered-rep-seqs.qza \
  --i-reference-sequences gg_99_otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.65 \
  --p-perc-query-aligned 0.60 \
  --p-threads 10 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza \
  --verbose

qiime feature-table filter-features \
  --i-table filtered-table.qza \
  --m-metadata-file hits.qza \
  --o-filtered-table no-hits-filtered-table.qza \
  --p-exclude-ids

After the above steps: 30 samples, 388,811 features, and a total frequency of 2,876,568.

I'm still not sure how to apply the 0.005% abundance filter described in your paper above.

Thanks for all of your help and guidance!

Nicholas_Bokulich · May 31, 2019, 9:26pm

0.005% * 2,876,568, which is about 144.

Use the --p-min-frequency parameter in qiime feature-table filter-features.
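As a worked example of that calculation, here is a minimal sketch. The total frequency is the one reported in this thread; rounding up to a whole read is an assumption, and table.qza and the output name are placeholders:

```shell
# Total frequency reported in this thread after chimera filtering.
TOTAL_FREQ=2876568

# 0.005% of the total, rounded up to a whole read (rounding up is an assumption).
MIN_FREQ=$(awk -v t="$TOTAL_FREQ" 'BEGIN { x = t * 0.005 / 100; m = int(x); if (m < x) m++; print m }')
echo "min frequency: $MIN_FREQ"

# Apply the cutoff only when QIIME 2 and the (placeholder) input table are present.
if command -v qiime >/dev/null 2>&1 && [ -f table.qza ]; then
  qiime feature-table filter-features \
    --i-table table.qza \
    --p-min-frequency "$MIN_FREQ" \
    --o-filtered-table table-abundance-filtered.qza
fi
```

The same filtered table can then be passed to qiime feature-table filter-seqs with --i-table so that the representative sequences match the remaining features.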

Good luck!
