Steps for taxonomy assignment on dual-indexed sequence data generated by QIIME 1.9

ismailp · February 13, 2020, 10:05pm

Hi,

I am trying QIIME 2 on one of my wife's experiments that she had processed with QIIME 1.9 in 2015. She used dual-indexed primers in her experiment, as published in "An improved dual-indexing approach for multiplexed 16S rRNA gene sequencing on the Illumina MiSeq platform" (Fadrosh, D.W., Ma, B., Gajer, P. et al., Microbiome).

I didn't try QIIME 2 for the whole preprocessing because it was unclear to me whether it's easy to do. Instead, I used the joined paired-end reads generated by QIIME 1.9 (that is, a trimmed seqs.fna).

Here's my setup:
% conda --version
conda 4.8.1
% conda env list
# conda environments:
#
base * /home/svcqiime/miniconda3
qiime2-2019.10 /home/svcqiime/miniconda3/envs/qiime2-2019.10

% uname -a
Linux qiimem 5.4.0-050400-generic #201911242031 SMP Mon Nov 25 01:35:10 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

For compatibility reasons, I installed Ubuntu 18.04 LTS on my computer. It's running on bare metal, not under a hypervisor.

% cat /etc/issue
Ubuntu 18.04.4 LTS \n \l

The machine has an AMD Ryzen 9 3950X CPU (16 cores, 32 threads, not overclocked), 32GB RAM, 60GB swap, about 400GB free space on the partition that is supposed to process the data, and a /tmp partition of about 156GB.

The steps I took are from the OTU Clustering tutorial:

Import the QIIME 1.9 processed FASTA:
% qiime tools import --input-path input.fasta \
    --output-path imported.qza --type 'SampleData[Sequences]'

Dereplicate:
% qiime vsearch dereplicate-sequences \
    --i-sequences imported.qza \
    --o-dereplicated-table table.qza \
    --o-dereplicated-sequences rep-seqs.qza

Import SILVA:
% qiime tools import --input-path SILVA_132_QIIME_release/rep_set/rep_set_16S_only/97/silva_132_97_16S.fna --output-path silva_132_97 --type 'FeatureData[Sequence]'
% qiime tools import --input-path SILVA_132_QIIME_release/taxonomy/16S_only/97/taxonomy_7_levels.txt --input-format HeaderlessTSVTaxonomyFormat --output-path silva_132_97_ref_taxonomy --type 'FeatureData[Taxonomy]'

Do open-reference OTU clustering:
% qiime vsearch cluster-features-open-reference --i-table table.qza --i-sequences rep-seqs.qza \
    --i-reference-sequences silva_132_97.qza --p-perc-identity 0.97 \
    --o-clustered-table table-or-97.qza --o-clustered-sequences rep-seqs-or-97.qza \
    --o-new-reference-sequences new-ref-seqs-or-97.qza

Assigning taxonomy:
% qiime feature-classifier classify-consensus-vsearch --i-query rep-seqs.qza \
    --i-reference-reads new-ref-seqs-or-97.qza --i-reference-taxonomy silva_132_97_ref_taxonomy.qza \
    --p-threads 20 --p-maxaccepts all --o-classification consensus-classified.qza

After 5 days and 23 hours, the last plugin filled up the /tmp partition with a 156GB temporary file. vsearch didn't notice that it had actually failed and could not produce any meaningful output, because there was no space left to write to. I ran htop and saw that vsearch continued executing, occupying 20 CPU threads. After briefly looking at its source code, I saw that it doesn't handle file system errors, except for fopen calls.
https://github.com/torognes/vsearch/search?q=fprintf&unscoped_q=fprintf
fclose calls aren't checked either. That's a vsearch bug, and I plan to create an issue on vsearch.

QIIME creates a temporary file name and passes it to vsearch. It might be important to document this behavior, and that the destination directory can be specified externally by setting the TMPDIR, TEMP or TMP environment variables. I am planning to re-run this experiment with TMPDIR set, but I would like to know whether I am taking the right steps up to taxonomy assignment.
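A minimal sketch of pointing temporary files at a larger partition before re-running, assuming the QIIME 2 process honors TMPDIR as described above. The scratch path and the SCRATCH_DIR variable are hypothetical placeholders; adjust them to wherever the free space actually is.

```shell
# Hypothetical scratch location; defaults to a directory under the current one.
scratch_dir="${SCRATCH_DIR:-$PWD/qiime2-tmp}"
mkdir -p "$scratch_dir"

# Export TMPDIR so that child processes started from this shell inherit it.
export TMPDIR="$scratch_dir"

# Check how much space is available there before starting a multi-day job.
df -h "$TMPDIR"
```

The failing command then has to be started from this same shell session, otherwise the exported variable is not inherited.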

% uptime
 23:00:14 up 5 days,  3:59,  6 users,  load average: 0.04, 0.36, 4.87

Here's an excerpt from the file:
% head /tmp/tmp3iq6ls_e
5cc706cbf283249918f00da69c7e32a7a15cf45d LMPG01000001.429.1927 99.8 404 1 0 1 404 1 1472 -1 0
5cc706cbf283249918f00da69c7e32a7a15cf45d FPLS01001796.15.1482 99.5 404 2 0 1 404 1 1468 -1 0
5cc706cbf283249918f00da69c7e32a7a15cf45d GQ385284.1.1439 99.5 404 2 0 1 404 1 1433 -1 0
5cc706cbf283249918f00da69c7e32a7a15cf45d GQ389082.1.1461 99.5 404 2 0 1 404 1 1457 -1 0
5cc706cbf283249918f00da69c7e32a7a15cf45d HG005350.1.1480 99.5 404 2 0 1 404 1 1465 -1 0
5cc706cbf283249918f00da69c7e32a7a15cf45d JN196137.1.1414 99.5 404 2 0 1 404 1 1414 -1 0
5cc706cbf283249918f00da69c7e32a7a15cf45d JN628330.1.1446 99.5 404 2 0 1 404 1 1445 -1 0
5cc706cbf283249918f00da69c7e32a7a15cf45d KM263160.1.1349 99.5 404 2 0 1 404 1 1348 -1 0
5cc706cbf283249918f00da69c7e32a7a15cf45d EU835403.1.1351 99.3 405 2 1 1 404 1 1351 -1 0
5cc706cbf283249918f00da69c7e32a7a15cf45d AY328628.1.1479 99.3 404 3 0 1 404 1 1467 -1 0

I would be very happy if you could help me with the steps I need to take.

Best regards,
Ismail

Nicholas_Bokulich · February 14, 2020, 2:42pm

Welcome to the forum @ismailp!

There are 3 issues with your workflow that would contribute to your problem (running out of tmp space). You are using OTU clustering (instead of more recent denoising methods, which tend to weed out more of the noisy sequences). Using a denoiser would by itself alleviate this, but the point here is that when using OTU clustering you should:

1. Pre-filter the data (prior to dereplication) to remove noisy sequences. Maybe you already did this in QIIME 1 as part of demultiplexing? If not, you should.
2. After OTU clustering, perform abundance-based filtering to remove the low-abundance OTUs, which are usually spurious.

Both of these issues and their solutions are described in this article.
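As a rough sketch of the abundance-based filtering step, something like the following could be run on the clustered outputs from the original post. The threshold of 10 and the output file names are assumptions for illustration, not recommended values, and the guard only keeps the snippet from failing outside a QIIME 2 environment.

```shell
# Assumed threshold: drop OTUs observed fewer than 10 times in total.
min_frequency=10

if command -v qiime >/dev/null 2>&1; then
    # Remove low-abundance OTUs from the clustered feature table.
    qiime feature-table filter-features \
        --i-table table-or-97.qza \
        --p-min-frequency "$min_frequency" \
        --o-filtered-table table-or-97-filtered.qza

    # Keep only the representative sequences still present in the filtered table.
    qiime feature-table filter-seqs \
        --i-data rep-seqs-or-97.qza \
        --i-table table-or-97-filtered.qza \
        --o-filtered-data rep-seqs-or-97-filtered.qza
else
    echo "qiime not found: activate the QIIME 2 conda environment first" >&2
fi
```

The filtered sequences, rather than the full dereplicated set, would then be the query for taxonomy assignment, which should shrink the temporary output considerably.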

You should also perform chimera filtering as part of any OTU clustering workflow. q2-vsearch (the same plugin you used for clustering) has a chimera filtering method.
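A hedged sketch of de novo chimera filtering with q2-vsearch, applied to the clustered table and sequences from the original post. The output directory and output file names are assumptions; the guard only keeps the snippet from failing outside a QIIME 2 environment.

```shell
# Assumed output directory for the chimera check results.
out_dir=uchime-dn-out

if command -v qiime >/dev/null 2>&1; then
    # Flag chimeric features; writes chimeras, nonchimeras and stats artifacts.
    qiime vsearch uchime-denovo \
        --i-table table-or-97.qza \
        --i-sequences rep-seqs-or-97.qza \
        --output-dir "$out_dir"

    # Keep only the non-chimeric features in the table.
    qiime feature-table filter-features \
        --i-table table-or-97.qza \
        --m-metadata-file "$out_dir/nonchimeras.qza" \
        --o-filtered-table table-or-97-nonchimeric.qza

    # Apply the same selection to the representative sequences.
    qiime feature-table filter-seqs \
        --i-data rep-seqs-or-97.qza \
        --m-metadata-file "$out_dir/nonchimeras.qza" \
        --o-filtered-data rep-seqs-or-97-nonchimeric.qza
else
    echo "qiime not found: activate the QIIME 2 conda environment first" >&2
fi
```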

q2-feature-classifier also has other taxonomy classification methods... you could give those a try if the VSEARCH-based classifier is not to your liking.

As for whether the whole preprocessing is easy to do in QIIME 2: I'd say yes it is! No more challenging than, e.g., running those steps in QIIME 1. But that's not the cause of your issue.

I hope that helps!

ismailp · February 18, 2020, 9:52pm

Thanks for the swift response. As I said, my wife had already used QIIME 1. I was just curious about QIIME 2; would it somehow generate "better" results? (for some definition of better)

From what I gathered, she didn't do denoising because the authors of that dual-indexed primer paper didn't either, and they didn't say it was necessary. She ran ChimeraSlayer, but it didn't remove many sequences. filter_otus_from_otu_table.py wasn't used. The rest of the analysis, after taxonomy assignment, was done in R.

I don't have anything against vsearch. If it works, then I can use that. It has a serious bug, that's for sure.

The QIIME 2 tutorial wasn't as encouraging.

Thanks, I might try again with a different approach.

Nicholas_Bokulich · February 19, 2020, 11:49pm

Yes — the denoising methods and taxonomic classification methods wrapped/implemented in QIIME 2 are a methodological advancement over the OTU clustering and taxonomic classification methods used in QIIME 1. "Better" in this case translates to lower error rates, better resolution of unique sequence variants (as ASVs), and more accurate taxonomic classification.

However, by using cluster-features-open-reference you are effectively replicating the QIIME 1 protocol... VSEARCH (qiime2) may produce slightly different results from uclust (qiime1), but not necessarily better in the sense I've described above.

So I recommend giving the denoising methods in QIIME 2 a try! Not only will they generate better results, they should also eliminate the issue you encountered with q2-vsearch (aside from the fact that you won't be using vsearch, it will also reduce the OTU bloat that I suspect led to the tmp space error you described).

Good luck!
