Exclude-seqs error with imported FASTA file

Mehrbod_Estaki · March 6, 2019, 9:20am

qiime2 2019.1 on VM Version 5.2.26.

I'm creating a representative sequence FASTA file in R from a dada2 feature table.

uniquesToFasta(seqtab.nochim, fout='rep-seqs.fna')

This gives me a FASTA file such as:

>sq1;size=9172;
GGAATCGGGATTTCTCATTTCCTGACTAGGGTGATTCTGTCAGCGGTTGATGAATCATTTCACACTCTCTGCTCATTTTGCAATCACTTTTCCACTGCGATCTAAAGCAGATGCTGCAGGTGAGCAAGTATATGCAAGTCCCAGGATGTTGTTTTGTATCCAGAAGGCAGTGGCTGCCTTCCTTCCCCCCTCTTTTGTTGTCCTGGGTGGCAAAAAGCTTCAAGATCTAGCCTTTTCGAATGACAGAGGTTTGGACTGTGTGTTCCAAGTCTGATTTGAAACCAGACAGCTTTTAATATCTGTGAAAATCTGCTGCAGATGTTCAACAAGCAGATTCCCCCTAAAATAAAGCTTTATTCATCCTCCCAGGAAATGGTTCTGCACAGCCAGGAAGAGATACCCCAGTAGTC
>sq2;size=7551;
GGAATCGGGATTTCTCATTTCCTGACTAGGGTGATTCTGTCAGCGGTTGATGAATCATTTCACACTCTCTGCTCATTTTGCAATCACTTTTCCACTGCGATCTAAAGCAGATGCTGCAGGTGAGCAAGTATATGCAAGTCCCAGGATGTTGTTTTGTATCCAGAAGGCAGTGGCTGCCTTCCTTCCCCCCTCTTTTGTTGTCCTGGGTGGCAAAAAGCTTCAAGATCTAGCCTTTTCGAATGACAGAGGTTTGGACTGTGTGTTCCAAGTCTGATTTGAAACCAGACAGCTTTTAATATCTGTGAAAATCTGCTGCAGATGTTCAACAAGCAGATTCCCCCTAAAATAAAGCTTTATTCATCCTCCCAGGAAATGGTTCTGCACAGCCAGGAAGAGATACCCCTGTAGTC

I successfully import this as:

qiime tools import \
  --input-path rep-seqs.fna \
  --output-path rep-seq.qza \
  --type 'FeatureData[Sequence]'

Then I try to run q2-exclude-seqs:

qiime quality-control exclude-seqs \
  --i-query-sequences rep-seq.qza \
  --i-reference-sequences ../88_otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.65 \
  --p-perc-query-aligned 0.60 \
  --p-threads 6 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza \
  --verbose

The reference sequences are the Greengenes 88_otus, and I know the file is good because I've used it successfully as recently as q2-2018.11.

Then after a few minutes I get an error:

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/qiime2-archive-n8v9h7tt/f2d86460-19ac-4297-937a-123ec4495b91/data/dna-sequences.fasta --id 0.65 --strand both --maxaccepts 1 --maxrejects 0 --db /tmp/qiime2-archive-347oj9gd/cff8af2a-7a2a-4cd2-b45b-91bfc2b971a1/data/dna-sequences.fasta --threads 6 --userfields query+target+ql+qlo+qhi --userout /tmp/tmpgnhe27wx

vsearch v2.7.0_linux_x86_64, 10.2GB RAM, 6 cores
torognes (Torbjørn Rognes) · GitHub vsearch #edited for vanity

Reading file /tmp/qiime2-archive-347oj9gd/cff8af2a-7a2a-4cd2-b45b-91bfc2b971a1/data/dna-sequences.fasta 100%
15080850 nt in 10544 seqs, min 1258, max 2353, avg 1430
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching query sequences: 2210 of 2863 (77.19%)
Traceback (most recent call last):
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2cli/commands.py", line 274, in __call__
results = action(**arguments)
File "</home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/decorator.py:decorator-gen-192>", line 2, in exclude_seqs
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
output_types, provenance)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 365, in callable_executor
output_views = self._callable(**view_args)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/quality_control.py", line 32, in exclude_seqs
perc_query_aligned=perc_query_aligned, method=method)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/_blast.py", line 28, in _search_seqs
return _generate_assignments(cmd, perc_query_aligned)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/_blast.py", line 67, in _generate_assignments
hits = _extract_hits(output.name, perc_query_aligned)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/_blast.py", line 85, in _extract_hits
query_id, subject_id, query_len, start, end = line.split('\t')
ValueError: not enough values to unpack (expected 5, got 1)

Plugin error from quality-control:
not enough values to unpack (expected 5, got 1)

See above for debug info.

My initial hunch was that something was amiss with the imported FASTA's description lines. The IDs there are created automatically and are uchime-compatible according to the manual. I thought the ; might be throwing everything off, so I tried importing a new FASTA file in which each description line was exactly the same as the sequence itself, without any ;, and I received the same error.

Any help would be appreciated!

Nicholas_Bokulich · March 6, 2019, 1:19pm

Hi @Mehrbod_Estaki,
Thanks for sending this minimum working example. Unfortunately, I am not able to replicate your error locally — all is working fine for me. Any chance you could DM me the complete file?

Is this file working with 2018.11? Nothing changed in this plugin between then and now, so I would be very surprised if this is related to the version but it does not hurt to check.

Thanks!

timanix · March 6, 2019, 2:58pm

Hi!
I ran into the same error with my dataset some time ago, and it worked after I set higher parameters. I don't know whether or how that is connected. Later, I reran it with the same parameters that had produced the error, and that time it worked.
I have no idea what the cause is.

Mehrbod_Estaki · March 6, 2019, 6:28pm

No problem, I'll DM you shortly @Nicholas_Bokulich.

I was actually only referring to the 88_otus.qza file as a potential source of error but I can check with an older version if you can't replicate the error.

Nicholas_Bokulich · March 6, 2019, 10:29pm

Hey @Mehrbod_Estaki,
Nice cryptic issue. The problem is that your FASTA file has CRLF (Windows-style) line endings, most importantly at the ends of the FASTA header lines. VSEARCH interprets the carriage returns as part of the sequence IDs, so the vsearch alignment report gets FUBAR... sequence IDs and alignment stats get broken onto separate lines. This appears to be a wider problem with vsearch.
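
To make the failure mode concrete, here is a minimal Python sketch. It assumes the vsearch report is read back in Python's default text mode, where a lone carriage return counts as a line ending. The target ID and alignment coordinates are made up for illustration.

```python
import os
import tempfile

# Hypothetical --userout row (query, target, ql, qlo, qhi). The query ID
# carries the stray "\r" left behind by a CRLF-terminated FASTA header.
row = b"sq1;size=9172;\r\t12345\t412\t1\t412\n"

with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(row)
    report_path = handle.name

# Default text mode uses universal newlines, so "\r" alone ends a line.
with open(report_path) as report:
    field_counts = [len(line.split("\t")) for line in report]

os.remove(report_path)
print(field_counts)  # [1, 5]
```

The first "line" holds only the query ID, so unpacking it into five names raises the same "expected 5, got 1" error shown in the traceback.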

You can fix this with dos2unix, tr -d "\r" < file, or some other method that converts CRLF to LF line endings.
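
If neither tool is at hand, the same conversion can be sketched in a few lines of Python. The function name and the output file name in the usage note are hypothetical, chosen only for this example.

```python
def crlf_to_lf(src_path, dst_path):
    """Copy src_path to dst_path, rewriting CRLF line endings as LF."""
    # Binary mode keeps Python from translating line endings on its own.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(data.replace(b"\r\n", b"\n"))
```

For example, crlf_to_lf("rep-seqs.fna", "rep-seqs.unix.fna") would write a cleaned copy that can then be imported as before. The whole file is read into memory, which should be fine for a representative-sequence file of this size.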

In a future release of QIIME 2 these invisible line endings will be detected on import, since they are likely to wreak havoc with other methods, not just anything based on vsearch.
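
A check of this kind could look roughly like the sketch below. It is a hypothetical helper for screening a file before import, not QIIME 2's actual validation code.

```python
def lines_with_carriage_return(path):
    """Return 1-based numbers of lines in path that contain a carriage return."""
    flagged = []
    with open(path, "rb") as handle:
        # Binary mode splits on b"\n" only, so any b"\r" stays visible.
        for number, line in enumerate(handle, start=1):
            if b"\r" in line:
                flagged.append(number)
    return flagged
```

An empty list means the file has no carriage returns. Any other result lists the lines to clean up before running qiime tools import.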

Thanks for reporting!

Mehrbod_Estaki · March 7, 2019, 12:34am

Wow, that's a good solve! Thanks @Nicholas_Bokulich.

A simple dos2unix rep-seqs.fna did the trick here. Too easy.