vsearch classification, Fatal error: Invalid FASTA

Alex_Umbach · March 31, 2021, 7:58pm

Hello,

Long-time user, first-time poster.

I'm working with a non-16S dataset for which I have developed a reference database and the associated taxonomic information. Both of these have been imported as artifact files using the --type 'FeatureData[Sequence]' and 'FeatureData[Taxonomy]' options, and have been used successfully in training a naive-Bayes classifier.

However, I'm also interested in using the vsearch method in order to reduce the stringency of my taxonomic assignment and to compare the classification methods.

I have used the following command:

qiime feature-classifier classify-consensus-vsearch \
  --i-query cpn60_rep_seqs.qza \
  --i-reference-reads cpn60_refseqs_final.qza \
  --i-reference-taxonomy cpn60_taxonomy_final.qza \
  --p-maxaccepts 5 \
  --p-perc-identity 0.97 \
  --p-top-hits-only TRUE \
  --p-threads 2 \
  --o-classification cpn60_vsearch_taxonomy.qza \
  --verbose

and am receiving the following error:

"Command: vsearch --usearch_global /tmp/qiime2-archive-p854sa1k/f9468e64-1197-4218-b50e-dd9b7bbc0e28/data/dna-sequences.fasta --id 0.97 --query_cov 0.8 --strand both --maxaccepts 5 --maxrejects 0 --output_no_hits --db /tmp/qiime2-archive-ej5n_loa/0b738b91-7d24-4e29-903c-c8c793595dff/data/dna-sequences.fasta --threads 2 --top_hits_only --blast6out /tmp/tmp4kjexb30

vsearch v2.7.0_linux_x86_64, 125.9GB RAM, 16 cores
https://github.com/torognes/vsearch

Reading file /tmp/qiime2-archive-ej5n_loa/0b738b91-7d24-4e29-903c-c8c793595dff/data/dna-sequences.fasta 0%

Fatal error: Invalid FASTA - header must be terminated with newline"

I have validated the 'FeatureData[Sequence]' file using the qiime tools validate command and have double-checked its formatting. This file worked previously with the naive-Bayes classifier, so I'm not sure why it's failing now.

The FASTA file used to generate the FeatureData[Sequence] artifact contains sequences in the following format (with a > in front of each seq_id):

seq_id
seq_info
seq_id2
seq_info

Any recommendations would be appreciated! Thank you.

Nicholas_Bokulich · April 2, 2021, 3:31pm

Hey @Alex_Umbach,
Similar issues have been reported on the forum. In a nutshell, vsearch's FASTA requirements are a bit more stringent than those of sklearn and QIIME 2 (which is why you could fit a sklearn classifier and validate with QIIME 2 but not use the vsearch-based classifier). Specifically, vsearch does not recognize some line-ending characters that are used on different systems. The error message says it all:

Fatal error: Invalid FASTA - header must be terminated with newline
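
To see which line endings a FASTA file actually uses, one option is to count its carriage-return (CR) and line-feed (LF) bytes. The sketch below is only an illustration: `count_line_endings` is a hypothetical helper, not part of QIIME 2 or vsearch, and the file name in the usage comment is an assumption. A file with CR bytes but no LF bytes would be consistent with a header that is never "terminated with newline".

```shell
# Hypothetical helper (not a QIIME 2 or vsearch command):
# print the number of CR (\r) and LF (\n) bytes in a file.
count_line_endings() {
    # $1 = path to the FASTA file to inspect
    # tr -cd keeps only the named byte; wc -c counts what is left.
    # The trailing tr strips the padding some wc implementations add.
    cr=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
    lf=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
    echo "CR=$cr LF=$lf"
}

# Usage (file name is an assumption):
#   count_line_endings cpn60_refseqs_final.fasta
# Rough interpretation:
#   CR=0           -> Unix line endings (LF only)
#   CR>0 and LF=0  -> CR-only line endings
#   CR>0 and LF>0  -> probably Windows-style (CRLF) line endings
```

Running this on the original FASTA file, rather than on the .qza artifact, is probably the simplest check.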

I am guessing these files have Windows-style line breaks or, worse, were created with a Microsoft Office product (which inserts its own invisible line-break characters). The fix is to use mac2unix or another method to convert to Unix-style line breaks. You can search the forum archive for similar solutions ("mac2unix" is probably a good keyword to start with).
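
If mac2unix is not installed, "another method" could look something like the sketch below, which uses only tr and awk. This is a minimal illustration under assumptions: `normalize_line_endings` is a hypothetical helper, the FASTA file names are made up, and it drops empty lines as a side effect (which should be harmless for FASTA). The converted file would then need to be re-imported before rerunning the vsearch classifier.

```shell
# Hypothetical helper (not a QIIME 2 or vsearch command):
# rewrite CRLF and CR-only line endings as Unix LF line endings.
normalize_line_endings() {
    # $1 = input FASTA, $2 = output FASTA (must be a different file)
    # tr turns every CR into LF, so CRLF becomes two LFs;
    # awk then drops the resulting empty lines and ends every line with LF.
    tr '\r' '\n' < "$1" | awk 'length($0) > 0' > "$2"
}

# Usage (file names are assumptions):
#   normalize_line_endings cpn60_refseqs_final.fasta cpn60_refseqs_final.unix.fasta
#
# Then re-import the converted file, for example:
#   qiime tools import \
#     --type 'FeatureData[Sequence]' \
#     --input-path cpn60_refseqs_final.unix.fasta \
#     --output-path cpn60_refseqs_final.qza
```

Writing to a new file rather than editing in place keeps the original available for comparison.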

Good luck!
