Importing Error: is not a(n) QIIME1DemuxFormat file for fasta file

Patrick · July 29, 2019, 9:42pm

I am trying to use vsearch de novo OTU picking in qiime2 to create a reference database of v4_16S rRNA reads. I want to gather all unique reads at 99% ID. I have 3158266 reads after QC from all my samples, so it's taking a very long time to run.

So, to reduce the amount of data in the initial step, I first ran vsearch open OTU picking on each individual sample. I then exported the representative sequences from each sample and concatenated them to run vsearch open OTU picking for a second round.

However, when I try to import this fasta file back to a qza, I receive an error:
"There was a problem importing PensacolaCat_round1_dn99.fa.mod3.fa:

PensacolaCat_round1_dn99.fa.mod3.fa is not a(n) QIIME1DemuxFormat file"

Here are the head and tail of my input file:

$ head PensacolaCat_round1_dn99.fa.mod3.fa 
>PensacolaCatRound1dn99.1_1
CGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGGTGACTAAGTCGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCGTCCGATACTGGTCGCCTAGAGTATGGAAGAGGGAAGCGGAATTCCAGGTGTAGCGGTGAAATGCGTAGATATCTGGAGGAACATCAGTGGCGAAGGCGGCTTCCTGGTCCAATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACA
>PensacolaCatRound1dn99.2_1
GGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGTTAAGTCAGATGTGAAAGCCCCGGGCTCAACCTGGGAATTGCATTTGAAACTGGCAAGCTAGAATGCAGTAGAGGGAGGTGGAATTTCCGGTGTAGCGGTGAAATGCGTAGAGATCGGAAGGAACACCAGTGGCGAAGGCGGCCTCCTGGACTGACATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACA

$ tail PensacolaCat_round1_dn99.fa.mod3.fa 

GAGCGAACG
>PensacolaCatRound1dn99.610641_1
CAACCCTGGGACGCCACCTGATACTGCCGTGACTGGAGTCCGGTAGAGGAGCGTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCTATAGCG
>PensacolaCatRound1dn99.610642_1
ACCTAGGAAGTGCACTCGAAACTGCCTCGCTGGAGTGCCGGAGAGGAAAGCGGAATTCTCGG
>PensacolaCatRound1dn99.610643_1
GGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGGGTGCGTAGGTTGTTTATTGAGTCACATGTGAAATTCCCGGGCTTAACCTGGGCATGTCATGTGATACTGATAGACTTGAGTATGGGAGAGGGCAGTGGAATTTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAACACCGGTGGCGAAGGCGGCTGCCTGGCCTGATACTGACACTG

jwdebelius · July 30, 2019, 7:48am

Hi @Patrick,

How are you importing your sequences? They look like a FeatureData[Sequence] semantic type to me. Have you tried importing them that way? There are a fair number of good tutorials on importing sequences, so maybe have a look?

Best,
Justine

Patrick · July 30, 2019, 12:57pm

After I exported the representative sequences from each sample and concatenated them, I can import the result as a FeatureData[Sequence]. The problem is that vsearch does not take FeatureData[Sequence] as an input format. So when I run:

qiime vsearch dereplicate-sequences --i-sequences PensacolaCat_round1_dn99.qza --o-dereplicated-table table.qza --o-dereplicated-sequences rep-seqs.qza

I get this error:

(1/1) Invalid value for "--i-sequences": Expected an artifact of at least
type SampleData[Sequences] | SampleData[SequencesWithQuality] |
SampleData[JoinedSequencesWithQuality]. An artifact of type
FeatureData[Sequence] was provided.

I thought I could just edit the fasta headers from this >cf5c43911866d34cdb99ff57b6dbce4bc1a3 to this >PensacolaCatRound1dn99.1_1 to trick qiime2 into thinking I had provided a SampleData[Sequences], but the import tool saw through my deception and gave me the error I first posted: "PensacolaCat_round1_dn99.fa.mod3.fa is not a(n) QIIME1DemuxFormat file".

jwdebelius · July 30, 2019, 1:11pm

Hi @Patrick,

Maybe I’m missing something, but is there a motivation for renaming the centroids from their hashes for this application? If you just merge your sequences with feature-table merge-seqs, I think they should dereplicate, because your hash should be** consistent across all your OTUs. So, then, you’ll have a consistent reference set.
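
To illustrate why merging by hash dereplicates, here is a minimal Python sketch. It assumes the feature ID is a hash computed from the sequence text alone; the thread only calls it "the hash", so SHA-1 is used here purely for illustration. The helper names (`sha1_id`, `merge_by_hash`) and the toy sequences are hypothetical and are not part of QIIME 2 or vsearch.

```python
import hashlib


def sha1_id(seq: str) -> str:
    """Return a content-derived ID: the SHA-1 hex digest of the sequence text."""
    return hashlib.sha1(seq.encode("ascii")).hexdigest()


def merge_by_hash(*sample_seqs):
    """Merge per-sample sequence lists into one {id: sequence} dict.

    Identical sequences always hash to the same ID, so a sequence that
    appears in several samples ends up as a single entry.
    """
    merged = {}
    for seqs in sample_seqs:
        for seq in seqs:
            merged[sha1_id(seq)] = seq
    return merged


# Hypothetical toy centroids from two samples; "ACGTACGT" appears in both.
sample_a = ["ACGTACGT", "GGGTTTAA"]
sample_b = ["ACGTACGT", "TTTTCCCC"]

merged = merge_by_hash(sample_a, sample_b)
print(len(merged))  # 3 unique sequences from 4 inputs
```

Renaming the headers to something like PensacolaCatRound1dn99.1_1 throws this property away, because the ID no longer depends only on the sequence.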

A second question… why not use a denoiser? Then you could avoid the picking and re-picking.

Best,
Justine

**For almost all sequences.