Importing FASTA sequences for OTU picking/taxonomy assigning

adoloman · July 2, 2018, 12:01am

Hi there!
I am trying to import a FASTA file with contigs of paired sequences and to do taxonomy assignment on them.
I have been trying to use the VSEARCH command, without success. When I ran this command:

qiime tools import \
  --input-path sequencesMG.fasta \
  --output-path seqsMG.qza \
  --type 'SampleData[Sequences]'

I got an error:

There was a problem importing sequencesMG.fasta:
sequencesMG.fasta is not a(n) QIIME1DemuxFormat file

Is there another way to import FASTA sequences and get the associated table.qza along with them?

Another approach I tried was this command:

qiime tools import \
  --input-path otus_rep.fasta \
  --output-path sequencesMG.qza \
  --type 'FeatureData[Sequence]'

And then tried to do the taxonomy assignment with:

qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads sequencesMG.qza \
  --o-classification taxonomyMG.qza

which worked, but I cannot proceed to the next step, which is to actually visualize my assignments:

qiime taxa barplot \
  --i-table table.qza \
  --i-taxonomy taxonomyMG.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization taxa-bar-plotsMG.qzv

since I do not have table.qza.

Has anybody faced the same trouble?

thermokarst · July 2, 2018, 10:49pm

Hey hey @adoloman!

Does this FASTA file contain the feature counts in it? Can you provide a sample of the first few lines of the fasta file you are working with?

It seems like there are a million variations of FASTA-formatted data. The particular type that this import command expects is based on the QIIME 1 seqs.fna format, where all of the reads are multiplexed into one file. This format encodes the sample ID and the read ID in the FASTA record headers:

>sample-a_read0001
AAACCCGGGTTT
>sample-a_read0002
AAGGCCTTAACC
>sample-b_read0003
GGGGGGTTTTTT
>sample-b_read0004
CCTATTTTTTTT

Going back to my question above about whether your FASTA file has feature counts in it: this seqs.fna format does have that data, albeit indirectly (the same sequence might appear in multiple reads, and each occurrence would add one to that feature's count when the reads are dereplicated into a feature table).
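To make the "indirect counts" idea concrete, here is a minimal sketch using only standard shell tools, not QIIME itself. It assumes single-line sequences and headers of the form `>sampleID_readID`, with the sample ID being everything before the last underscore. The demo file and its contents are hypothetical, and this is only an illustration of the idea, not how the importer is implemented.

```shell
# Hypothetical demo file in the QIIME 1 seqs.fna style (sampleID_readID headers).
tmpdir=$(mktemp -d)
cat > "$tmpdir/seqs.fna" <<'EOF'
>sample-a_read0001
AAACCCGGGTTT
>sample-a_read0002
AAACCCGGGTTT
>sample-b_read0003
GGGGGGTTTTTT
>sample-b_read0004
AAACCCGGGTTT
EOF

# For each header, keep the sample ID (text before the last underscore of the
# first word); for each sequence line, count identical (sequence, sample) pairs.
counts=$(awk '
  /^>/ { sample = substr($1, 2); sub(/_[^_]*$/, "", sample); next }
       { n[$0 "\t" sample]++ }
  END  { for (k in n) print k "\t" n[k] }
' "$tmpdir/seqs.fna" | LC_ALL=C sort)

printf 'sequence\tsample\tcount\n%s\n' "$counts"
```

Each output row is one cell of a feature table: a unique sequence, a sample, and how many reads from that sample carried that sequence. A representative sequences file has no sample IDs in its headers, so this kind of counting is not possible from it.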

Anyway, a little preview of what you're working with would go a long way! Standing by!


adoloman · July 2, 2018, 11:08pm

Hi @thermokarst!
So the fasta file I am working with looks like this:

"> denovo0 gb.w.V_13558
CCTACGGGGGGCAGCAGTGAGGAATATTGGTCAATGGGCGCAAGCCTGAACCAGCCATCCCGCGTGAAGG
"> denovo1 agb.w.V_106381
CCTACGGGTGGCAGCAGTAGGGAATATTAGAAATGGACGAAAGTCTGATCTAGCAACACCGCGTGTGCGAAG
(added "" to present > signs)

And there is a space between "denovoX" and "gb...".

thermokarst · July 3, 2018, 10:34pm

Awesome, thanks for the sample!

BTW, you can just wrap the whole thing in triple-backticks to make a code block!

```
code block contents
```

Okay, on to the data...

These identifiers look like feature IDs to me (de novo OTU clustering output). That makes me think that this fasta file is FeatureData[Sequence], which implies that it does not contain the information necessary for determining feature counts. I also don't see any info in the record header to indicate those values either. So this looks to me like a pretty typical "representative sequences" file - the info you are looking for is most likely in the feature table that could (or should) have been produced alongside it.

Taking a step back to the root question here...

I think your second import command (the one using --type 'FeatureData[Sequence]') is the right one given the data sample you provided!

Your classify-sklearn command looks really good to me, too.

As you mentioned, you don't have a feature table. If that file isn't available to you from wherever you got this FASTA file, I am afraid there is no way to obtain this info, at least not with the data in hand.

The best I have to offer is this command:

qiime metadata tabulate \
  --m-input-file taxonomyMG.qza \
  --o-visualization taxonomyMG.qzv

This will generate a table of the taxonomy identified (you could also pass in the rep seqs to join the two tables, see this tutorial for an example).

Unfortunately, without a feature table to associate features to samples, there is no way to show which features were present in what abundance, per sample (which is what taxa barplot would provide you).

If you get your hands on a feature table (is there a BIOM table somewhere, maybe TSV formatted?), then you are in business; otherwise I think this is more or less the end of the line here. Sorry!

Let us know how it goes! :qiime2:

adoloman · July 3, 2018, 10:52pm

@thermokarst, thanks a lot for a detailed answer!

Indeed, this FASTA file is a representative sequences file, which I got from a person who hasn't classified it right. I am trying to dig into the mistakes.
I did find an associated BIOM table! What will be my steps to merge those two files and do the taxonomy analysis with the feature-classifier?

thermokarst · July 5, 2018, 2:37pm


Awesome, thanks for confirming!

Woohoo!

No need to merge! Just import the feature table and then run your taxa barplot command (assuming that the feature IDs match between these two files - they should if they came from the same analysis). Taxonomic analysis doesn't need the observation counts provided by the feature table, which is why you were able to do this taxonomic analysis already with just your rep seqs. You need the feature counts to compute relative abundance and then plot the taxonomic classifications with respect to those abundances.
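A quick way to check the "feature IDs match" assumption before importing is to compare the IDs in the two files with standard shell tools. The sketch below is only illustrative: the file names, IDs, sequences, and counts are hypothetical stand-ins, and it assumes the BIOM table has been exported to TSV with feature IDs in the first column and `#`-prefixed header lines.

```shell
# Hypothetical stand-ins for the real files: rep seqs and a BIOM table exported to TSV.
tmpdir=$(mktemp -d)
cat > "$tmpdir/otus_rep.fasta" <<'EOF'
>denovo0 example-read-1
ACGTACGTACGT
>denovo1 example-read-2
TTGGCCAATTGG
>denovo2 example-read-3
GGGGAAAACCCC
EOF
printf '#OTU ID\tsampleA\tsampleB\ndenovo0\t12\t0\ndenovo1\t3\t7\n' > "$tmpdir/table.tsv"

# Feature IDs in the FASTA: first word of each header, without the ">".
grep '^>' "$tmpdir/otus_rep.fasta" | sed 's/^>//; s/[[:space:]].*$//' \
  | LC_ALL=C sort -u > "$tmpdir/fasta_ids.txt"
# Feature IDs in the table: first column, skipping "#" comment/header lines.
grep -v '^#' "$tmpdir/table.tsv" | cut -f1 | LC_ALL=C sort -u > "$tmpdir/table_ids.txt"

# IDs that appear in one file but not the other.
only_in_fasta=$(LC_ALL=C comm -23 "$tmpdir/fasta_ids.txt" "$tmpdir/table_ids.txt")
only_in_table=$(LC_ALL=C comm -13 "$tmpdir/fasta_ids.txt" "$tmpdir/table_ids.txt")
echo "only in FASTA: ${only_in_fasta:-none}"
echo "only in table: ${only_in_table:-none}"
```

IDs that appear only in the table are the ones to worry about, since those features would have no taxonomy assigned. For the import itself, the command is usually along the lines of `qiime tools import` with `--type 'FeatureTable[Frequency]'` and an input format matching the BIOM version (for example `BIOMV210Format` for HDF5 tables). The exact flag names have changed between QIIME 2 releases, so `qiime tools import --help` is the place to confirm them.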

Keep us posted! :qiime2:

adoloman · July 6, 2018, 8:46pm

Great, it worked! Thanks a lot, @thermokarst!