Trouble importing MegaHit assemblies

Nick_D · September 24, 2019, 7:57am

I recently finished assembling some metagenomic shotgun sequencing data using MegaHit 1.2.8 and now I'm attempting to import the "final.contigs.fa" files into QIIME 2 for analysis, but I just can't seem to get it to work. I've gone through the importing tutorial and managed to import the files as FeatureData[Sequence], but I cannot import them as SampleData[Sequence] (which is the format I actually need to run the qiime vsearch dereplicate-sequences command).

Here's one of the commands I've tried and the resulting error message:

(qiime2-2019.4) nick@nick-MS-7994:~$ qiime tools import --type 'SampleData[Sequence]' --input-path '/home/nick/SequencingData/MGS-Data/1. Assembled/Raw/N-LLL.fa' --output-path '/home/nick/SequencingData/MGS-Data/1. Assembled/Imported/N-LLL.qza'
Traceback (most recent call last):
File "/home/nick/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/q2cli/builtin/tools.py", line 152, in import_data
view_type=input_format)
File "/home/nick/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/qiime2/sdk/result.py", line 213, in import_data
output_dir_fmt = pm.get_directory_format(type_)
File "/home/nick/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 157, in get_directory_format
% semantic_type)
TypeError: Semantic type SampleData[Sequence] does not have a compatible directory format.

An unexpected error has occurred:

Semantic type SampleData[Sequence] does not have a compatible directory format.

See above for debug info.

I've also tried this command, with the same result:

(qiime2-2019.4) nick@nick-MS-7994:~$ qiime tools import --type 'SampleData[Sequence]' --input-path '/home/nick/SequencingData/MGS-Data/1. Assembled/Raw/N-LLL.fa' --input-format DNASequencesDirectoryFormat --output-path '/home/nick/SequencingData/MGS-Data/1. Assembled/Imported/N-LLL.qza'

I originally thought the issue was due to the file headers having excess data in them other than just <sample-id>_<seq-id>, so I went through and deleted everything except for that; giving headers and sequences like this:

>k141_0
GCCAATCCTTACCTTCATAGAACTGGACAGGCTCAAGATCGAGACCTATATACCAGAGGAAGTCGCCTTACACTTATATGATAAAGAAAAACAGAAACTATGTCAAGTAGGAATCCATTTCGATACAGATCCCGAACGGACAATCACCCCTTCCGATTTATATGTGTCTAAAAGTACGACAAACAATAACCTCTCTTATCTACTGACCGCCATTATCCCCAACCCCGATATGGAATGGCTGGGCGGAATGAGTGGAATCCTCTCGATCGACCTACCCAAAGAAGAGCGGTCTCAAAATCTATGGCTTCCCTTAACGGCTATCTGTCATCGCCCCCAAAAAGG
>k141_42822
CGGGTACTGTTGTCATGAATGGAAAGCAATATATGTGGAATACATGGGGAGAAATACTGATTCCCGCATCAGATTCGCAGGTTTGGGCGACGTATGCCAATGAATTCTATGAAGGTGGTCCTGCCGTCACGTTCCGCAAGCTGGGCAAAGGCACGGTGACATATGTCGGAGTGGACAGCCATAATGGTGCATTGGAAAAAGATATCTTGAAGAAATTATATGCGCAACTGAGTATTCCCGTTATGGATTTGCCTTATGGAGTTACGGTGGAATACCGGAATGGTTTGGGGATAGTGCTGAATTATGCTGATCGTCCTTATACATTCAACTTACCTGAAGGAAGTAAGGTTTTGATAGGGACGAAAGAGATTCCGACAGCAGGGGTATTGGT

Any help with this would be greatly appreciated.

Nicholas_Bokulich · September 24, 2019, 4:12pm

Welcome to the forum, @Nick_D!

Try FeatureData[Sequence]

Let us know if that works for you!

Nick_D · September 25, 2019, 7:29am

Hi Nicholas,

Thanks for the help, I was able to successfully import the data using FeatureData[Sequence]. Unfortunately, when I try to use “vsearch dereplicate-sequences” it throws the following error:

(1/1) Invalid value for "--i-sequences": Expected an artifact of at least
type SampleData[Sequences] | SampleData[SequencesWithQuality] |
SampleData[JoinedSequencesWithQuality]. An artifact of type
FeatureData[Sequence] was provided.

Is there any way to import the assemblies as SampleData[Sequences], or to convert the FeatureData[Sequence] file to SampleData[Sequences] so that I can dereplicate it?

thermokarst · September 25, 2019, 2:23pm

The MegaHit assemblies are per-sample, right? You could multiplex all the samples into one file (while modifying the header lines to match the necessary format), then import that as SampleData[Sequences]. It's a bit of an awkward workflow though; I wonder if there is a better way for you to dereplicate outside of QIIME 2.

Nicholas_Bokulich · September 25, 2019, 3:34pm

Sorry @Nick_D, I read too quickly before and now see you already told me that you had managed to import the files as FeatureData[Sequence].

Essentially you are trying to run this tutorial, except that you have per-sample FASTA files, whereas the expected input is a single FASTA file rather than a directory of per-sample files. So, as @thermokarst advised, you would need to multiplex the samples into one file before importing.

Nick_D · September 26, 2019, 5:35am

Sorry, I don't think I was very clear about the format of the assemblies and it's causing a bit of confusion. I'm trying to import each FASTA file separately rather than as a directory, because they are from different metagenomic samples. Also, each discrete FASTA sample/file has about 50,000 assemblies in it. (See screenshot below)

thermokarst · September 26, 2019, 1:20pm

Hey @Nick_D!

I disagree that you weren't clear; I think you explained it very well!

I also don't think there is any confusion: @Nicholas_Bokulich and I are on the same page as you (more below).

This is what we mean by "per-sample assemblies": you have one file per sample containing the assembly sequences. As I mentioned above, in order to dereplicate in QIIME 2 you will need to multiplex (which is a step in the reverse direction) before importing and dereplicating.

For what it's worth, I think in the future it will make sense for us to create a new import format in q2-types to support this schema.

Sample workflow

cd path/to/fasta/files

touch merged.fasta

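# Prefix every header with the sample ID taken from the file name,
# e.g. ">k141_0" in N-LLL.fasta becomes ">N-LLL_k141_0".
for f in *.fasta; do
  [ "$f" = merged.fasta ] && continue        # do not read the output file back in
  fn="$(basename -- "$f")"; fn="${fn%.*}"    # file name without its extension
  sed "s/^>\(.*\)$/>${fn}_\1/" "$f" >> merged.fasta
done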

qiime tools import \
  --input-path merged.fasta \
  --output-path seqs.qza \
  --type 'SampleData[Sequences]'

qiime vsearch dereplicate-sequences \
  --i-sequences seqs.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza \
  --verbose

qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv
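
To make the header-rewriting step concrete, here is a minimal sketch that runs the same idea on two toy files. The file names (sampleA.fasta, sampleB.fasta), the scratch directory, and the sequences are made up for illustration; only the <sample-id>_<seq-id> header convention comes from the thread.

```shell
# Toy demonstration of prefixing FASTA headers with a sample ID.
# All file names and sequences below are hypothetical.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

printf '>k141_0\nACGT\n>k141_1\nGGCC\n' > sampleA.fasta
printf '>k141_0\nTTAA\n' > sampleB.fasta

: > merged.fasta                             # start from an empty output file
for f in *.fasta; do
  [ "$f" = merged.fasta ] && continue        # skip the output file itself
  fn="${f%.*}"                               # sampleA.fasta -> sampleA
  sed "s/^>\(.*\)$/>${fn}_\1/" "$f" >> merged.fasta
done

# Show only the rewritten header lines
grep '^>' merged.fasta
```

The three headers should come out as >sampleA_k141_0, >sampleA_k141_1 and >sampleB_k141_0, while the sequence lines are left untouched. Checking the headers this way before running qiime tools import is a cheap way to catch a quoting mistake in the sed expression.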

Nick_D · September 27, 2019, 5:41am

Thank you so much for the help, I really appreciate it! I think I mixed up the terminology and confused myself; I'm glad you guys knew what I was talking about anyway.

I tried the workflow you posted, but when I pasted the code in the terminal it threw this error:

Did I do that correctly?

Nick_D · September 30, 2019, 3:39am

Just in case anyone else runs into the same problem as I did, I found another work-around to get the sequences imported:

1. Download bbtools (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/reformat-guide/)
2. Decompress the tool archive.
3. cd to the main tool directory.
4. Drag/Drop reformat.sh into terminal.
5. Type: in='InputFileLocation.fa' out='WhereYouWantTheOutputFile.fasta' fastawrap=10000
6. Done

The "fastawrap=" value needs to be set high enough to not word-wrap any sequences in your .fa file.

It’s a bit of a clumsy (I’m not a programmer) method to fix the MegaHit assemblies so that they import as SampleData[Sequences], but it gets the job done.
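
For anyone without BBTools, the same unwrapping can probably be done with a short awk script. This is a sketch of an alternative, not what was used in the post above: the file names (wrapped.fa, unwrapped.fasta) are placeholders and the toy input is made up.

```shell
# Sketch: put each FASTA record's sequence on a single line using awk.
# wrapped.fa and unwrapped.fasta are hypothetical file names.
workdir="$(mktemp -d)"
printf '>k141_0\nACGT\nTTGA\n>k141_1\nGG\nCC\nAA\n' > "$workdir/wrapped.fa"

awk '
  /^>/ { if (seq != "") print seq; print; seq = ""; next }  # header: flush previous sequence, print header
       { seq = seq $0 }                                     # sequence line: append to current record
  END  { if (seq != "") print seq }                         # flush the last record
' "$workdir/wrapped.fa" > "$workdir/unwrapped.fasta"

cat "$workdir/unwrapped.fasta"
```

On the toy input this should leave two records, each with its header followed by one unbroken sequence line. Unlike fastawrap=, there is no length limit to choose, but the script assumes plain Unix line endings.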