I recently finished assembling some metagenomic shotgun sequencing data using MEGAHIT 1.2.8 and now I'm attempting to import the "final.contigs.fa" files into QIIME 2 for analysis, but I just can't seem to get it to work. I've gone through the importing tutorial and managed to import the files as FeatureData[Sequence], but I cannot import them as SampleData[Sequence] (which is the format I actually need to run the qiime vsearch dereplicate-sequences command).
Here's one of the commands I've tried and the resulting error message:
(qiime2-2019.4) nick@nick-MS-7994:~$ qiime tools import --type 'SampleData[Sequence]' --input-path '/home/nick/SequencingData/MGS-Data/1. Assembled/Raw/N-LLL.fa' --output-path '/home/nick/SequencingData/MGS-Data/1. Assembled/Imported/N-LLL.qza'
Traceback (most recent call last):
File "/home/nick/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/q2cli/builtin/tools.py", line 152, in import_data
view_type=input_format)
File "/home/nick/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/qiime2/sdk/result.py", line 213, in import_data
output_dir_fmt = pm.get_directory_format(type_)
File "/home/nick/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 157, in get_directory_format
% semantic_type)
TypeError: Semantic type SampleData[Sequence] does not have a compatible directory format.
An unexpected error has occurred:
Semantic type SampleData[Sequence] does not have a compatible directory format.
See above for debug info.
I've also tried this command, with the same result:
I originally thought the issue was due to the file headers having excess data in them other than just <sample-id>_<seq-id>, so I went through and deleted everything except for that; giving headers and sequences like this:
Thanks for the help, I was able to successfully import the data using FeatureData[Sequence]. Unfortunately, when I try to use "vsearch dereplicate-sequences" it throws the following error:
(1/1) Invalid value for "--i-sequences": Expected an artifact of at least
type SampleData[Sequences] | SampleData[SequencesWithQuality] |
SampleData[JoinedSequencesWithQuality]. An artifact of type
FeatureData[Sequence] was provided.
Is there any way to import the assemblies as SampleData[Sequences], or convert the FeatureData[Sequence] file to SampleData[Sequences] so that I can dereplicate it?
The MEGAHIT assemblies are per-sample, right? You could multiplex all the samples (while modifying the deflines to match the necessary format), then you can import as SampleData[Sequences]. It's a bit of an awkward workflow though, so I wonder if there is a better way for you to dereplicate outside of QIIME 2.
Sorry @Nick_D I read too quickly before and now see you already told me that:
Essentially you are trying to run this tutorial, except that you have per-sample fasta, whereas the fasta format expected should be in a single file, rather than a directory of per-sample fasta files. So as @thermokarst advised:
Sorry, I don't think I was very clear about the format of the assemblies and it's causing a bit of confusion. I'm trying to import each FASTA file separately rather than as a directory, because they are from different metagenomic samples. Also, each discrete FASTA sample/file has about 50,000 assemblies in it. (See screenshot below)
I don't think so, I think @Nicholas_Bokulich & I are on the same page as you (more below).
This is what we mean by "per-sample assemblies" --- you have one file per sample containing the assembly sequences. As I mentioned above, in order to dereplicate in QIIME 2 you will need to multiplex (which is a step in the reverse direction) before importing and dereplicating.
For what it's worth, I think in the future it will make sense for us to create a new import format in q2-types to support this schema.
Sample workflow
cd path/to/fasta/files
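# Start from a clean output file; removing it first also keeps it out of the *.fasta glob below
rm -f merged.fasta
# Prefix every header with the file's base name to get <sample-id>_<seq-id>
for f in *.fasta; do fn="$(basename -- "$f")"; fn="${fn%.*}"; sed "s/^>/>${fn}_/" "$f" >> merged.fasta; done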
qiime tools import \
--input-path merged.fasta \
--output-path seqs.qza \
--type 'SampleData[Sequences]'
qiime vsearch dereplicate-sequences \
--i-sequences seqs.qza \
--o-dereplicated-table table.qza \
--o-dereplicated-sequences rep-seqs.qza \
--verbose
qiime feature-table summarize \
--i-table table.qza \
--o-visualization table.qzv
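Before running the import step above, it may be worth checking that merged.fasta has the shape the importer expects. The following is only a sketch under assumptions that are not confirmed in this thread: that each record must be exactly two lines (header, then sequence), and that the sample ID is whatever precedes the rightmost underscore of the header ID. The function name check_merged and the toy input are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check a multiplexed FASTA before `qiime tools import`.
# Assumptions (not confirmed in this thread): two lines per record, and the
# sample ID is the part of the header ID before the rightmost underscore.
check_merged() {
  awk '
    NR % 2 == 1 {                       # odd lines must be headers
      if ($0 !~ /^>/) { printf "line %d: expected a header\n", NR; bad = 1; next }
      id = substr($1, 2)                # header ID without ">" and description
      if (id !~ /^.+_.+$/) { printf "line %d: no <sample-id>_<seq-id>\n", NR; bad = 1; next }
      sub(/_[^_]*$/, "", id)            # drop the seq-id after the last underscore
      count[id]++
      next
    }
    /^>/ { printf "line %d: expected a sequence\n", NR; bad = 1 }
    END {
      for (s in count) printf "%s\t%d\n", s, count[s]   # sequences per sample
      exit bad
    }
  ' "$1"
}

# Toy input standing in for merged.fasta
demo="$(mktemp)"
printf '>A_1\nACGT\n>A_2\nGGCC\n>B_1\nTTAA\n' > "$demo"
result="$(check_merged "$demo" | sort)"
echo "$result"
```

If the sample IDs printed here are not the ones you expect (for example because the contig names themselves contain underscores), the headers probably need further renaming before import.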
Thank you so much for the help, I really appreciate it! I think I mixed up the terminology and confused myself; I'm glad you guys knew what I was talking about anyway.
I tried the workflow you posted, but when I pasted the code in the terminal it threw this error:
Done
The "fastawrap=" needs to be set high enough to not word-wrap any sequences in your .fa file.
It's a bit of a clumsy (I'm not a programmer) method to fix the Megahit assemblies so that they import as SampleData[Sequence], but it gets the job done.
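The "fastawrap=" remark suggests that the import only works when each sequence sits on a single line. If your tool has no such option, the wrapped lines can be joined with a few lines of awk. This is a minimal sketch, not the method used above; the function name unwrap_fasta and the toy input are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join wrapped sequence lines so each FASTA record is
# exactly one header line followed by one sequence line.
unwrap_fasta() {
  awk '
    /^>/ {                      # header: flush the previous sequence first
      if (seq != "") print seq
      print
      seq = ""
      next
    }
    { seq = seq $0 }            # sequence: append without the line break
    END { if (seq != "") print seq }
  ' "$1"
}

# Toy input with one sequence wrapped over two lines
wrapped="$(mktemp)"
printf '>A_1\nACGT\nTTGA\n>A_2\nGG\n' > "$wrapped"
unwrapped="$(unwrap_fasta "$wrapped")"
echo "$unwrapped"
```

On a real file you would redirect the output, for example `unwrap_fasta N-LLL.fa > N-LLL.unwrapped.fa`, and run the renaming and import on the unwrapped copy.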