Import fasta files without quality

hdoris · November 1, 2024, 7:42pm

Hey Qiime2 community!

I am currently trying to compare samples from several papers, and they all make their data publicly available in different formats. I have been able to navigate most of the imports, but a few papers post their data as fasta files rather than fastq files, with each sample in a separate fasta file. For example, there are 41 different samples named Sample01.fasta, Sample02.fasta, Sample03.fasta .... Sample41.fasta. I have found this forum issue but it doesn't give a solution, just that it was solved:

I have tried to import the data using a conda installed Qiime2 version 2024.5 by putting all the samples into one directory (Seq_files) and then use the command:

qiime tools import --input-path Seq_files/ --output-path seqs.qza --type 'SampleData[Sequences]'

It does not successfully run and I get the error:

There was a problem importing Seq_files/:

Missing one or more files for QIIME1DemuxDirFmt: 'seqs.fna'

How can I import multiple fasta files that do not contain quality?

Another issue I run into is importing data without quality that comes as separate forward and reverse fasta files. So I have Forward.fasta and Reverse.fasta for one sample. There is no quality included. I have tried using a manifest.tsv file for this to see if that would work, but I keep getting an error. I run the same command as above and get a similar error:

qiime tools import --input-path manifest.tsv --output-path seqs.qza --type 'SampleData[Sequences]'

It does not successfully run and I get the error:

There was a problem importing manifest.tsv:

manifest.tsv is not a(n) QIIME1DemuxFormat file

Is there a way to import data without quality both after it is already merged and before? Any advice would be greatly appreciated!

Thanks!

colinvwood · November 4, 2024, 6:55pm

Hello @hdoris,

Have you looked at the FASTA file importing docs?

hdoris · November 14, 2024, 8:29pm

Hey @colinvwood

So I have tried to look through the importing docs but ran into a few problems. One is that all of my sequences had already been demultiplexed, so they were already separated into individual samples. I have since combined them into one sequence file but am still getting an error, apparently because my sequences without quality are in the wrong format. I am not entirely sure why. This is what my sequence file looks like:

S7.W2_0 M00232:79:000000000-D0MPC:1:1101:14471:1401 1:N:0:0 orig_bc=TCCGACACAATT new_bc=TCCGACACAATT bc_diffs=0
TACGTAGAGTGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGATAGT
AAAGTCAAGCGTGAAAGGTGTCGGCTTAACCGACAGACTGCGTTTGAAACTGATTATCTT
GAGTGTAACAGAGGAGAGTGGAATTCCTAGTGTAGTGGTGAAATACGTAGATATTAGGAA
GAACACCAGTGG
S7.W1_1 M00232:79:000000000-D0MPC:1:1101:14210:1413 1:N:0:0 orig_bc=TGAGTCACTGGT new_bc=TGAGTCACTGGT bc_diffs=0
TACGTAGAGTGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGATAGT
AAAGTCAAGCGTGAAAGGTGTCGGCTTAGCCGACAGACTGCGTTTGAAACTGATTATCTT
GAGTGTAACAGAGGAGAGTGGAATTCCTAGTGTAGTGGTGACATACGTAGATATTAGGAA
GAACACCAGTGGCGA
S6.W1_83556 M00232:25:000000000-D098G:1:1102:15730:29303 1:N:0:0 orig_bc=TAGGATTGCTCG new_bc=TAGGATTGCTCG bc_diffs=0
TACGTAGAGTGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGATAGT
AAAGTCAAGCGTGAAAGGTGTCGGCTTAACCGACAGACTGCGTTTGAAACTGGTTATCTT
GAGTGTAACAGAGGAGAGTGGAATTCCTAGTGTAGTGGTGAAATACGTAGATATTAGGAA
GAACACCAGTGGCGAAGGCGACTCTCTGGGTTAACACTGACGCTGAGGTACGAAAGCTGG
GGGAGCAAACG

Do I need to remove the middle section of the header? For example, remove 'M00232:25:000000000-D098G:1:1102:15730:29303'?

Lastly, looking through the importing docs I can't seem to find an option for importing a fasta file without quality that has a forward and a reverse. I have two sequence files for each sample, one forward and one reverse. How do I upload these? I can't seem to find an option for this. Does qiime have a process for merging these reads first so that then I can import the sample the same way as above?

Hannah

colinvwood · November 15, 2024, 5:50pm

Hello @hdoris,

Fasta files per sample are not commonly a starting point for analysis, which is why we don't have support for them without some workarounds. @Nicholas_Bokulich pointed out that you could create artificial quality scores to transform these data into fastq files, or you could merge them into a single file so that the single file fasta import works. Both of these are less ideal than starting from scratch with the original fastq files. Do you have access to these?
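A minimal sketch of the "merge them into a single file" workaround, assuming the sample ID can be taken from each file name (Sample01.fasta becomes Sample01) and that reads can be renamed to `<sample-id>_<counter>`, the ID pattern seen in the seqs.fna-style headers later in this thread. The function names `read_fasta` and `combine_fastas` are hypothetical helpers, not part of QIIME 2. Wrapped sequences are joined so that every record is written as exactly one header line and one sequence line.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def combine_fastas(input_dir, output_path, pattern="*.fasta"):
    """Merge per-sample FASTA files into one two-line-per-record file.

    Assumption: the file stem is the sample ID. Original read names are
    discarded and replaced with <sample-id>_<running counter>.
    """
    total = 0
    with open(output_path, "w") as out:
        for fasta in sorted(Path(input_dir).glob(pattern)):
            sample_id = fasta.stem
            for _, seq in read_fasta(fasta):
                out.write(f">{sample_id}_{total}\n{seq}\n")
                total += 1
    return total
```

Calling `combine_fastas("Seq_files", "seqs.fna")` would then produce a single file to pass to `qiime tools import` with `--type 'SampleData[Sequences]'`. Whether the result imports cleanly has not been verified here.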

hdoris · November 15, 2024, 6:15pm

I do not have access to these files. We are doing a large dataset comparison from over 50 different manuscripts, and I have reached out to several people to get their raw reads and have failed with a few. So we are still trying to get these datasets to work. They have deposited sequences without quality and I am hoping I can find a workaround for them.

For one manuscript they provide a forward and a reverse fasta file without quality for one sample. I am not exactly sure how to move forward with this sample. I suppose, like you mentioned, I could give false quality scores? I am not entirely sure how I could do this, though. Just write a Python script that would insert them, I suppose?
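A minimal sketch of such a script, assuming Phred+33 encoding and a single placeholder quality character for every base. `fasta_to_fastq`, `iter_fasta`, and `quality_char` are hypothetical names chosen for illustration. The scores are invented, so they carry no information about the actual read quality.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_fastq(fasta_path, fastq_path, quality_char="I"):
    """Write a FASTQ file where every base gets the same placeholder score.

    "I" is ASCII 73, which is Q40 in Phred+33. This is an artificial value.
    Returns the number of records written.
    """
    count = 0
    with open(fasta_path) as src, open(fastq_path, "w") as out:
        for header, seq in iter_fasta(src):
            out.write(f"@{header}\n{seq}\n+\n{quality_char * len(seq)}\n")
            count += 1
    return count
```

For the paired case this would be run once per file, for example `fasta_to_fastq("Forward.fasta", "Forward.fastq")` and again for Reverse.fasta, on the assumption that both files list the reads in the same order.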

For the manuscripts that had several samples all in different fasta files, I just concatenated them and tried to import them as described in the importing docs. I keep getting an error that says:

sequences.fna is not a(n) QIIME1DemuxFormat file

Is that because my header reads:

S7.W2_0 M00232:79:000000000-D0MPC:1:1101:14471:1401 1:N:0:0 orig_bc=TCCGACACAATT new_bc=TCCGACACAATT bc_diffs=0

and it should read:

S7.W2_0 orig_bc=TCCGACACAATT new_bc=TCCGACACAATT bc_diffs=0

Do I just need to remove that center section?

Thanks so much for the help!

colinvwood · November 15, 2024, 8:58pm

Hello @hdoris,

I see. The issue could indeed be with the header, especially if the two headers are separated by a newline. The first line looks like a typical fastq header. Fasta headers usually begin with a > symbol, though I'm unsure off the top of my head if our fasta formats enforce this. Did the error provide any more detail or a python traceback?

hdoris · November 16, 2024, 1:39am

No the only information it gave for the error was:

There was a problem importing sequences.fna:

sequences.fna is not a(n) QIIME1DemuxFormat file

I will try to remove the middle part of the header to see if that works. The header does contain '>' at the start of each name and I am not sure why it was not seen in the forum post. I might have typed something wrong.

As for the forward and reverse fasta sequences how might I trick qiime2 to make it think it is a fastq file?
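One possible route, sketched under assumptions: once the forward and reverse files have been given placeholder quality scores and saved as FASTQ, they could be listed in a paired-end manifest. The column names below follow the V2 manifest layout as I understand it from the importing docs. `write_paired_manifest` and the `samples` mapping are hypothetical names for illustration.

```python
import csv
from pathlib import Path


def write_paired_manifest(samples, manifest_path):
    """Write a tab-separated paired-end manifest.

    samples maps a sample ID to a (forward_fastq, reverse_fastq) tuple.
    Paths are converted to absolute paths, as the manifest columns require.
    """
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([
            "sample-id",
            "forward-absolute-filepath",
            "reverse-absolute-filepath",
        ])
        for sample_id, (fwd, rev) in sorted(samples.items()):
            writer.writerow([sample_id, Path(fwd).resolve(), Path(rev).resolve()])
```

The manifest would then be imported with `--type 'SampleData[PairedEndSequencesWithQuality]'` and `--input-format PairedEndFastqManifestPhred33V2` rather than `SampleData[Sequences]`. Because the scores are artificial, any downstream step that reads quality values should be avoided.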

ebolyen · November 20, 2024, 7:07pm

It does seem the code is pretty light on the explanation when it errors:

github.com/qiime2/q2-types/blob/dev/q2_types/per_sample_sequences/_formats.py#L406-L448

    def _validate(self, filehandle, *, num_records):
        ids = set()
        for (header, seq), _ in zip(itertools.zip_longest(*[filehandle] * 2),
                                    range(num_records)):
            if header is None or seq is None:
                # Not exactly two lines per record.
                raise Exception()

            header = header.rstrip('\n')
            seq = seq.rstrip('\n')

            id = self._parse_id(header)
            if id in ids:
                # Duplicate header ID.
                raise Exception()

            self._validate_id(id)
            self._validate_seq(seq)

            ids.add(id)


I actually don't think it's the middle section necessarily, as it uses a str.split() which should ignore everything after the first space.

It's possible there's a duplicate ID, or the number after the first _ isn't increasing to keep the reads unique.
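Judging from the quoted `_validate` snippet, records are read as pairs of lines (one header, one sequence), so a file whose sequences wrap across several lines, like the excerpt posted earlier in this thread, would presumably fail as well. Below is a rough diagnostic sketch that mimics those two checks (line pairing and duplicate IDs) and reports where they break. It is not the q2-types validator and does not reproduce its ID or sequence checks. `check_two_line_fasta` is a hypothetical helper name.

```python
import itertools


def check_two_line_fasta(path, max_problems=20):
    """Return (line_number, message) tuples for pairing or duplicate-ID issues."""
    problems, seen = [], set()
    with open(path) as handle:
        # Read the file two lines at a time, as the quoted snippet does.
        pairs = itertools.zip_longest(*[handle] * 2)
        for n, (header, seq) in enumerate(pairs):
            line_no = 2 * n + 1
            header = header.rstrip("\n")
            if not header.startswith(">"):
                problems.append((line_no, "expected a header starting with '>'"))
            elif seq is None or seq.startswith(">") or not seq.strip():
                problems.append((line_no, "header is not followed by a sequence line"))
            else:
                fields = header[1:].split()
                read_id = fields[0] if fields else ""
                if read_id in seen:
                    problems.append((line_no, f"duplicate ID {read_id!r}"))
                seen.add(read_id)
            if len(problems) >= max_problems:
                break
    return problems
```

An empty list from `check_two_line_fasta("sequences.fna")` would suggest the file at least has two lines per record and unique IDs. A report at line 3 on a wrapped file would point to line wrapping rather than the middle section of the header.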

I think the goal should be to use vsearch dereplicate-sequences which accepts SampleData[Sequences]. Downside, we don't have a method which joins reads without quality scores. So you may need to find a tool to do that.

Sorry, I don't know if that's super helpful. I would also support the concept of laundering in quality scores so you can get past the import step. Just don't run anything that would consider them after the fact (i.e. only use dereplicate-sequences and then proceed to your strategy in the other thread).

system · December 22, 2024, 1:08am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.