More FASTA import issues

Hello,
I’ve sifted through a lot of similar topics regarding this but still seem to be having some issues - apologies if this has indeed been covered though!

I have a ~4 GB FASTA file produced by a company that carried out 16S sequencing for us. From what I can tell, it has already been through a number of pre-processing steps: barcodes have been trimmed, I presume some quality filtering has been carried out, and reads have been assigned sample IDs. The records look like this:

>100_3
TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAAGGAAGAAGTATCTCGGTATGTAAACTTCT
ATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGG
GGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGCGGCGGAGCAAGTCAGAAGTGAAAGCCCGGGGCTCAA
CCCCGGGACGGCTTTTGAAACTGCCCTGCTTGATTTCAGGAGAGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTA
GATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACTGACAATGACGCTGAGGCTCGAAAGCGTGGGGAGCAAA
CAGG
>100_2
TGGGGAATTTTGCGCAATGGGGGAAACCCTGACGCAGCAACGCCGCGTGCGGGACGAAGGCCTTCGGGTTGTAAACCGCT
TTCAGCAGGGAAGAACCGAGACGGTACCTGCAGAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGG

This makes sense: “100” is one of our sample IDs and I presume the other numbers are read numbers. However, when I try to import this file into QIIME 2 as a SampleData[Sequences] artefact, I get an error about the data not being a valid QIIME 1 demux format. Looking at the specs for that format, I can see why. My question is: is there a way to modify this FASTA so it is suitable for QIIME 2, or do I need to go back to the raw data (which may or may not be possible!)? My plan was to import this into QIIME 2, dereplicate with vsearch and then carry on with the standard workflow. QIIME 2 will take this file as FeatureData[Sequence], but I think that’s incorrect, as these are not representative sequences but essentially “raw” reads.

Thanks for your help!

Hey @stuartastbury,

Actually, your posted snippet looks like it should work for the QIIME1DemuxFormat, so it’s kind of strange that it’s not working. The rules for the format are here.

Since this format isn’t using our newer validate API, there’s not a lot QIIME 2 can do to explain why it’s not working, but I could try testing it out with a subset of the data if you were able to provide that (via DM is fine).

That is correct.


Of course, if you are able to track down the raw data, that would be quite a bit better as you could use a denoiser like DADA2.

Thanks for the help @ebolyen!

I’ve done a bit more digging and checked for duplicate sequence IDs, headers with no sequence data, and even lowercase acgt, and none of these turned up anything! Happy to DM a subset of reads - let me know how many and I can send a link. I presume a random sample is the best way?
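
For anyone wanting to repeat those three checks, here is a minimal Python sketch of how they could be done. It is not the method used in this thread: the function name `check_fasta` and the returned keys are made up for illustration, and it keeps every record ID in memory, which may be heavy for a file of several GB.

```python
from collections import Counter


def check_fasta(lines):
    """Run basic sanity checks over an iterable of FASTA lines.

    Hypothetical helper: reports duplicate record IDs, headers that are
    not followed by any sequence, and records containing lowercase letters.
    """
    id_counts = Counter()
    empty = []          # IDs of headers with no sequence lines after them
    lowercase = set()   # IDs of records with lowercase characters
    current, has_seq = None, False

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if current is not None and not has_seq:
                empty.append(current)
            current = line[1:].strip()
            id_counts[current] += 1
            has_seq = False
        elif line.strip():
            has_seq = True
            if current is not None and line != line.upper():
                lowercase.add(current)

    # Close the final record.
    if current is not None and not has_seq:
        empty.append(current)

    return {
        "duplicates": sorted(i for i, n in id_counts.items() if n > 1),
        "empty": empty,
        "lowercase": sorted(lowercase),
    }
```

It could be called as `check_fasta(open("path/to/seqs.fasta"))`, where the path is a placeholder. Three empty lists in the result would match what is described above: none of these checks flags a problem.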

Hey there @stuartastbury!

I don't know if that is necessary - it looks like the following format rule is being broken here:

The examples you provided above appear to split the sequence over multiple lines (at least as formatted here on the forum). If that is the case in the source file too, then that is likely what is going wrong here. If you concatenate those lines so that each sequence spans only one line, you should be good to go. Keep us posted!
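
In case it helps, here is a rough Python sketch of that concatenation step. It assumes the only problem is wrapped sequence lines; `unwrap_fasta` and `unwrap_file` are hypothetical names, and it has not been run against the real file.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()      # drops \n, \r\n and stray spaces
        if not line:
            continue            # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header
                yield "".join(chunks)
            # Anything seen before the first header is discarded here.
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header
        yield "".join(chunks)


def unwrap_file(src, dst):
    """Stream src to dst one record at a time (both paths are placeholders)."""
    with open(src) as fin, open(dst, "w") as fout:
        for out_line in unwrap_fasta(fin):
            fout.write(out_line + "\n")
```

Because it only holds one record in memory at a time, something like `unwrap_file("path/to/seqs.fasta", "path/to/seqs.unwrapped.fasta")` should cope with a large file. Stripping each line also removes Windows-style carriage returns, which are a common reason a simple one-liner appears to have no effect.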

Hi @thermokarst!

I should have mentioned that above. I did think that might be the issue, but I’ve tried an awk one-liner to remove the line breaks (https://www.biostars.org/p/9262/) and that doesn’t seem to do the trick (though there may be better ways of achieving this?!). Hopefully not sounding too dim here, but how can I tell, e.g. in the Terminal, whether my sequence is “really” split over multiple lines or is just being displayed that way?

cat -e path/to/seqs.fasta

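This will show non-printing characters (for example, a carriage return appears as ^M), as well as a $ at the end of each line. If a $ appears partway through a sequence, the sequence really is split over multiple lines in the file.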
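
If scrolling through `cat -e` output for a 4 GB file is unwieldy, the same question can be answered programmatically. This is only a sketch: `describe_line_breaks` and its `limit` parameter are hypothetical, and it reads the file in binary mode so that carriage returns are not silently translated.

```python
def describe_line_breaks(lines, limit=1000):
    """Summarise line breaks in the first `limit` records of a FASTA file.

    `lines` is an iterable of bytes, e.g. a file opened with mode "rb".
    """
    records = 0
    seq_lines = 0            # sequence lines seen in the current record
    max_seq_lines = 0        # largest number of lines any one sequence spans
    carriage_returns = False

    for raw in lines:
        if b"\r" in raw:
            carriage_returns = True
        if raw.startswith(b">"):
            if records == limit:
                break        # enough records inspected
            records += 1
            seq_lines = 0
        elif raw.strip():
            seq_lines += 1
            max_seq_lines = max(max_seq_lines, seq_lines)

    return {
        "records": records,
        "max_seq_lines": max_seq_lines,
        "carriage_returns": carriage_returns,
    }
```

For example, `describe_line_breaks(open("path/to/seqs.fasta", "rb"))` returns a small dictionary. A `max_seq_lines` greater than 1 means the sequences are genuinely wrapped in the file rather than just displayed that way, and `carriage_returns` being true suggests Windows-style or old Mac-style line endings.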

Knew it had to be something simple! It looks like my messing around with awk was not doing the trick, but seqtk fixed the line breaks, so now I’m rolling. Thanks a lot!
