File is not a(n) DNAFASTAFormat file

borgesrodrigo · March 22, 2018, 12:47pm

Hi,

I am trying to import the human genome file into a qiime2 artifact and I am getting this error:

"is not a(n) DNAFASTAFormat file".

I have already converted lower case to upper case, and I don't know what else it could be. The genome is hg18.

The chromosome names look fine to me:

ChrM
Chr1
Chr2
Chr3
Chr4
Chr5
Chr6
Chr7
Chr8
Chr9
Chr10
Chr11
Chr12
Chr13
Chr14
Chr15
Chr16
Chr17
Chr18
Chr19
Chr20
Chr21
Chr22
ChrX
ChrY

Any help would be appreciated.

Rodrigo

borgesrodrigo · March 22, 2018, 3:33pm

I forgot to mention that I am using v2018.2

Thanks!

ebolyen · March 22, 2018, 9:17pm

Hi @borgesrodrigo,

Would you be able to post a sample of your file?

Also, what kinds of analysis are you looking to do with human genomes? You might be better off starting with a feature-table or similar, as the amplicon-processing steps may not apply here (depending on what your data is).

borgesrodrigo · March 22, 2018, 11:46pm

Hi @ebolyen,

I want to filter out sequences that match the human genome. I am trying to use this: exclude-seqs: Exclude sequences by alignment (QIIME 2 2018.2.0 documentation)

ChrM
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT
TTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTAACACA
TTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATACAACCCCC
GCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAAC

I know that there are N's too, and it is all upper case.

thank you

colinbrislawn · March 23, 2018, 6:05pm

Hello Rodrigo,

Sounds like you are working with shotgun genomic data. (Amplicon data should not have any human hits because the 16S PCR primers only target bacteria.)

While filtering out human data may be possible inside Qiime 2, it might be easier / faster to filter your data with another program, then import the filtered data into Qiime 2. For example, you could use BBDuk2 to select just the 16S reads from your full data set, or you could use KneadData to filter out matches to the human host.

This would make a good tutorial or plugin for Qiime 2! Let me know what you find!

Colin

borgesrodrigo · March 24, 2018, 12:14pm

Hi @colinbrislawn, thank you.

Actually it is amplicon data, but we had this strange sequencing result (FastQC alarm for adaptors). I aligned the paired-end reads to the human genome and had some good alignments, so I wanted to filter those out.

We are still trying to figure it out (how could we amplify human reads?), but I wanted qiime2 to filter those seqs for now.

After your answer I used bash to filter my BAM files and recreate the fastqs without those reads (awk, samtools, bedtools), but if qiime2 could do it, it would be much easier.
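For anyone following along, the "recreate the fastqs without those reads" step can be sketched in plain Python. This is only an illustration under assumptions that are not from this thread: the FASTQ uses strict 4-line records, and the names of the reads that aligned to human have already been collected into a set (the helper name `filter_fastq` and the read names below are made up).

```python
def filter_fastq(lines, drop_ids):
    """Yield FASTQ records (as 4-line lists) whose read ID is not in drop_ids.

    Assumes strict 4-line records: @header, sequence, +, qualities.
    """
    record = []
    for raw in lines:
        record.append(raw.rstrip("\r\n"))
        if len(record) == 4:
            # Header looks like "@read_id optional description".
            fields = record[0][1:].split()
            read_id = fields[0] if fields else ""
            if read_id not in drop_ids:
                yield record
            record = []


# Hypothetical example: r1 aligned to the human genome, r2 did not.
fastq_lines = [
    "@r1 sample=A\n", "ACGT\n", "+\n", "IIII\n",
    "@r2 sample=A\n", "TTTT\n", "+\n", "IIII\n",
]
for rec in filter_fastq(fastq_lines, {"r1"}):
    print("\n".join(rec))
```

For paired-end data the same set of names would have to be applied to both the forward and reverse files so the pairs stay in sync, and read names carrying a /1 or /2 suffix would need that suffix handled first.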

Best,
Rodrigo.

Nicholas_Bokulich · March 24, 2018, 12:50pm

@borgesrodrigo,

You are absolutely correct, you can use exclude-seqs to filter out human DNA from your sequences as you describe. This would need to be done on fasta sequences after denoising/otu picking, though, not fastq data.

Human mitochondrial DNA sequences will amplify with 16S primers. Beyond that, I'm not sure why there would be human DNA, but I've seen stranger things in my day.

The issue with your fastas appears to be that the header lines are not formatted correctly, not the presence of Ns. Format like so:

>ChrM
ACTGACTGTGAC
>Chr1
ACTGATCGTAGC
>Chr2
ACAGTGCTGTGA

We already have one

I hope that helps!
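A quick way to check a file against that layout, without involving QIIME 2 at all, is a small standard-library scan. This is a rough sketch, not the validation QIIME 2 itself performs; the function name `check_fasta_layout` and the messages are invented for illustration, and whether blank lines are actually rejected on import is an assumption worth verifying.

```python
import io


def check_fasta_layout(handle):
    """Return a list of (line_number, message) for lines that look suspicious."""
    problems = []
    seen_header = False
    have_sequence = True  # no header is waiting for its sequence yet
    lineno = 0
    for lineno, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            problems.append((lineno, "blank line"))
            continue
        if line.startswith(">"):
            if seen_header and not have_sequence:
                problems.append((lineno, "previous header has no sequence"))
            if len(line) == 1:
                problems.append((lineno, "header has no identifier"))
            seen_header = True
            have_sequence = False
        elif not seen_header:
            problems.append((lineno, "text before the first '>' header"))
        else:
            have_sequence = True
    if seen_header and not have_sequence:
        problems.append((lineno, "last header has no sequence"))
    return problems


# In-memory examples; for a real file use: with open(path) as fh: check_fasta_layout(fh)
good = io.StringIO(">ChrM\nACTGACTGTGAC\n>Chr1\nACTGATCGTAGC\n")
bad = io.StringIO("ChrM\nACTGACTGTGAC\n>Chr1\nACTGATCGTAGC\n")
print(check_fasta_layout(good))
print(check_fasta_layout(bad))
```

An empty list means nothing suspicious was found at this level; it does not guarantee the import will succeed.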

borgesrodrigo · March 24, 2018, 1:17pm

Hi @Nicholas_Bokulich ,

I pasted the fasta header but the ">" was interpreted as markup.

My FASTA looks like this:

">ChrM
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA"

Nicholas_Bokulich · March 26, 2018, 3:38pm

@borgesrodrigo,
Perhaps line breaks are the issue? A fasta should still be a fasta if the sequence is split over multiple lines, but perhaps you could remove these line breaks (keep line breaks between the headers and sequences) to see if that fixes things...

Alternatively, line breaks or blank spaces at the start/end of the file would cause problems.
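If it helps, both suggestions (removing line breaks inside sequences and dropping stray blank lines or surrounding whitespace) can be tried with a few lines of plain Python. This is just a sketch under the assumption that the file is otherwise a normal FASTA; the name `unwrap_fasta` is made up for this example.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.

    Blank lines and leading/trailing whitespace are dropped along the way.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines, including any at the start or end
        if line.startswith(">"):
            if header is not None:
                yield header
                yield "".join(chunks)
            header = line
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header
        yield "".join(chunks)


wrapped = ["\n", ">ChrM\n", "GATCACAGGT\n", "CTATCACCCT\n", "\n", ">Chr1\n", "ACTG\n", "  \n"]
print("\n".join(unwrap_fasta(wrapped)))
```

Passing an open file object instead of the list works the same way. Note that each chromosome's sequence is held in memory while it is joined, which for a human genome means a few hundred megabytes for the largest chromosomes.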

If that doesn't work, perhaps you can just post this file here and we can attempt to diagnose on our end.

Thanks!

borgesrodrigo · March 26, 2018, 5:25pm

Ok! I will try that.

Can you point me to the proper python script that does this parsing?

I don't have a proper debugger in this server but maybe it can help me.

Thank you.

ebolyen · March 27, 2018, 4:17pm

Hi @borgesrodrigo,

QIIME 2 uses library code which it builds interfaces out of, so there aren't any scripts which you can use. But it wouldn't be hard to make one!

For example, this very quick and dirty script should give you some ideas; just replace the filepath with your own.

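#!/usr/bin/env python
import skbio.io

num = -1  # index of the last record read successfully; -1 means none yet
try:
    for num, seq in enumerate(skbio.io.read("path/to/your/file.fasta", format='fasta')):
        pass  # we're looking for broken records; if we made it here, this record is fine
except Exception:
    # the record right after the last good one is the one that failed to parse
    print("Failed after %d good record(s)" % (num + 1))
    raise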

(scikit-bio is already installed in your QIIME 2 environment)

borgesrodrigo · March 29, 2018, 12:10pm

Great! I will try it.

thank you