BLASTing PR2 in Qiime2

Dear All - I’m attempting to BLAST my 18S dada2 results against the PR2 database. This is my understanding so far:

I need to get the PR2 sequences (supplied as a fasta file) and the associated taxonomy (supplied as a text file) into Qiime2 as two separate qza files. These PR2 files are located here: https://github.com/vaulot/pr2database/releases. Using qiime tools import and specifying --type FeatureData[Sequence], I can import pr2_version_4.10.0_dada2.fasta into Qiime2 (that said, I’m not sure how to check it). I get an error message (“is not a(n) TSVTaxonomyFormat file”) when I attempt to import pr2_version_4.10.0_merged.tsv as a text file. I note that the merged file contains both the taxonomy and the sequences (hence, I assume, ‘merged’). The taxonomy file also has headers.

The advice here: “Use PR2 in Qiime” suggests that I assess the format of the SILVA text-taxonomy files and reformat the PR2 taxonomy file correspondingly. The response from ‘ygao1’ appears to import the same file twice, calling the first ‘seqeunce.qza’ and the second ‘ref-sequence.qza’. A link to a second location is given: https://forum.qiime2.org/search?q=pr2. However, this second location does not provide much more detail that I can see.

I note that the PR2 database is based on an 8-level taxonomy whereas SILVA is based on a 7-level taxonomy. I can (in R) reformat the PR2 taxonomy text file (with a column each for Kingdom-Species) to combine the taxonomy into a single string (e.g. D_0__Kingdom_D_1__Phylum…) as per SILVA. However, I need to know what the cross-referencing identifier is between the fasta ‘sequence’ and text ‘taxonomy’ files so that BLAST can make the link. This identifier would logically be the ‘pr2_accession’ number, but I’m not sure what the respective fasta and txt file structures should be to ensure this (if correct) works. Guidance and/or sources of information for formatting the fasta and text files to enable BLASTing would be much appreciated.

Many thanks.

Hi @TAW,
Sounds like you are on the right track!

Download the Greengenes reference files as an example of the expected format. You will need a sequence file in fasta format, e.g.:

>seq1
ACTGTGTCGA
>seq2
ACTGTCGTGT

and a taxonomy file in tab-delimited format. The taxonomy is semicolon delimited. The first column contains the IDs that match the fasta header ids:

seq1    Bacteria;blah;blah;blah
seq2    Archaea;blah;blah;blah

On the 8-level (PR2) versus 7-level (SILVA) taxonomy: that does not matter, as long as all taxonomies within your reference have the same number of levels (technically you can import with an uneven number of levels and some methods will work, but other methods may fail).
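If it helps, here is a rough Python sketch of how a merged table with one column per rank could be split into the two files described above. The column names (`pr2_accession`, `sequence`, and the eight rank columns) are assumptions and should be checked against the header of your own PR2 file. The `Unassigned` filler for empty ranks is just a placeholder I picked so that every taxonomy string ends up with the same number of levels. The taxon names in the example are made up.

```python
import csv
import io

# Assumed column names; check them against the header of your own PR2 file.
ID_COL = "pr2_accession"
SEQ_COL = "sequence"
RANK_COLS = ["kingdom", "supergroup", "division", "class",
             "order", "family", "genus", "species"]


def split_merged_table(handle, id_col=ID_COL, seq_col=SEQ_COL, rank_cols=RANK_COLS):
    """Return (fasta_text, taxonomy_text) built from a tab-delimited merged table."""
    fasta_lines = []
    taxonomy_lines = []
    for row in csv.DictReader(handle, delimiter="\t"):
        seq_id = row[id_col].strip()
        # Fill empty ranks so every taxonomy string has the same number of levels.
        ranks = [(row.get(col) or "").strip() or "Unassigned" for col in rank_cols]
        # FASTA record: header line with the ID, then the sequence.
        fasta_lines.append(f">{seq_id}\n{row[seq_col].strip()}")
        # Taxonomy record: ID, a tab, then the semicolon-delimited ranks (no header row).
        taxonomy_lines.append(f"{seq_id}\t{';'.join(ranks)}")
    return "\n".join(fasta_lines) + "\n", "\n".join(taxonomy_lines) + "\n"


# Tiny made-up table standing in for the real merged file.
example = io.StringIO(
    "pr2_accession\tkingdom\tsupergroup\tdivision\tclass\torder\tfamily\tgenus\tspecies\tsequence\n"
    "seq1\tEukaryota\tA\tB\tC\tD\tE\tF\tG\tACTGTGTCGA\n"
    "seq2\tEukaryota\tA\tB\tC\tD\tE\tF\t\tACTGTCGTGT\n"
)
fasta_text, taxonomy_text = split_merged_table(example)
print(fasta_text)
print(taxonomy_text)
```

For the real file you would pass `open("pr2_version_4.10.0_merged.tsv", newline="")` instead of the in-memory example and write the two strings to disk. Because the taxonomy file produced this way has no header row, importing it as `FeatureData[Taxonomy]` will probably need `--input-format HeaderlessTSVTaxonomyFormat`.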

As for the cross-referencing identifier: the fasta IDs == the sequence IDs listed in the first column of the taxonomy file.
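Before importing, a quick sanity check along these lines can confirm that the two ID sets agree. This is only a sketch: the function names are hypothetical, and it assumes the fasta ID is the text after `>` up to the first whitespace and that the taxonomy file is tab-delimited with the ID in the first column.

```python
def fasta_ids(fasta_text):
    """Collect IDs from FASTA header lines (text after '>' up to the first whitespace)."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def taxonomy_ids(taxonomy_text):
    """Collect IDs from the first tab-delimited column of each non-empty line."""
    return {line.split("\t", 1)[0] for line in taxonomy_text.splitlines() if line.strip()}


def compare_ids(fasta_text, taxonomy_text):
    """Return (IDs only in the fasta, IDs only in the taxonomy), each sorted."""
    in_fasta = fasta_ids(fasta_text)
    in_taxonomy = taxonomy_ids(taxonomy_text)
    return sorted(in_fasta - in_taxonomy), sorted(in_taxonomy - in_fasta)


# Made-up example: seq3 has a taxonomy line but no sequence.
fasta_example = ">seq1\nACTGTGTCGA\n>seq2\nACTGTCGTGT\n"
taxonomy_example = (
    "seq1\tBacteria;blah;blah;blah\n"
    "seq2\tArchaea;blah;blah;blah\n"
    "seq3\tArchaea;blah;blah;blah\n"
)
only_fasta, only_taxonomy = compare_ids(fasta_example, taxonomy_example)
print("IDs only in fasta:", only_fasta)
print("IDs only in taxonomy:", only_taxonomy)
```

Both lists should be empty for a reference that is ready to import. If the taxonomy file still has a header row, its first field will show up in the second list.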

I hope that helps!
