Hello, I've ran a blast classification with the PR2 database with the following commands:
qiime tools import
--type 'FeatureData[Sequence]'
--input-path /home/phil/Analysis/databases/pr2_version_5.0.0_SSU_taxo_long.fasta
--output-path pr2_blastdb.qza
qiime feature-classifier makeblastdb
--i-sequences pr2_blastdb.qza
--o-database pr2_blastdb_made.qza
qiime feature-classifier blast
--i-query bora_dada2_ccs_rep_nosing.qza
--i-blastdb pr2_blastdb_made.qza
--p-num-threads 16
--p-perc-identity 0.97
--p-maxaccepts 50
--o-search-results bora_ccs_blast_pr2.qza
I have tried multiple approaches to extract the matching reads from the reference pr2_version_5.0.0_SSU_taxo_long.fasta file, to no avail.
A few examples here are the first entries of column 2 in the BLAST 6 for reference (82,877 entries): KJ760837.1.1800_U|18S_rRNA|nucleus|clone_SGYW978|Eukaryota|Obazoa|Opisthokonta|Metazoa|Arthropoda|Crustacea|Maxillopoda|Maxillopoda_X|Maxillopoda_X_sp
KJ760835.1.1800_U|18S_rRNA|nucleus|clone_SGYW974|Eukaryota|Obazoa|Opisthokonta|Metazoa|Arthropoda|Crustacea|Maxillopoda|Maxillopoda_X|Maxillopoda_X_sp
KJ760810.1.1799_U|18S_rRNA|nucleus|clone_SGYW918|Eukaryota|Obazoa|Opisthokonta|Metazoa|Arthropoda|Crustacea|Maxillopoda|Maxillopoda_X|Maxillopoda_X_sp
I've attempted to extract column 2 and grep across the reference fasta, however, only around 14,000 matches are obtained (my understanding is that all 82,877 blast hits should match the database):
cut -f2 exported/pr2_blast/blast6.tsv | grep -c -f - /home/phil/Analysis/databases/pr2_version_5.0.0_SSU_taxo_long.fasta
The odd bit is that if I check which line is output first using
cut -f2 exported/pr2_blast/blast6.tsv | grep -m1 -n -f - /home/phil/Analysis/databases/pr2_version_5.0.0_SSU_taxo_long.fasta
The output is such:
3:>AB284159.1.1765_U|18S_rRNA|nucleus||Eukaryota|TSAR|Alveolata|Dinoflagellata|Dinophyceae|Peridiniales|Protoperidiniaceae|Protoperidinium|Protoperidinium_bipes
Which is not among the first few entries of the blast6 file as I listed in my examples. It gets even more odd since when I grep the first entry of the blast6 manually, there is a result.
grep 'KJ760837.1.1800_U|18S_rRNA|nucleus|clone_SGYW978|Eukaryota|Obazoa|Opisthokonta|Metazoa|Arthropoda|Crustacea|Maxillopoda|Maxillopoda_X|Maxillopoda_X_sp' /home/phil/Analysis/databases/pr2_version_5.0.0_SSU_taxo_long.fasta
KJ760837.1.1800_U|18S_rRNA|nucleus|clone_SGYW978|Eukaryota|Obazoa|Opisthokonta|Metazoa|Arthropoda|Crustacea|Maxillopoda|Maxillopoda_X|Maxillopoda_X_sp.
Am I missing anything when it comes to grep usage? I've tried listing the accession column into their own file to be used as grep input, with the same result.
I have also tried creating a python script:
import pandas as pd
import concurrent.futures
def process_entry(entry, text_lines, output_file):
for i, line in enumerate(text_lines):
if '>'+entry in line:
output_file.write(lines[i])
if i + 1 < len(lines):
output_file.write(lines[i + 1])
excel_file_path = '/home/phil/Analysis/Boracay/18SFL/exported/pr2_blast/blast6.tsv'
df = pd.read_csv(excel_file_path, delimiter='\t')
text_file_path = '/home/phil/Analysis/databases/pr2_version_5.0.0_SSU_taxo_long.fasta'
output_file_path = 'PR2_reference_seqs.fasta'
with open(output_file_path, 'w') as output_file:
with open(text_file_path, 'r') as text_file:
lines = text_file.readlines()
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
futures = [executor.submit(process_entry, entry, lines, output_file) for entry in df.iloc[:, 1]]
concurrent.futures.wait(futures)
For a more manual iteration and writing process, but many sequences had NUL, SOH, and other characters mixed in that I am not sure how to fix efficiently.
Is there a better way to extract the blast hits from the reference fasta? I am quite the beginner to all these and am enjoying the tools. Thank you!