Extracting matching reads from PR2 database fasta based on BLAST6 output

RielAlfonso · February 8, 2024, 2:25pm

Hello, I've ran a blast classification with the PR2 database with the following commands:

qiime tools import
--type 'FeatureData[Sequence]'
--input-path /home/phil/Analysis/databases/pr2_version_5.0.0_SSU_taxo_long.fasta
--output-path pr2_blastdb.qza

qiime feature-classifier makeblastdb
--i-sequences pr2_blastdb.qza
--o-database pr2_blastdb_made.qza

qiime feature-classifier blast
--i-query bora_dada2_ccs_rep_nosing.qza
--i-blastdb pr2_blastdb_made.qza
--p-num-threads 16
--p-perc-identity 0.97
--p-maxaccepts 50
--o-search-results bora_ccs_blast_pr2.qza

I have tried multiple approaches to extract the matching reads from the reference pr2_version_5.0.0_SSU_taxo_long.fasta file, to no avail.

I've attempted to extract column 2 and grep across the reference fasta, however, only around 14,000 matches are obtained (my understanding is that all 82,877 blast hits should match the database):

cut -f2 exported/pr2_blast/blast6.tsv | grep -c -f - /home/phil/Analysis/databases/pr2_version_5.0.0_SSU_taxo_long.fasta

The odd bit is that if I check which line is output first using

cut -f2 exported/pr2_blast/blast6.tsv | grep -m1 -n -f - /home/phil/Analysis/databases/pr2_version_5.0.0_SSU_taxo_long.fasta

The output is such:

3:>AB284159.1.1765_U|18S_rRNA|nucleus||Eukaryota|TSAR|Alveolata|Dinoflagellata|Dinophyceae|Peridiniales|Protoperidiniaceae|Protoperidinium|Protoperidinium_bipes

Which is not among the first few entries of the blast6 file as I listed in my examples. It gets even more odd since when I grep the first entry of the blast6 manually, there is a result.

KJ760837.1.1800_U|18S_rRNA|nucleus|clone_SGYW978|Eukaryota|Obazoa|Opisthokonta|Metazoa|Arthropoda|Crustacea|Maxillopoda|Maxillopoda_X|Maxillopoda_X_sp.

Am I missing anything when it comes to grep usage? I've tried listing the accession column into their own file to be used as grep input, with the same result.

I have also tried creating a python script:

import pandas as pd
import concurrent.futures

def process_entry(entry, text_lines, output_file):
for i, line in enumerate(text_lines):
if '>'+entry in line:
output_file.write(lines[i])
if i + 1 < len(lines):
output_file.write(lines[i + 1])

excel_file_path = '/home/phil/Analysis/Boracay/18SFL/exported/pr2_blast/blast6.tsv'
df = pd.read_csv(excel_file_path, delimiter='\t')

text_file_path = '/home/phil/Analysis/databases/pr2_version_5.0.0_SSU_taxo_long.fasta'
output_file_path = 'PR2_reference_seqs.fasta'

with open(output_file_path, 'w') as output_file:
with open(text_file_path, 'r') as text_file:
lines = text_file.readlines()
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
futures = [executor.submit(process_entry, entry, lines, output_file) for entry in df.iloc[:, 1]]
concurrent.futures.wait(futures)

For a more manual iteration and writing process, but many sequences had NUL, SOH, and other characters mixed in that I am not sure how to fix efficiently.

Is there a better way to extract the blast hits from the reference fasta? I am quite the beginner to all these and am enjoying the tools. Thank you!

Nicholas_Bokulich · February 12, 2024, 1:10pm

Hi @RielAlfonso ,

I suspect that there are only ~14k matches because the ~82k blast hits are to many of the same reference accessions. You could do something like

cut -f2 exported/pr2_blast/blast6.tsv  | sort | uniq | wc -l

to count the number of unique hits and confirm that this matches the number of lines retrieved by grep.

RielAlfonso · February 13, 2024, 8:42am

Hi @Nicholas_Bokulich,

Thanks for the response. That turned out to be exactly the reason why. I was originally expecting the grep search to still keep matches with duplicate hits (which I'd have to de-replicate later on).

system · March 15, 2024, 2:43pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.