Error downloading ITS sequences from NCBI

Dear Professor Robson, I have studied the tutorial you published about using RESCRIPt's extract-seq-segments to extract reference sequences without PCR primer pairs. However, the following error occurred when I downloaded nucleotide sequences from NCBI:
WARNING:2023-05-09 09:19:09,123:MainProcess:Using pdb|7UQZ| 6 as a sequence identifier,because it did not come down with an accession version

Will this situation affect my annotations? I ask because my fungal annotation results show that many ASVs can only be annotated to the kingdom level and cannot be assigned a more precise taxonomy.

Here is the original code I ran:
qiime rescript get-ncbi-data \
    --p-query "txid4751[ORGN] AND (ITS1 OR ITS2 OR its1 OR its2) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
    --p-ranks kingdom phylum class order family genus species \
    --p-rank-propagation \
    --p-n-jobs 60 \
    --o-sequences ITS-ref-seqs.qza \
    --o-taxonomy ITS-ref-tax.qza \
    --verbose

I sincerely hope to receive your help and thank you for responding to my letter amidst your busy schedule!

The identifier "pdb|7UQZ| 6" appears to be for an amino-acid sequence, not a nucleotide sequence. Are you sure you're using the correct file? Did you run rescript get-ncbi-data-protein by mistake?

What puzzles me is that the command I ran was for downloading nucleotide sequences, but the message refers to amino acids.

Here is the code I ran:

qiime rescript get-ncbi-data \
    --p-query "txid4751[ORGN] AND (ITS1 OR ITS2 OR its1 OR its2) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
    --p-ranks kingdom phylum class order family genus species \
    --p-rank-propagation \
    --p-n-jobs 60 \
    --o-sequences ITS-ref-seqs.qza \
    --o-taxonomy ITS-ref-tax.qza \
    --verbose

Hi @towns,

No need to paste the command again. :)

I ran your command myself and saw the same warning messages. However, the command did complete and the output files were generated. The message is just telling you that it is using something else as the sequence identifier, because no accession version came down with the record. It appears that all (most?) of the data are being downloaded.

I searched the output FASTA file for anything containing pdb| and then manually ran BLAST on these sequences. They all appear to contain ITS sequences.
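A search like the one described above could be sketched roughly as follows. This is only an illustration, not the exact command that was run: the helper name count_pdb_headers is hypothetical, and it assumes the sequences have already been exported from the .qza artifact to a plain FASTA file.

```shell
# Hypothetical helper (not part of QIIME 2 or RESCRIPt): report how many
# FASTA headers contain 'pdb|' compared with the total number of headers.
# Assumes the input is a plain-text FASTA file exported from the artifact.
count_pdb_headers() {
    fasta="$1"
    # Every FASTA record starts with a '>' header line.
    total=$(grep -c '^>' "$fasta")
    # In grep's basic regex syntax '|' is a literal character.
    flagged=$(grep -c '^>.*pdb|' "$fasta")
    echo "$flagged of $total headers contain pdb|"
}
```

For example, count_pdb_headers ITS-ref-seqs-export/dna-sequences.fasta would print the two counts for an exported file at that path, which gives a quick sense of how small the affected fraction is.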

I'll confer with my colleagues about the best way to obtain more informative identifiers. Again, these sequences appear to be okay; only the IDs look weird.

You should be okay to keep using these sequences as is. But in case you'd like to remove these oddly labeled sequences, see the commands I've outlined below. I'd personally keep them, as many may drop out anyway during quality control and extract-seq-segments in RESCRIPt. If you choose, you can remove anything with pdb| in the FASTA header. To do this, we'll need to make a file containing the IDs we wish to remove.

  1. First, let's export the FASTA file:
qiime tools export \
    --input-path ITS-ref-seqs.qza \
    --output-path ITS-ref-seqs-export
  2. Make a file that contains only the column header:
echo 'feature-id' > ids-to-remove.txt
  3. Search for unwanted pdb| identifiers in the exported FASTA file, remove the > from the identifier, and append the rest to the ids-to-remove.txt file:
grep 'pdb|' ITS-ref-seqs-export/dna-sequences.fasta | sed 's/^>//' >> ids-to-remove.txt
  4. Remove the unwanted sequences:
qiime feature-table filter-seqs \
    --i-data ITS-ref-seqs.qza \
    --m-metadata-file ids-to-remove.txt \
    --p-exclude-ids \
    --o-filtered-data ITS-ref-seqs-filt.qza
  5. Remove the unwanted taxonomy:
qiime rescript filter-taxa \
    --i-taxonomy ITS-ref-tax.qza \
    --m-ids-to-keep-file ITS-ref-seqs-filt.qza \
    --o-filtered-taxonomy ITS-ref-tax-filt.qza
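
Before running the filtering commands, it may be worth checking the ID file built in the steps above. The following is a minimal sketch under stated assumptions: check_ids_file is a hypothetical helper, not a QIIME 2 command, and it only confirms that the file starts with the feature-id header and reports how many IDs are listed and how many are duplicated (QIIME 2 metadata files require unique IDs).

```shell
# Hypothetical sanity check (not a QIIME 2 command) for the ID file
# created with echo and grep in the steps above.
check_ids_file() {
    ids_file="$1"
    # The first line must be the column header written by the echo step.
    if [ "$(head -n 1 "$ids_file")" != "feature-id" ]; then
        echo "missing feature-id header" >&2
        return 1
    fi
    # Every remaining non-empty line is one ID to remove.
    n_ids=$(tail -n +2 "$ids_file" | grep -c .)
    # Count IDs that occur more than once, since metadata IDs must be unique.
    n_dups=$(tail -n +2 "$ids_file" | sort | uniq -d | grep -c .)
    echo "$n_ids IDs listed, $n_dups duplicated"
}
```

Running check_ids_file ids-to-remove.txt prints the two counts; if the number of listed IDs is zero, there is nothing to filter and the original artifacts can be used directly.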

Now you should be able to build your classifier.


I appreciate your help and your patience, thanks!
