RDP Reference Database in QIIME2 format

Nicholas_Bokulich · June 3, 2020, 4:06pm

They do not look numeric, so assuming all IDs follow that format, that is not the issue.

No, sounds like it worked! You have a related (but new) error, which means that the Windows-style line breaks were the issue before.

That's right. This error is specifically saying that those IDs are in the sequences but not the taxonomy.

Probably because those IDs are no longer included in the top matches: they may be excluded by extract-reads (e.g., if they are too short or too long), or trimming off the rest of the sequence may alter the k-mer profile (which vsearch uses to queue up the first N hits for alignment).

This command might help you find all IDs that appear in only one of the two files (no guarantees this will work, I'm sort of ad-libbing based on the file snippets you shared above):

grep '^>' rep_set_99_rdp.fa | tr -d '>' | cat - rdp_qiime_taxonomy.txt | grep -v ';' | sort | uniq -u
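
If that one-liner returns nothing useful, here is an alternative sketch that compares only the ID columns. It rests on assumptions I can't verify from here: that the FASTA ID is the first whitespace-delimited token after '>', and that the taxonomy file is tab-separated with the ID in the first column. The function name ids_in_one_file_only is just something I made up for illustration. It also strips any leftover carriage returns, since Windows-style line breaks came up earlier.

```shell
# Hypothetical helper: print IDs found in only one of the two files.
# Assumes FASTA IDs are the first token after '>' and that the
# taxonomy file is tab-separated with the ID in column 1.
ids_in_one_file_only() {
  fasta="$1"
  taxonomy="$2"
  tmp_fasta=$(mktemp)
  tmp_tax=$(mktemp)

  # IDs from FASTA headers: drop carriage returns, strip '>', keep first token
  tr -d '\r' < "$fasta" | grep '^>' | sed 's/^>//' | awk '{print $1}' \
    | LC_ALL=C sort -u > "$tmp_fasta"

  # IDs from the taxonomy file: first tab-separated column
  tr -d '\r' < "$taxonomy" | cut -f1 | LC_ALL=C sort -u > "$tmp_tax"

  # comm -3 hides lines common to both sorted inputs:
  # unindented lines are FASTA-only, tab-indented lines are taxonomy-only
  LC_ALL=C comm -3 "$tmp_fasta" "$tmp_tax"

  rm -f "$tmp_fasta" "$tmp_tax"
}
```

You would call it as ids_in_one_file_only rep_set_99_rdp.fa rdp_qiime_taxonomy.txt. Unindented IDs in the output are in the sequences only; tab-indented IDs are in the taxonomy only. If the taxonomy file has a header line, its first field will show up as a taxonomy-only "ID" and can be ignored.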

Then you can confirm that those IDs are really missing from one file but not the other like this (you only need to run this once or twice, just to make sure the command above worked):

id='paste-the-id-here'
for f in rep_set_99_rdp.fa rdp_qiime_taxonomy.txt
do
  # print the number of lines containing the ID (-F: literal match, -c: count)
  grep -cF -- "$id" "$f"
done

Want to give that a spin and see how many IDs are missing?