I am using QIIME 2 for taxonomic classification of archaeal 16S V4-V5 amplicons.
When I used SILVA 138 (including bacteria and archaea) as the reference database, most reads were classified as bacteria, leaving fewer than 2,000 archaeal reads in some samples.
When I used SILVA 138 (archaea only) as the reference database, many reads previously classified as bacteria were classified as archaea, and every sample had more than 10,000 archaeal reads.
I'm confused: which one is the right result?
Hi @mol,
Could you please clarify: do you only expect archaea? If so, why?
Probably the bacteria + archaea database, but it depends a bit on your data and expectations. Let me explain:
Unless you do not expect any bacteria (e.g., you are analyzing known cultures that should only contain Archaea), you should use the more inclusive database (bacteria + archaea) to avoid misclassifications.
Given the information I have so far, your results seem to suggest that most of your reads are in fact bacterial... but please clarify if there is some reason that you would not expect this.
The Archaea-only database could be misclassifying sequences as Archaea simply because it does not have appropriate outgroups for classifying non-archaeal sequences. Filtering a database to contain only the sequences you are "interested" in can be a dangerous thing indeed, and should only be done under specific circumstances, as this paper describes:
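To illustrate the outgroup problem, here is a minimal Python sketch. It is a deliberately simplified nearest-match toy based on shared k-mers, not the classifier QIIME 2 actually uses, and the sequences and the `best_match` helper are made up for illustration. Real classifiers also apply a confidence threshold, which this toy ignores.

```python
def kmers(seq, k=4):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_match(query, reference, k=4):
    """Return (label, score) for the reference sharing the most k-mers with the query.

    The score is the fraction of the query's k-mers found in that reference.
    """
    q = kmers(query, k)
    scores = {
        label: len(q & kmers(seq, k)) / len(q)
        for label, seq in reference.items()
    }
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical toy sequences, NOT real 16S data.
full_reference = {
    "Bacteria": "ACGTACGTTTGACCGATAGGCTA",
    "Archaea": "TTGACCGATCCCGGGAAATTTCC",
}
# Same database with the bacterial "outgroup" removed.
archaea_only = {"Archaea": full_reference["Archaea"]}

# A read that is nearly identical to the bacterial reference.
query = "ACGTACGTTTGACCGATAGGCTT"

print("full database:", best_match(query, full_reference))
print("archaea only: ", best_match(query, archaea_only))
```

With the full toy database the read lands on Bacteria with a high score. With the Archaea-only database the same read can only land on Archaea, even though the match is much weaker, because nothing better is available to compete with it.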
So let me answer your title question:
How do I choose the database for the archaea amplicons annotation?
Use the SILVA 138 bacteria + archaea database so that it can classify everything in your sample, unless you know for a fact that only Archaea are present.
Thanks! I absolutely agree with you! But the sequencing company advised me to use the SILVA 138 archaea database because of the low proportion of reads classified as archaea when using the SILVA 138 bacteria + archaea database.

Given what you have told me so far, that sounds like bad advice. But maybe I am missing something. There are many reasons why you might observe fewer Archaeal reads than expected (primer bias, extraction bias, copy number imbalance?) but removing Bacteria from the database is not a good solution. I have not seen any documented evidence that slashing the SILVA database will improve Archaeal classification, and will be very interested to see such evidence if it exists.
The samples were freshwater. The primers are 524F10extF (TGYCAGCCGCCGCGGTAA) and Arch958RmodR (YCCGGCGTTGAVTCCAATT), which target the V4-V5 region. Maybe primer bias, sample properties, or something going wrong during sequencing was the reason? However, out of our 200 samples, only eight had low archaeal read counts.
I am not familiar off-hand with these primers, but much in the way that "bacteria-specific" primers can pick up Archaea, I imagine the reverse is true. It looks like these primers were originally designed for DGGE, and the databases were a lot sparser back then... so it might be worth using qiime feature-classifier extract-reads to see how many bacteria in the SILVA 138 database are hit by these primers.
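If you want a feel for what such an in-silico primer check does, here is a rough Python sketch. It is not a replacement for extract-reads: it only finds exact matches to the degenerate primers (the real tool tolerates some mismatches), it assumes the reference sequences are in the same orientation as the forward primer, and the `amplicon` and `count_hits` helpers and the toy records are hypothetical. Only the two primer sequences come from this thread.

```python
import re
from collections import Counter

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

FWD_PRIMER = "TGYCAGCCGCCGCGGTAA"   # 524F10extF
REV_PRIMER = "YCCGGCGTTGAVTCCAATT"  # Arch958RmodR


def primer_regex(primer):
    """Compile a degenerate primer into a regex (exact matching only)."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC degenerate codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def amplicon(seq, fwd, rev):
    """Return the region between the primers, or None if either is missing."""
    seq = seq.upper()
    f = primer_regex(fwd).search(seq)
    if f is None:
        return None
    # The reverse primer binds the opposite strand, so look for its
    # reverse complement downstream of the forward primer.
    r = primer_regex(reverse_complement(rev)).search(seq, f.end())
    if r is None:
        return None
    return seq[f.end():r.start()]


def count_hits(records, fwd, rev):
    """Count, per domain, how many (domain, sequence) records both primers hit."""
    hits = Counter(
        domain for domain, seq in records if amplicon(seq, fwd, rev) is not None
    )
    return dict(hits)


# Hypothetical toy records, NOT real SILVA sequences.
toy_hit = "GG" + "TGTCAGCCGCCGCGGTAA" + "ACGTACGT" + "AATTGGACTCAACGCCGGA" + "CC"
records = [("Bacteria", toy_hit), ("Archaea", toy_hit), ("Bacteria", "ACGTACGT")]
print(count_hits(records, FWD_PRIMER, REV_PRIMER))
```

The same matching logic could be pointed at a handful of your raw reads to see whether the primers are still attached.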
Index hopping, contamination, etc, could be other issues. You could also look for these primers in the sequences if they are not already trimmed.
Sounds like a problem specific to those 8 samples... I recommend against using an Archaea-only database if only those 8 samples are problematic.
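To pin down exactly which samples are affected, something along these lines could tally archaeal reads per sample. This is only a sketch with made-up data: it assumes your taxonomy strings start with "d__Archaea" (as in the QIIME 2 formatted SILVA 138 taxonomy), uses the 2,000-read figure from your first post as the cutoff, and the function names and dictionary layout are hypothetical rather than anything QIIME 2 exports directly.

```python
def archaeal_reads_per_sample(counts, taxonomy):
    """Sum the reads assigned to Archaea in each sample.

    counts:   {sample_id: {feature_id: read_count}}
    taxonomy: {feature_id: taxonomy string such as "d__Archaea; p__..."}
    """
    totals = {}
    for sample, features in counts.items():
        totals[sample] = sum(
            n for feature_id, n in features.items()
            # Assumption: archaeal assignments begin with "d__Archaea".
            if taxonomy.get(feature_id, "").startswith("d__Archaea")
        )
    return totals


def flag_low(totals, min_reads=2000):
    """Return the sample IDs whose archaeal read count is below min_reads."""
    return sorted(s for s, n in totals.items() if n < min_reads)


# Hypothetical toy data, NOT results from this study.
taxonomy = {
    "f1": "d__Archaea; p__Crenarchaeota",
    "f2": "d__Bacteria; p__Proteobacteria",
    "f3": "d__Archaea; p__Euryarchaeota",
}
counts = {
    "sampleA": {"f1": 9000, "f2": 500, "f3": 3000},
    "sampleB": {"f1": 800, "f2": 15000, "f3": 400},
}

totals = archaeal_reads_per_sample(counts, taxonomy)
print(totals)
print(flag_low(totals))
```

Comparing the flagged list against run metadata (plate, lane, extraction batch) might show whether those 8 samples share something in common.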
Thanks for your reply. I will resequence these 8 samples, and use the database containing bacteria and archaea.