How to create a dereplicated sequence reference database for taxonomy classification: case of COI

@colinbrislawn you raise some very good points.

The consensus portion of classify-consensus-vsearch is intended to prevent this type of overclassification by finding consensus taxonomies across sequences. How this parameter influences overclassification (and underclassification) is covered in this paper. We only benchmarked 16S and ITS data there, but I would expect similar trends for other marker genes. To state it briefly: setting maxaccepts=1 will almost certainly overclassify, but higher settings of maxaccepts and min-consensus will reduce this risk (and instead tend toward underclassification).
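To make the trade-off concrete, here is a minimal sketch of rank-by-rank consensus over the top hits. It is a toy illustration, not the actual classify-consensus-vsearch code: the function name `consensus_taxonomy`, the 0.51 threshold used here, and the three hits (including the genus Harpalus) are made up for the example.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage shared by at least min_consensus of the hits.

    hit_taxonomies: semicolon-delimited lineages of the accepted hits
    (the number of hits is what maxaccepts would cap).
    """
    lineages = [t.split(";") for t in hit_taxonomies]
    if not lineages:
        return ""
    consensus = []
    for level in range(max(len(lin) for lin in lineages)):
        # Count whole prefixes so agreement at a rank implies agreement above it.
        prefixes = Counter(
            tuple(lin[: level + 1]) for lin in lineages if len(lin) > level
        )
        prefix, count = prefixes.most_common(1)[0]
        if count / len(lineages) < min_consensus:
            break  # not enough support: stop at the previous rank
        consensus = list(prefix)
    return ";".join(consensus)


# Hypothetical hits for one query, best hit first.
hits = [
    "k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Pterostichus",
    "k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus",
    "k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Harpalus",
]

top1 = consensus_taxonomy(hits[:1])  # behaves like maxaccepts=1
top3 = consensus_taxonomy(hits)      # behaves like maxaccepts=3
print(top1)
print(top3)
```

With a single accepted hit, the query inherits that hit's genus no matter how ambiguous the match is. With three hits that disagree at genus, the assignment stops at family, which is the safer (and potentially underclassified) result.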

If anything, I think that @devonorourke is at risk of underclassification, since he is setting very high maxaccepts values. It also sounds like he is actually benchmarking these results using a mock community, so he should be able to answer this question for us. :smile:

I agree with Robert: overclassification is a big and overlooked issue. We optimized the methods in QIIME 2 to minimize overclassification, but these will need to be re-tested for COI and other marker genes.

@devonorourke's problem is a little different, though. As I understand it, he is still at the stage of database construction, which is a formidable but non-novel issue, hence my suggestion to follow in the footsteps of the SILVA/Greengenes developers. Issues of overclassification will come up later, when he reaches the classification step.

The general solution

SILVA taxonomies are formatted so that sequence replicates (identical sequences) with divergent taxonomies are assigned either the consensus or the majority taxonomy. Greengenes uses consensus (I believe). So these records:

>seq1;k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__;g__;s__
AATTCCGG
>seq2;k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
AATTCCGG
>seq2;k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
AATTCCGG
>seq3;k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Pterostichus;s__Pterostichus tristis
TAGTAGTA
>seq4;k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus
TAGTAGTA
>seq4;k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus
TAGTAGTA

Would become:
majority:

AATTCCGG   k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__Oecophoridae;g__Chezala;s__
TAGTAGTA   k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus

consensus:

AATTCCGG   k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera;f__;g__;s__
TAGTAGTA   k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera;f__Carabidae;g__;s__
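The two strategies can be sketched in a few lines of Python. This is only an illustration of the idea on the toy records above, not how SILVA or Greengenes actually build their releases. The helper names (`majority`, `consensus`, `dereplicate`) are hypothetical, and the sketch assumes every lineage has the same number of ranks, each with a three-character prefix such as `f__`.

```python
from collections import Counter, defaultdict

LEPIDOPTERA = "k__Animalia;p__Arthropoda;c__Insecta;o__Lepidoptera"
COLEOPTERA = "k__Animalia;p__Arthropoda;c__Insecta;o__Coleoptera"

# (taxonomy, sequence) pairs mirroring the toy records in this post.
RECORDS = [
    (LEPIDOPTERA + ";f__;g__;s__", "AATTCCGG"),
    (LEPIDOPTERA + ";f__Oecophoridae;g__Chezala;s__", "AATTCCGG"),
    (LEPIDOPTERA + ";f__Oecophoridae;g__Chezala;s__", "AATTCCGG"),
    (COLEOPTERA + ";f__Carabidae;g__Pterostichus;s__Pterostichus tristis", "TAGTAGTA"),
    (COLEOPTERA + ";f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus", "TAGTAGTA"),
    (COLEOPTERA + ";f__Carabidae;g__Heteropaussus;s__ Heteropaussus hastatus", "TAGTAGTA"),
]


def majority(taxonomies):
    """Most frequent full lineage; ties go to the one seen first."""
    return Counter(taxonomies).most_common(1)[0][0]


def consensus(taxonomies):
    """Keep each rank only while all replicates agree; blank it afterwards."""
    out, agreed = [], True
    for labels in zip(*(t.split(";") for t in taxonomies)):
        agreed = agreed and len(set(labels)) == 1
        # labels[0][:3] is the bare rank prefix, e.g. "g__".
        out.append(labels[0] if agreed else labels[0][:3])
    return ";".join(out)


def dereplicate(records, resolve):
    """Group lineages by identical sequence and resolve each group to one."""
    by_seq = defaultdict(list)
    for taxonomy, seq in records:
        by_seq[seq].append(taxonomy)
    return {seq: resolve(taxa) for seq, taxa in by_seq.items()}


majority_db = dereplicate(RECORDS, majority)
consensus_db = dereplicate(RECORDS, consensus)
for seq in majority_db:
    print(seq, majority_db[seq], consensus_db[seq], sep="\n  ")
```

Majority keeps more resolution but can propagate a mislabeled lineage if the mislabel happens to be the most common one. Consensus is more conservative and discards everything below the first rank where the replicates disagree.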