Wrong 'taxonomy.qzv' file

William · August 30, 2018, 10:20am

I think the issue is an aberrantly short sequence being created during the extract-reads step.

There is only one taxonomy string that matches the one in question:
LAFJ01000960.5.938 D_0__Archaea;D_1__Nanoarchaeaeota;D_2__Nanoarchaeia;D_3__Nanoarchaeales;D_4__Nanoarchaeaceae;D_5__Nanoarchaeum;D_6__hot springs metagenome

And here is the 4 nucleotide sequence that is present for it in my extract-reads artifact file (the original full length read doesn't have any long stretches of N characters, so it appears to be an odd primer match location for the V3V4 primers):

LAFJ01000960.5.938
AAAG

I tried rerunning the qiime feature-classifier extract-reads command with a stricter setting of --p-identity 0.90 (default is 0.80). This resulted in the >LAFJ01000960.5.938 sequence not showing up in the artifact file (the total reads went from ~735K to ~724K). This may be a potential solution, can you try this approach and see if it resolves the taxonomy issues?

Perhaps there needs to be a minimum/maximum length setting (there's a read with 1878 BP, >AB302407.1.2962, in the .80 output artifact too) for extract-reads? An alternative approach I've taken to extracting reads between priming sites is to take the mode of the position where the primer matches in the (non-destructive!) alignment to slice out the region between the primer binding sites, and degap these reads to avoid issues with strange binding sites for poorly-matching primers. Of course, issues with the alignment itself could introduce other problems, so there may not be a perfect solution.