What sorts of samples are you analyzing? You could be amplifying non-target DNA, e.g., plant DNA. I used to run into this problem frequently when using similar primers to examine plant-associated microbial communities.
What taxonomy classification method did you use? Another possibility is that your reads are in the wrong orientation, which will confuse classify-sklearn. Try the BLAST- or vsearch-based classifiers in QIIME 2. If you get the same result, this is probably non-target amplification.
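As a rough sketch of that check, the vsearch-based classifier can be run along these lines. The wrapper function name and all file names are placeholders rather than anything from this thread, and `--output-dir` must point to a directory that does not exist yet. `--p-strand both` asks vsearch to search both orientations, which is what makes this a useful test for mis-oriented reads.

```shell
# Sketch only: every path passed in is a placeholder for your own artifacts.
# Usage: classify_vsearch <query-seqs.qza> <ref-seqs.qza> <ref-taxonomy.qza> <output-dir>
classify_vsearch() {
  qiime feature-classifier classify-consensus-vsearch \
    --i-query "$1" \
    --i-reference-reads "$2" \
    --i-reference-taxonomy "$3" \
    --p-strand both \
    --output-dir "$4"
}

# Example call (hypothetical file names):
# classify_vsearch rep-seqs.qza unite-ref-seqs.qza unite-ref-taxonomy.qza vsearch-taxonomy
```

If the taxonomy from this run looks as unassigned as the classify-sklearn result, orientation is unlikely to be the cause.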
The ITSxpress results you see could indicate either of these problems, but maybe @Adam_Rivers has some more ideas?
ITSxpress is most likely not merging most of your reads because it has pretty high quality thresholds, but it is a bit hard to tell from the information I can see in the post.
I’d second the suggestion to BLAST a subset of reads to get a better sense of what’s happening. Plant contamination seems like the most likely culprit.
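One way to pull out a small subset for a manual BLAST search is to export the representative sequences and take the first few FASTA records. This is only a sketch: `fasta_head` is a hypothetical helper, and the paths in the example are placeholders.

```shell
# Print the first N records of a FASTA file (default N=20).
# Handles sequences wrapped over several lines.
# Usage: fasta_head <file.fasta> [N]
fasta_head() {
  awk -v n="${2:-20}" '/^>/ { count++ } count > n { exit } { print }' "$1"
}

# Example (hypothetical paths):
# qiime tools export --input-path rep-seqs.qza --output-path exported-rep-seqs
# fasta_head exported-rep-seqs/dna-sequences.fasta 20 > subset.fasta
```

The resulting `subset.fasta` can then be searched against a general nucleotide database to see whether the top hits are plant or fungal.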
I did try again with the eukaryotes UNITE database. Well, it is true that all those bastards are plant contamination. Thank you @Nicholas_Bokulich @Adam_Rivers
Now, can I have some advice on how to avoid this situation? How can I lower the chance of amplifying plant DNA? Is it all down to the library prep step?
You have already used the best method: choose primers that do not amplify plant DNA. ITS1F is supposed to do that, but obviously is not doing its job!
Library prep is where most of this should happen; e.g., you may be able to remove plant matter from your samples prior to DNA extraction, perhaps by rinsing leaves and then filtering.
When I have done plant-associated microbiome work I have just attempted to increase the sequencing depth (i.e., put fewer samples on a single sequencing run) so that I can afford to lose some of my sequences to non-target hits. In some samples I would lose 90% of my sequences! And some samples could not be recovered. But if you have enough non-plant sequences left over you can just proceed with the leftovers.
This may be analogous, but I have this same problem with low-biomass lung samples. To exclude reads from eukaryotic sources, you can do a quality filter step where you essentially blast/vsearch against a reference file (99_otus.txt) from your training set. Once you have generated the hits.qza and misses.qza files, you can filter ALL of the "misses" out of your table/sequences.
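A minimal sketch of that two-step workflow, assuming the reference has already been imported as a sequence artifact. The function name, the file names, and the 0.97 thresholds are illustrative placeholders to tune for your own data, not values from this thread.

```shell
# Sketch only: paths and thresholds are placeholders.
# Usage: drop_reference_misses <rep-seqs.qza> <reference-seqs.qza> <table.qza> <out-prefix>
drop_reference_misses() {
  # Split query sequences into those matching the reference (hits)
  # and those that do not (misses).
  qiime quality-control exclude-seqs \
    --i-query-sequences "$1" \
    --i-reference-sequences "$2" \
    --p-method vsearch \
    --p-perc-identity 0.97 \
    --p-perc-query-aligned 0.97 \
    --o-sequence-hits "${4}-hits.qza" \
    --o-sequence-misses "${4}-misses.qza" || return 1

  # Drop every feature listed among the misses from the feature table.
  qiime feature-table filter-features \
    --i-table "$3" \
    --m-metadata-file "${4}-misses.qza" \
    --p-exclude-ids \
    --o-filtered-table "${4}-table-hits-only.qza"
}

# Example call (hypothetical file names):
# drop_reference_misses rep-seqs.qza reference-seqs.qza table.qza filtered
```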
That method is great for miscellaneous non-target DNA, but I would actually discourage this for ITS data, just because your non-target plant hits are still ITS sequences and you would need to figure out a reasonable threshold of sequence similarity (i.e., how dissimilar plant ITS is from fungal ITS) to use the exclude-seqs method.
Instead, ITSxpress should do a good job of removing most plant reads. Anything that passes you can filter out after taxonomy classification, using qiime taxa filter-table as shown in this tutorial. Something like this: