Thank you for sharing the files @Nisha.
Based on the sample IDs, it appears you are trying to re-analyze data from this BioProject? After looking at the amplicon samples within the BioProject and the rep-seqs file you shared, I noticed that the V3 primers were not trimmed prior to denoising with DADA2.
That is, I am assuming these (or highly similar) primers (Huse et al. 2008) were used:
If so, run cutadapt trim-paired with the above primer sequences. I'd also make sure that the following options are enabled: --p-match-adapter-wildcards. You can search the forum for related posts that explain why, and in particular why the primers should be removed.
Also, it appears that you may have to alter --p-perc-identity and possibly --p-query-cov. I tried setting --p-perc-identity 0.95 and that seemed to work well, while discarding slightly more sequences. If you want to be a bit more restrictive and truly do not care about Eukaryotes, then you can consider downloading and importing one of the Greengenes reference databases from the Data resources page, and use that as the reference for
However, I'd try the following approach first...
I do find it very odd that ~15 samples appear to consist mostly of reads in mixed orientation. That is, they are in fact not Eukaryotes, but other microbial taxa. Again, they are being mistakenly classified as Eukaryotes by sklearn because the reads are not in the same orientation as the reference database. Perhaps the forward and reverse reads were swapped by accident upon upload to GenBank?
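Since the primers are still present in the raw reads, one rough way to confirm which samples are affected is to count how many reads in a forward file begin with the forward primer versus the reverse primer. The Python below is only a sketch of that idea and not part of QIIME 2: the function names, the one-mismatch tolerance, and the 1000-read sample size are my own assumptions, and you would pass in the actual primer sequences used in the study.

```python
import gzip
from itertools import islice

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def starts_with_primer(read, primer, max_mismatches=1):
    """Return True if the read begins with the (possibly degenerate) primer."""
    if len(read) < len(primer):
        return False
    mismatches = sum(
        base not in IUPAC[code]
        for base, code in zip(read.upper(), primer.upper())
    )
    return mismatches <= max_mismatches


def read_sequences(fastq_path, limit=1000):
    """Yield up to `limit` sequences from a plain or gzipped FASTQ file."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        # In a 4-line FASTQ record the sequence is the 2nd line.
        for line in islice(handle, 1, 4 * limit, 4):
            yield line.strip()


def primer_counts(fastq_path, fwd_primer, rev_primer, limit=1000):
    """Count sampled reads that start with each primer (or with neither)."""
    counts = {"forward": 0, "reverse": 0, "neither": 0}
    for seq in read_sequences(fastq_path, limit):
        if starts_with_primer(seq, fwd_primer):
            counts["forward"] += 1
        elif starts_with_primer(seq, rev_primer):
            counts["reverse"] += 1
        else:
            counts["neither"] += 1
    return counts
```

If a sample's forward file mostly starts with the reverse primer (and its reverse file with the forward primer), that sample is a likely candidate for swapped reads.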
One easy solution to this is to modify your manifest file by swapping the file paths for the forward and reverse reads for these samples. This should result in the reads being imported in the correct orientation. After importing in this way, you should be able to proceed with cutadapt, denoising, and then taxonomy assignment.
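If editing ~15 rows by hand is tedious, the swap can be scripted. The sketch below assumes your manifest is in the tab-separated PairedEndFastqManifestPhred33V2 layout, with the columns sample-id, forward-absolute-filepath, and reverse-absolute-filepath. If you are using the older comma-separated manifest, it would need adapting. The function name and the file names in the usage comment are placeholders of mine.

```python
import csv

FWD = "forward-absolute-filepath"
REV = "reverse-absolute-filepath"


def swap_read_paths(manifest_in, manifest_out, sample_ids):
    """Copy a paired-end manifest, swapping forward/reverse paths for the
    given sample IDs. Returns the IDs that were actually swapped."""
    sample_ids = set(sample_ids)
    swapped = []
    with open(manifest_in, newline="") as src, \
            open(manifest_out, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst,
            fieldnames=reader.fieldnames,
            delimiter="\t",
            lineterminator="\n",
        )
        writer.writeheader()
        for row in reader:
            if row["sample-id"] in sample_ids:
                # Exchange the two file paths for this sample only.
                row[FWD], row[REV] = row[REV], row[FWD]
                swapped.append(row["sample-id"])
            writer.writerow(row)
    return swapped


# Example usage (hypothetical file names and sample IDs):
# swapped = swap_read_paths("manifest.tsv", "manifest-swapped.tsv",
#                           ["sampleA", "sampleB"])
# print(swapped)
```

It is worth comparing the returned list against the sample IDs you intended to swap, so that a typo in an ID does not go unnoticed, and then importing with the new manifest.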
Give this a try and let us know how it goes!