I'm working with a 16S sequencing dataset from mouse faeces (MiSeq paired-end, 2x300 bp, V3-V4 region). I did the taxonomic assignment with silva-132-99-515-806-nb-classifier and found a lot of unassigned reads (more than 90% of all reads in some samples). I have attached a text file, table-taxa.txt.
I inspected a few of those reads and found that:
They are shorter than expected (about 200-300 nt, whereas 425 nt is the expected length after merging the forward and reverse reads).
They belong to mouse DNA, according to BLAST.
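For anyone wanting to repeat the length check on their own representative sequences, here is a minimal Python sketch. It assumes the sequences have been exported to plain FASTA text; the 400 nt cutoff, the function names and the toy records are all hypothetical and only illustrate the idea of separating out reads that are much shorter than the expected ~425 nt.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_len):
    """Split records into (kept, short) using a minimum length cutoff."""
    kept, short = [], []
    for header, seq in records:
        (kept if len(seq) >= min_len else short).append((header, seq))
    return kept, short


# Toy data standing in for exported representative sequences (hypothetical).
example = StringIO(
    ">asv1\n" + "A" * 425 + "\n"
    ">asv2\n" + "C" * 230 + "\n"
    ">asv3\n" + "G" * 410 + "\n"
)

# min_len=400 is an assumed cutoff, not a recommended value.
kept, short = split_by_length(read_fasta(example), min_len=400)
print("kept:", [h for h, _ in kept])
print("short:", [(h, len(s)) for h, s in short])
```

The short records are the ones worth running through BLAST first, since in this dataset they were the ones matching mouse DNA.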
I don't know how to handle this. I have two main questions:
How can I filter out those mouse sequences? I used DADA2 for quality filtering (--p-trunc-len-f 273 --p-trunc-len-r 220 --p-trim-left-f 19 --p-trim-left-r 22).
Is it normal that 90% of my sequences come from mouse DNA, or could this be due to an error during library preparation?
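As a quick sanity check on those DADA2 settings, the arithmetic can be sketched as follows. This assumes that each read keeps the bases between its trim-left and trunc-len positions and that the merged amplicon is the 425 nt mentioned above; the helper names are hypothetical. Whether the resulting overlap is enough depends on the minimum overlap your DADA2 version requires (the R package's mergePairs defaults to 12 nt as far as I know, but please verify for your installation).

```python
def retained_length(trunc_len, trim_left):
    """Bases kept per read after trimming the 5' end and truncating the 3' end."""
    return trunc_len - trim_left


def merge_overlap(fwd_len, rev_len, amplicon_len):
    """Overlap between forward and reverse reads for a given amplicon length.

    A value of zero or less means the reads cannot overlap at all.
    """
    return fwd_len + rev_len - amplicon_len


# Parameters quoted in the question.
fwd = retained_length(trunc_len=273, trim_left=19)
rev = retained_length(trunc_len=220, trim_left=22)

# 425 nt is the expected merged length stated in the question.
overlap = merge_overlap(fwd, rev, amplicon_len=425)

print(f"forward: {fwd} nt, reverse: {rev} nt, overlap: {overlap} nt")
```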
You can use qiime taxa filter-table to remove all unassigned features.
Alternatively, use qiime quality-control exclude-seqs to eliminate sequences that do not match your reference sequences within some % similarity.
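To make the first option concrete, here is a rough Python sketch of what a taxonomy-based filter does conceptually. It is not the QIIME 2 implementation: the dictionaries, function names and toy counts are hypothetical, and it assumes unclassified features carry the label "Unassigned" in the taxonomy. It also reports how much of the data survives, which is worth checking when 90% of reads are at stake.

```python
def drop_features_by_taxon(table, taxonomy, excluded=("Unassigned",)):
    """Return a copy of the table without features whose label matches `excluded`.

    table:    {feature_id: {sample_id: count}}
    taxonomy: {feature_id: taxonomy label}
    Matching is a case-insensitive substring test (an assumption of this sketch).
    """
    excluded_lower = tuple(term.lower() for term in excluded)
    filtered = {}
    for feature_id, counts in table.items():
        # Features missing from the taxonomy are treated as unassigned.
        label = taxonomy.get(feature_id, "Unassigned").lower()
        if any(term in label for term in excluded_lower):
            continue
        filtered[feature_id] = counts
    return filtered


def fraction_retained(before, after):
    """Fraction of total read counts that survive filtering."""
    total_before = sum(sum(c.values()) for c in before.values())
    total_after = sum(sum(c.values()) for c in after.values())
    return total_after / total_before if total_before else 0.0


# Toy feature table and taxonomy (hypothetical values).
table = {
    "asv1": {"s1": 50, "s2": 10},
    "asv2": {"s1": 900, "s2": 40},
    "asv3": {"s1": 50, "s2": 50},
}
taxonomy = {
    "asv1": "D_0__Bacteria;D_1__Firmicutes",
    "asv2": "Unassigned",
    "asv3": "D_0__Bacteria;D_1__Bacteroidetes",
}

filtered = drop_features_by_taxon(table, taxonomy)
frac = fraction_retained(table, filtered)
print(sorted(filtered), f"{frac:.1%} of reads retained")
```

If only a small fraction of reads is retained, some samples may end up too shallow for downstream analysis, so it is worth checking per-sample depth after filtering.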
90% seems unusually high. Mitochondrial DNA will be amplified by these primers, but 90% still seems extremely high for feces. Is NCBI BLAST saying these are mitochondrial or other sequences? If non-mitochondrial, you may want to chat with your sequencing center to figure out why so much non-target DNA is being included. This could be an error during library preparation.
I am not sure where this is coming from. 16S rRNA gene primers probably should not be amplifying chromosomal DNA from a mouse, but depending on which primers you are using I suppose it is possible. I wonder if cross-contamination could be at fault (e.g., from other experiments or RNA-seq preps involving mouse material?), but the first thing you should do is check your primers. Next, talk to your sequencing facility.