Dealing with unassigned and Kingdom-only assigned features

alfanon · May 29, 2018, 5:03pm

Hi @Mehrbod_Estaki,
No problem, I had not got it. Actually I have got quite intrigued by the issue you linked and I decided to apply the same quality control you suggested as solution in that topic. The reason I tried is the following one: after running dada2 on my data I removed singletons and, based on the taxonomic analysis, I removed as well mitochondrial and chloroplast sequences. After that, following the Filtering tutorial, I noticed in the taxonomy.qza that 167 of my representative sequences were either marked as Unassigned or identified only at the Kingdom level. So I started to think if to remove or not those, maybe they could be contaminants. Actually I blasted all those sequences, and, if many were Uncultured archaea/bacteria/prokaryote, some where Eukaryotes (plants or fungi), some Uncultured organism. So I decided to executed the following command, hoping that it would have removed out of those 167 sequences the Eukaryotes:
qiime quality-control exclude-seqs
--i-query-sequences rep-seqs_nosingletons_nomito_nochloro.qza
--i-reference-sequences Greengenes_13.8/99_otus.qza
--p-method vsearch --p-perc-identity 0.97
--p-perc-query-aligned 0.95 --p-threads 8
--o-sequence-hits hits.qza
--o-sequence-misses misses.qza
I would have expected to have a small number of sequences that don't hit the database, but actually with this approach 2600 out of 7800 sequences were binned as misses, so much more than I expected. Now I have got worried if I should remove or not all of those sequences. Did I do something wrong? Should I really exclude those sequences or simply go on from the point where I stopped, and maybe just remove the Unassigned sequences?
Thanks a lot in advance for any feedback on this
Cheers

Mehrbod_Estaki · May 29, 2018, 7:17pm

Hi @alfanon,

I moved this to its own thread as it was deviating away from the original topic.
Glad you found the link useful!
So there are a few things that may be happening and a few ways to approach this and they really depend on what you are asking from your data. A few questions first:

What percentage of the whole data do those 167 sequences that were Unassigned or at Kingdom only level make? Are they high in abundant across all samples or perhaps they are only appearing in your water only samples again?
Do the length of these reads compare to your expected target size or are they considerably smaller?
Do most of your other reads get taxonomy assignments at a reasonable level lets say at family or genus levels?

The Unassigned reads are probably safe to filter right off the bat as those are not providing you with any useful information, and in all likeliness are contaminants.If you have good taxonomy assignments with most of your other reads and a few are getting Kingdom only, they may be true hits that just don't happen to be known in the greengenes database. For those reads, I would check their length, if they are considerably shorter than what you expect, then I would filter them as well. If they are similar however, then blast a few of them to see their likely source and you can make a decision based on your interest. If your samples are relatively unexplored types then it might be interesting to keep them, if new taxa are not interesting given your experiment, then it might be easier to filter them as well.

alfanon · May 30, 2018, 8:14am

Hi @Mehrbod_Estaki, thanks a lot for helping on this. To answer your questions:

the 167 Unassinged or only at Kingdom level make up 2.1% of the total 7,844 rep seqs I ended up after removing singletons,mitochondrion and chloroplast (corresponding to 13,690 sequences on a total of 9,996,517). I checked and they are distributed more or less across all samples (roughly half of samples don't have such sequences but the other half yes), and they are concentrated in certain samples, which are both the water samples and also normal samples.
2)The length of these rep seqs is quite similar to the expected target size.
3)Most of the other rep seq are assigned to family or genus level.
4)This is the result of the blast of these Unassinged or only at Kingdom level sequences

Picture11598×1078 205 KB

I could filter some of them out based on the blast hits.
Actually what it concerns me the most is the result I have got with "qiime quality-control exclude-seqs". With that, 2600 rep seq were classified as "misses" and therefore should be filtered out. But that's 30% of my rep seqs. Also it is a much higher number of the 167 rep seqs that could be removed based on the taxonomy analysis. I don't understand if I should just think if to remove the Unassinged or only at Kingdom level sequences (based on the criteria you suggested) or if to remove all those 2600 misses. Why so many sequences that were assigned taxonomically are maked as misses?
Thank you for any advice on this.

Mehrbod_Estaki · May 30, 2018, 9:09pm

Hi @alfanon,

Ok this is all looking cool, and thanks for the thorough follow-up and explanations!
First, glad to hear this is only related to 2.1% of your reads, much more of an issue if this was dominating your reads.
The fact that they are spread across your samples and that they are the expected length might indicate that they are true features that are simply poorly defined in the reference database. You might get different results if you were to use a different database like Silva, but I personally don't think it's necessary, especially with blast not being able to give better classification. The Eukaryota hits are a mystery to me but I guess its possible if they share similar regions...its up to your experiment if you want to keep those or not.

I think your main concern of losing too many rep-seqs is related to your filter parameters. The 97% identity and 95% query alignment are pretty strict, in the thread that I discussed using those parameters it was more specific to the issue that I was having, where I was keen on removing all host contaminants. Since that doesn't seem to the be the issue here I would suggest using less stringent values. For example as was suggested on the other thread using a smaller database like gg_88_otus.fasta & reducing the percent identity to 85-90% should reduce the sequences failing to hit. Though it would be worth checking after to make sure you aren't allowing too many false positives. Let us know how it goes!

alfanon · June 1, 2018, 3:41pm

Hi @Mehrbod_Estaki thanks again for your feedback!
That's true, I used very stringent parameters. I tried again with gg_88_otus.fasta as database and reducing the percent identity to 85%, and the number of misses went from 2600 to only 300 hundreds. I ran the command after removing the Unassigned and those only assigned to kingdom level, so there was a pre-filter step. Anyway I blasted all the misses and most of them were bacterial hits or uncultured bacterial hits. Only few were Eukaryotes and so I decided to remove only those. So I think that even using less stringent parameteres as the ones I used in the second try, the quality control using vsearch is removing too many bacterial sequences. I ended up double checking with blast both the Unassigned, the sequences assigned only to Bacteria kingdom and misses from vsearch and I made my decisions based on that. I removed only the sequences that had blast hit to Eukaryota (they were all very solid hits) or no hits to the complete nt database. Several of these were assigned to Bacteria by qiime classifier. Overall I guess it depends on the sensitivity of vsearch vs blast: I did not know much about the differential sensitivity of the two, but it seems that blast is more sensible and can be used to double-check some filtering steps done within qiime, like the quality control based on vsearch or some assignments made by the classsifier of actual fungi/Eukaryota sequences to Bacteria.
Thanks again for helping on this.
Cheers
Niccolò

system · July 2, 2018, 9:41pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.