How to reduce/eliminate poor feature classification

emmzee · June 11, 2020, 7:18pm

I keep running into an issue with bad taxonomic assignment using a primer pair designed to target COI sequences in fish: a large number of sequences are getting classified as 'cattle' DNA. Based on the environment I sampled, I should not be getting any sequences with such taxonomy, which leads me to believe there is an issue with the processing of my sequences within qiime2 or in the laboratory. I can summarize the causes of improperly assigned taxonomy in two groups: 1) improper use of the classifier for the query sequences (Example) or 2) non-biological sequences (artifacts). However, I wonder whether primer design plays a part, or whether I ran my samples for too many PCR cycles (I used n = 35 cycles).

What can I do to improve the classification observed in my case? Are there any areas where I could improve sequence quality filtration?

Details:

I am running the latest version of QIIME in conda.
I first classified the sequences against a reference COI sequence database I designed for the individuals I am interested in (composed of fish sequences). I then filtered the sequences with 'Unassigned' taxonomy and BLASTed them against a larger COI database (> 1 million sequences).
I ran the following commands:

qiime tools import
--type 'SampleData[SequencesWithQuality]'
--input-path
--output-path
--input-format SingleEndFastqManifestPhred33

qiime cutadapt trim-single
--i-demultiplexed-sequences
--p-front primer sequence (5' forward)
--p-match-read-wildcards
--p-match-adapter-wildcards
--p-discard-untrimmed
--o-trimmed-sequences

qiime dada2 denoise-single
--i-demultiplexed-seqs
--p-trim-left 0
--p-trunc-len 234 (based on results above, maintaining median of 25 Phred score)
--o-table
--o-representative-sequences
--o-denoising-stats

qiime feature-classifier classify-consensus-blast
--i-query
--i-reference-reads
--i-reference-taxonomy
--p-perc-identity 0.95 (but also tried 0.99 and 1.00)
--o-classification

SoilRotifer · June 11, 2020, 9:25pm

Hi @emmzee,

I may need a few more details, but I hope we can help you with the sequence hits.

Is this an eDNA survey? I ask because I have seen this problem with this and other marker genes, such as the 12S rRNA gene. If you are sampling near or within a river or pond, I would not be surprised that you are hitting 'cattle' (potentially feral swine?) or some other ungulate. There are many reasons why cattle or other DNA may actually be there, in addition to the reasons you've already hit upon.

What fish primers did you use? Do you know how the primers were validated, i.e. were they checked for specificity against potential off-targets? I ask because many (older) primers were developed for the specific case of obtaining PCR product from tissues in the lab, and were not actually validated for eDNA use. Thus, there are many potential off-targets.

I assume you've manually spot-checked a few of these "off-target" classifications?

-Mike

emmzee · June 11, 2020, 11:04pm

Hi Mike (@SoilRotifer), thanks for your response!

Is this an eDNA survey? I ask because I have seen this problem with this and other marker genes, like the 12S rRNA gene etc… If you are sampling near or within a river or pond I would not be surprised that you are hitting ‘cattle’ (potentially feral swine?)

Indeed, this is an eDNA survey. However, it is very unlikely that the cattle (Bos taurus) DNA I found represents true biologically occurring sequences, given the nature of the area I sampled (the middle of a large lake). Some samples were completely dominated by cattle DNA, making me believe this observation is an artifact.

What fish-primers did you use? Do you know how the primers were validated, i.e. were they checked for specificity against potential off-targets? I ask because many (older) primers were developed for the specific case of obtaining PCR product from tissues in the lab, and not actually validated for eDNA use. Thus, many potential off-targets.

I know the primers were validated for specificity, and the conditions for using them in downstream laboratory protocols were checked, using NetPrimer and NCBI's Primer-BLAST and then quantitative real-time PCR. However, I'm not sure whether the primers were checked against potential off-targets, as this primer was developed by my colleague. I will try to check the primer against potential off-targets and provide more detail (although I will have to find out how to do that first...).

SoilRotifer · June 12, 2020, 12:27am

Thanks for all the info @emmzee!

I do not know where this pond is, but it is entirely possible that runoff and fecal pollutants made their way into the pond, either directly or via rain, etc. See Harper et al. 2019, Staley et al. 2018, and Thomsen & Willerslev 2015, as well as this funny article from some colleagues.

It is indeed disconcerting to see highly abundant hits to something that "should not be there". Is there a free range near there? Haha, likely not ehh?

Other than that, I agree it could be lab / other contamination as you pointed out. Did you have any positive and/or negative controls in your sequencing run? These may help to show whether there is reagent or other contamination.

it is very unlikely that the cattle ( Bos taurus ) DNA I found is representative of true biologically occurring sequences, given the nature of the area I sampled (middle of large lake). Some samples were completely dominated by cattle DNA, making me believe this observation is an artifact.

I feel your pain. We had intermittent detection issues for one of our diet surveys. Just the general issue of primer design right?

I know the primers were validated for specificity, and the conditions for using them in downstream laboratory protocols were checked, using NetPrimer and NCBI’s Primer-BLAST and then quantitative real-time PCR. However, I’m not sure whether the primers were checked against potential off-targets, as this primer was developed by my colleague. I will try to check the primer against potential off-targets and provide more detail (although I will have to find out how to do that first...).

You could try this tool from NCBI.

Another option is to keep only the sequences that have a high match to your reference database, either before or after taxonomic classification. Have you looked into the steps outlined here?

I often run something like this:

qiime quality-control exclude-seqs \
  --i-query-sequences query-seqs.qza \
  --i-reference-sequences reference-seqs.qza \
  --p-method blast \
  --p-perc-identity 0.9 \
  --p-perc-query-aligned 0.9 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza

Not sure what else to suggest at the moment. Though I am sure someone else from this large and awesome community will add their thoughts / suggestions.

-Mike

emmzee · June 12, 2020, 2:18am

Thanks again for the quick response, Mike! This is gonna be a long post with a lot of to-do's. I will try to update it as I go through your suggestions. Thanks for your patience and I appreciate your support!

However, I should mention I just realized there's a slight problem with my code. In the --p-front parameter, I used the primer sequence and included a sequencing adapter (adding an additional 12 nucleotides to the sequence). Do you believe this will have a significant effect? I'm going to re-run my sequences without it and edit this post. Edit: Re-running with primer sequences only (adapters excluded) had no effect on the outcome, although I am now using more stringent conditions for chimeras with dada2.

I do not know where this pond is, but it is entirely possible that there is runoff, and fecal pollutants that made their way into the pond either directly or via rain, etc… Harper et al. 2019, Staley et al. 2018, Thomsen & Willerslev 2015. This funny article from some colleagues.

I had a quick look at the papers, and in a way I'm happy I'm not the only one encountering this issue. This is making me consider many more factors for my next (microbiota) field work season...

Other than that, I agree it could be lab / other contamination as you pointed out. Did you have any positive and/or negative controls in your sequencing run? This may help to see if there is reagent or other contaminations.

I have not considered looking into the controls. I will check that and see if I can find anything. I also have colleagues who have sequences from the project (including diet surveys) that I could use to compare for the presence of cattle DNA.
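One way to look at the controls is to compute, per sample, what fraction of reads belongs to the suspicious features. The sketch below is only an illustration: the function name and the file names (table.tsv, flagged.txt) are placeholders, and it assumes the feature table has already been exported and converted to TSV.

```shell
#!/usr/bin/env bash
# Assumed inputs (names are placeholders):
#   flagged.txt - one feature ID per line (e.g. the features classified as cattle)
#   table.tsv   - feature table as TSV, e.g. from
#                 qiime tools export --input-path table.qza --output-path exported
#                 biom convert -i exported/feature-table.biom -o table.tsv --to-tsv

# For every sample, print: sample, flagged reads, total reads, flagged fraction.
# usage: flagged_fraction flagged.txt table.tsv
flagged_fraction() {
  awk -F'\t' '
    NR == FNR { flagged[$1] = 1; next }   # first file: flagged feature IDs
    /^#OTU ID/ { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
    /^#/ { next }                         # skip other comment lines
    {
      for (i = 2; i <= ncol; i++) {
        total[i] += $i
        if ($1 in flagged) hit[i] += $i
      }
    }
    END {
      for (i = 2; i <= ncol; i++) {
        frac = (total[i] > 0 ? hit[i] / total[i] : 0)
        printf "%s\t%d\t%d\t%.3f\n", name[i], hit[i], total[i], frac
      }
    }
  ' "$1" "$2"
}
```

If the flagged features also show up at a high fraction in the negative controls, that would point towards reagent or lab contamination rather than something in the water.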

You could try this tool from NCBI.

I believe I cannot use that tool as it does not support IUPAC nucleotides. I will look for alternatives. I will also try your other suggestion.
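A possible workaround for tools that reject IUPAC codes is to expand the degenerate primer into all of its plain A/C/G/T variants and check those individually. This is a minimal sketch, not a validated tool: the function name is made up, and the primer in the usage line is a placeholder rather than the primer used here.

```shell
#!/usr/bin/env bash
# Expand a degenerate (IUPAC) primer into every plain A/C/G/T variant.
# usage: expand_iupac PRIMER
expand_iupac() {
  awk -v primer="$1" 'BEGIN {
    # IUPAC code -> bases it stands for
    m["A"]="A"; m["C"]="C"; m["G"]="G"; m["T"]="T"
    m["R"]="AG"; m["Y"]="CT"; m["S"]="CG"; m["W"]="AT"
    m["K"]="GT"; m["M"]="AC"; m["B"]="CGT"; m["D"]="AGT"
    m["H"]="ACT"; m["V"]="ACG"; m["N"]="ACGT"
    n = 1; out[1] = ""
    primer = toupper(primer)
    for (i = 1; i <= length(primer); i++) {
      c = substr(primer, i, 1)
      if (!(c in m)) { print "unknown code: " c > "/dev/stderr"; exit 1 }
      bases = m[c]; k = 0
      # Extend every partial variant with every base this code allows
      for (j = 1; j <= n; j++)
        for (b = 1; b <= length(bases); b++)
          next_out[++k] = out[j] substr(bases, b, 1)
      n = k
      for (j = 1; j <= n; j++) out[j] = next_out[j]
    }
    for (j = 1; j <= n; j++) print out[j]
  }'
}

# Example with a made-up primer (not the one from this thread):
#   expand_iupac GGWACWGGWTGAACWGTWTAYCC > primer-variants.txt
```

The number of variants multiplies with every degenerate position, so a heavily degenerate primer can produce a long list.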

SoilRotifer · June 12, 2020, 12:42pm

Hi @emmzee,

One other thought, assuming it is not contamination or whatever, you may have to look into blocking primers to limit off-targets. We had some varying success using this approach with feral swine, i.e. blocking ~5-45% host COI. I'll go ahead and shamelessly plug this paper too, just so you can see how we approached some other odd findings.

Any-who, leveraging peptide nucleic acid (PNA) is another option. We used this approach for one plant-microbiome study I was involved with; we cut down the plant-host 16S rRNA gene sequences by 50%-80% and greatly increased our access to the microbiota.

Either of these approaches may help with blocking common off-targets. Most of the time people just construct a blocking primer / PNA to inhibit the most common read they get from their pilot sequencing runs... and hope it works.

Best of luck! Please do keep us posted. Hopefully others will qiime-in.

-Mike

dwt · June 12, 2020, 6:41pm

Have you looked at the sequences that are being assigned as cattle and confirmed that that is what they are?
If your primer is fish specific you should have a relatively small number of sequence variants? If so I would export them as a fasta file and run them in a web blast and see how the taxonomy compares to what you're getting.
If you're using a classifier to get the taxonomy you might also want to try using the classify consensus blast method and make sure that they are not very different, which might imply that the issue is with your classifier.
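A rough sketch of pulling out just the cattle-assigned sequences for a web BLAST, assuming the taxonomy and representative sequences have already been exported from QIIME 2. The function name, file names, and the "Bos taurus" search text are assumptions; the taxon label depends on how the reference taxonomy is written.

```shell
#!/usr/bin/env bash
# Assumed inputs (file names are placeholders), produced beforehand with e.g.:
#   qiime tools export --input-path taxonomy.qza --output-path exported
#   qiime tools export --input-path rep-seqs.qza --output-path exported
# which write exported/taxonomy.tsv and exported/dna-sequences.fasta.

# Print the FASTA records whose taxonomy string contains the given text.
# usage: seqs_for_taxon taxonomy.tsv dna-sequences.fasta "Bos taurus"
seqs_for_taxon() {
  awk -F'\t' -v pat="$3" '
    # First file: taxonomy table (Feature ID <tab> Taxon <tab> ...), skip header
    NR == FNR { if (FNR > 1 && index($2, pat)) keep[$1] = 1; next }
    # Second file: FASTA; each header line switches printing on or off
    /^>/ { id = substr($1, 2); sub(/[ \t].*/, "", id); show = (id in keep) }
    show { print }
  ' "$1" "$2"
}

# Example:
#   seqs_for_taxon exported/taxonomy.tsv exported/dna-sequences.fasta "Bos taurus" > cattle.fasta
```

The resulting FASTA file can be pasted into the NCBI web BLAST to compare its top hits with the taxonomy assigned in QIIME 2.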

emmzee · June 21, 2020, 11:22pm

Sigh. At a big cost (I dropped many, many reads), I fixed the problem. I first wanted to confirm whether this was a problem with my sequencing run alone, or with all lab personnel who followed the same protocol for this primer set. Borrowing sequencing data from a colleague, from gut contents and water from another environment, I found the same sequences occurring and BLASTing as cow DNA, often even outnumbering the host DNA (which the primer picks up as well). This helped me confirm that we might be over-amplifying the library, leading to chimeras and resulting in artifact DNA.

For future reference and for those who encounter the same problem, here's what I did:

Use chimera checking and removal with stricter parameters.
Following @SoilRotifer's advice, you can filter your reference database to exclude sequences that do not match your sequences, using qiime quality-control exclude-seqs, and then assign taxonomy using the filtered database. I used a percent identity of 95% and a percent query alignment of 80% (--p-perc-identity 0.95 and --p-perc-query-aligned 0.8).

Special thanks to @SoilRotifer for being so patient in helping and guiding me through this issue while I sort-of figured this out.

Have you looked at the sequences that are being assigned as cattle and confirmed that that is what they are?
If your primer is fish specific you should have a relatively small number of sequence variants? If so I would export them as a fasta file and run them in a web blast and see how the taxonomy compares to what you’re getting.
If you’re using a classifier to get the taxonomy you might also want to try using the classify consensus blast method and make sure that they are not very different, which might imply that the issue is with your classifier.

I always use BLAST. My experience with QIIME2's sklearn is good, but it was never clear to me how reads are assigned, so I default to using BLAST as I find it theoretically more straightforward.

SoilRotifer · June 21, 2020, 11:34pm

Hi @emmzee,

I'm super glad you were able to figure it out!
