Mock community taxonomy not accurate, could cutadapt step be causing this?

SarahM · July 9, 2020, 9:42pm

Hello, I am a new user of QIIME2 as well the forums, and I apologize if I am posting this in the wrong section.

I am analyzing fungal ITS1 data that were amplified with the BITS/B58S3 primers. I included a mock community from Bakker et. al 2018 (staggered B mixture). Sequences were acquired on an Illumina 250x2 run.

The fastq files were already demultiplexed and the primers/adapters already trimmed off. Before importing into qiime2 I ran the sequences through DADA2 pipeline mainly following the dada2 ITS processing tutorial which uses cutadapt to remove primers and read-throughs.

All sanity checks were cleared however when I assigned taxonomy my mock community looks nothing like expected, the most abundant taxa is not identified at all. I trained my own classifier using the most recent UNITE database release (dynamic developer files).

There are 16720 reads in the mock community and my negatives have around 50-100 reads so I do not think it failed to amplify. Also, the species in my mock do not appear in my negatives so it does not look like contamination (?).

The only thing that seemed odd in the analysis was that when I searched for primer hits before running cutadapt there were hits for only the: a) reverse complement of the forward reverse reads, and b) the reverse complement of the reverse forward reads. I thought this was due to read through.

However, while troubleshooting I ran a subset of my data through the same analysis and despite having only a fraction of the files, I got MORE primer hits before cutadapt.

Can anyone help me understand,
a) why my mock community results might so far from expected, and

b) Why I would see more hits for my primers in a smaller subset of data, and could this have anything to do with my mock community taxonomy issues?

I would be extremely grateful for any guidance.

taxa-bar-plots.qzv (484.4 KB) rep-seqs.qzv (284.4 KB)

llenzi · July 9, 2020, 10:10pm

Hi @SarahM,

I am wondering if the sequence orientations you are expecting is somehow wrong, and that may explain the primer matches you are seeing as well as why the sk-learn classify fails in your case!

What happen if you try to assign with blast (classify-consensus-blast: BLAST+ consensus taxonomy classifier — QIIME 2 2020.6.0 documentation), instead (or maybe just blast a couple of sequences to get the feeling)?

An alternative option it may be to try a close-reference approach (cluster-features-closed-reference: Closed-reference clustering of features. — QIIME 2 2020.6.0 documentation), to get the feeling of what is in your sample.

As an aside, you may want to look at the itsxpress plug in to trim off ITS flanking regions (Fungal ITSx and qiime2)!

Hope it helps!

Nicholas_Bokulich · July 9, 2020, 11:30pm

@SarahM,
Out of curiosity, what is the expected composition? The Bakker paper is behind a paywall so I cannot access.

Is it possible that your observed/expected species are anamorphs/teleomorphs of the same species? Or a case of outdated taxonomy/synonyms?

Mock communities almost never have 100% accuracy, especially for ITS (due to copy # heterogeneity, amplification bias, etc). But what you describe (totally different composition) sounds like a technical error.

Just to give my 2 cents in addition to @llenzi's great advice:

Always a good sanity check

This is essentially the same as using BLAST for a sanity check... so worth using if you think the issue is arising during denoising, but otherwise probably a redundant test.

Trimming to ITS (vs. using cutadapt to trim to primer region) in my experience does not impact accuracy to such a vast degree... certainly not to the same degree you seem to be seeing. q2-itxpress is certainly worth trying but I don't think that's the fault here.

Sounds counter-intuitive, but since this is outside QIIME 2 I can't really troubleshoot the issue...

No worries! Since it sounds like the methodology concerned is mostly in R, I have recategorized this as "other bioinformatics tools".

I hope that helps!

SarahM · July 10, 2020, 3:24pm

Thank you both for your input!

I very much hope it is a technical error and not on the wet-lab side

Here is the composition, I have no S. cerevisiae ID's after taxonomy assignment, so even with amplification bias I find that rather alarming. The relative abundances were determined by 16S gene copy number for each species.

I do not believe it could be outdated taxonomy because I trained a qiime2 classifier using the most recent dynamic developer files. I also made classifiers using the 99% threshold as well as trying both of these with the previous release of the UNITE database, none of that changed the results.

It seems it must be an issue with either the filterandtrim steps or the cutadapt steps but maybe there is something farther downstream in my qiime2 pipeline I have not considered

llenzi · July 10, 2020, 7:16pm

Hi @SarahM,

thanks for your reply! I'll try to give more ideas, some may be obvious questions to you but please keep in mind I am not an expert on ITS analysis!

Starting form the bench side (and I am not a bench guy!), the mock community was made by you (or anyone in your lab) or was provided externally? It would not been totally unusual to me to work with something received and it is not what I expect

But let assume all the species are in there.
Do you have the dada2 denoising statistics? I am more familiar with the q2-dada2 and its denoising stat, from which you can spot which is the limiting step for the process (quality filtering - merging - chimera detection). Are all looking good in your case? Are most of your sequences passing all these steps? Did you use the option to concatenate sequences which would not show any overlapping region?

Denoising R1 as single read (or R2, I am not sure which would more significative for this ITS region), instead of using both as paired-sequences, does it change the species profile?

Were you able to assign the taxonomy with a different algorithm, as blast/vsearch, as further sanity check?
Did you try to use an ad hoc database including only the species in the communities to see how many (if any) are out of target in the mock-sample (and maybe focus on them on what they are coming from?)

Given you are missing the (supposedly) highest abundant species, if you blast a subset (10%, more?) of the raw reads for the mock sample, can you see it S. cerevisiae in there?

I really hope answering these question may help you to solve your mystery!

Cheers

SarahM · July 10, 2020, 8:01pm

Hi Luca, Thank you for your response you raise some good points!

I received the mock community from the USDA lab that Bakker et al published his paper out of. Other researchers have acquired mock communities from the same place with good results (including the researchers who posted this DADA2 ITS processing tutorial DADA2 ITS Pipeline Workflow (1.8)). So I am hesitant to suspect the mock itself, safer to assume this is something I did wrong for now. Also, I am not sure how I would determine if it was a bad mock anyways. Any ideas are welcome.

good question! based on my understanding of the quality scores and filtering steps everything was okay, but this is my first time doing amplicon-based community profiling so I could be wrong. Here are the error rates and the quality plots for the mock community and another sample for comparison for fwd and rvs reads.

As far as the error rates go, I retained an average of 87% of my reads after dada2, which seemed reasonable to me? (tbh I have no idea). As far as how many sequences survived each step, here is the breakdown for the mock community.

Another good question! I am guessing you are referring to the merging paired-end reads step? This was ran with the default minimum overlap of 20bp. The average amplicon length for S. cerevsiae was reported by Bakker et al to be 480bp, I collected data on a 250x2 run which WOULD leave potentially less than 20bp overlap so perhaps you are onto something! I will re-analyze a subset of data concatenating. Although most of my fwd and rvs sequences are between 200-220 bases long, so I worry that MOST sequences have enough overlap to be merged correctly and I might loose some data quality, do you have any thoughts on this?

I was just working on that this morning, as a newer user things take me a lot longer than they should ...

Thank you so much for your thoughts and suggestions, I am extremely grateful!
-Sarah

llenzi · July 10, 2020, 8:38pm

Hi @SarahM,

If as you say the expected length for S cerevisiae's amplicon is 480 bp, you leave about 20 bp for the overlapping step (the suggested minimum is 12 bp as far I know).
Could you explain how do you preprocess the sequences? What is included in the sequence length between 200-220bp? Are these sequences after adapter/primer trimming? Or did you quality filter the reads before dada2?
My stronger suspect now is that the cerevisiae amplicons are not really able to merge after the initial quality trimming.

To confirm this, try to process R1 as single read, after which you should be able to find the S cerevisiae, I hope at least ...

Sorry if I do not comment the dada2 plots, mostly because would be outside of the scope of this forum, but also because I am not really a user of dada2 in R and I would not give to you inaccurate informations!

Cheers

SarahM · July 11, 2020, 12:20am

Hi Luca,

I tried merging with a minimum overlap of 10bps and it did not change the results at all. I also tried the 'just concatenate' but there was an error that prevented me from constructing taxa barplots.

The primers and adapters were trimmed from the ends by the sequencing facility we sent the samples to. After receiving demultiplexed files for each sample, I followed the DADA2 ITS processing tutorial. I filtered out sequences with ambiguous bases, then I ran the sequences through cutadapt to remove any 'read-throughs' into the opposite primer sequence. Then did the filter and trim step, learned error rates, dereplicated the reads, performed the sample inference step, merged the paired-end reads, and then removed chimeras.

If you have any more ideas please let me know!

llenzi · July 11, 2020, 10:40am

Hi @SarahM,

I'm sorry I did ask about the concatenation option to be sure you are not using it! I don't think the taxonomy can handle it at the moment!

If the facility trimmed the low-quality tails of the sequences (but from what you wrote it does not seem the case ...), you may ask for the raw sequences (after demultiplexing if that would be easier for you), and see if letting dada2 handle the low quality tails gives you a bit more freedom for the merging step.

For the mock community, you may try to skip the cutadapt step for the read-throughs, to see if make a difference, but I would not expect much impact on the cerevisiae given its amplicon size.

Other than that, I am back to my point of using R1 (or R2) as single read.

Cheers,
Luca

SarahM · July 12, 2020, 3:11pm

Hi Luca,

I tried this and the results were only able to be identified to the taxonomic level. I have no idea why.

Thank you for your help, let me know if you have any other ideas.

Nicholas_Bokulich · July 12, 2020, 8:14pm

@llenzi has made the best suggestion yet — perhaps I am missing it but I do not see above if you have tested this:

The issue you are having is most likely a merging issue with dada2. You are selectively losing Saccharomyces (and perhaps some other taxa with long ITS1 domains)... otherwise your taxonomic profiles are looking fine (i.e., the other most abundant expected species are observed at roughly the expected ratios in the mock community). It looks like you are only losing a few hundred reads during merging, but these could make all the difference (i.e., this could be your Sacch sequences):

This would be reduced further by your truncation settings, so it seems quite likely that Sacch seqs would fail to merge unless if you do not truncate your reads (in which case you lose more during the filtering step).

So I advise you do as @llenzi recommended a few times above: toss your reverse reads, and process your single-end reads alone. I know it feels like a waste, but it is a very common problem with ITS due to length variation...

llenzi · July 13, 2020, 8:11am

Hi @SarahM,

I tried this and the results were only able to be identified to the taxonomic level. I have no idea why.

I am not sure I understand, what taxonomic level?

Still, I can only reiterate and agree with @Nicholas_Bokulich.
It may be frustrating overall, but I think this is a very good use (if not THE use!) of a mock community! This is what they are for: guiding you in the choice of the best analysis for your data, and now you know that working with paired-end reads could pose a bias on the whole analysis (you may don't know how many species in your samples have long amplicons which you potentially loose working with paired-end data!).

To my eyes, think at your mock community as investing in a insurance which has not payed you off 100%, but is limiting the loss

Cheers,
Luca

SarahM · July 13, 2020, 8:41pm

Thank you both for the input.

Sorry, I meant to the phylum level.

I tried using only the forward reads and only the reverse reads and there were barely any changes in the composition of the mock community.

using only forward:
taxa-bar-plots.qzv (548.6 KB)
using only reverse:
taxa-bar-plots_R2.qzv (488.3 KB)

At what point would I conclude that the data is not useable?
I am not sure what else could be going wrong other than there are just not sequences for cerevisiae present in the data, and since I am expecting cerevisiae in my actual samples, this would mean I would have erroneous results for the entire study.

I forgot to mention that there are SOME cerevisiae ID’s in the species list, just not dominant in the mock community. So perhaps there was something wrong with the mock I received?

If you have any more ideas please let me know.

Nicholas_Bokulich · July 13, 2020, 9:57pm

Maybe you should check for presence of Saccharomyces sequences in the mock community raw FASTQ.

A QIIME 2 method that will allow you to do this is in q2-quality-control.

create a FASTA containing one or more S. cerevisiae ITS sequences
import to QIIME 2 as FeatureData[Sequence]
create a bowtie2 index
use filter-reads to screen the FASTQ for the mock community sample (not all samples since other Saccharomyces is clearly present).
use qiime vsearch dereplicate to dereplicate the surviving sequences.
use qiime feature-table tabulate-seqs to create a summary file of those sequences, allowing you to directly query NCBI BLAST (to see if these actually belong to S. cerevisiae).
If it is in fact S. cerevisiae, you should be able to calculate from the read counts whether this constitutes a substantial portion of the mock community, or if this corresponds to the S. paradoxus that you are seeing in the mock community now (which is closely related to S. cerevisiae so could be misclassification either by dada2 (seq error not corrected), QIIME 2 (taxonomic classification), or by whoever identified the isolate that was added to the mock community? Comparing that sequence to the original sequence would help figure out)

SarahM · July 25, 2020, 6:20pm

Hello all thank you for all your help with this.

With some help I was able to pull out the S. cerevisiae sequence from the data published by the creators of the mock community I BLASTed it and it was a perfect match to S. paradoxus which is closely related to S. cerevisiae.

In my own data, I had a longer sequence in my mock community that was also a perfect match to S. paradoxus when I BLASTed it.

So it would seem that when they created the classifier in the paper they only included species they were expecting to see and therefore that sequence was classified as S. cerevisiae, not S. paradoxus.

In conclusion, it seems like my data is okay after all I just need to keep in mind that species-level identification is not accurate? ¯_(ツ)_/¯

-Sarah

system · August 26, 2020, 12:20am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.