Help troubleshooting for defining best dada2 parameters

JoaoGabrielMoraes · August 31, 2020, 5:34pm

Hello,

I have 16S sequencing data which I am trying to decide which dada2 parameters to use. Our data was generated using 515F-806R primers and was paired-end sequenced (2x250 cycle).

I used the plugin cutadapt to trim off PCR primers (515F/806R) from raw sequences.

Please see attached the summary of the demultiplexed results. trimmed-paired-end-demux.qzv (315.0 KB)

Looking at the Interactive Quality Plot, the quality of the forward reads drop around ~80 sequence bases.

My pipeline was initially setup to truncate forward and reverse reads at 135, which was based on a different data that we have. Please see the attached results denoising-stats-dada2.qzv (1.2 MB)

Please note that when truncating forward and reverse reads at 135 only about 57% of the reads passed data2 filter.

When exploring the taxonomic composition of the samples, there were several microbes identified as much as at the genus/species level taxonomy.qzv (1.5 MB).

However, if I use only the forward reads and truncate at 135, 96% of the reads passed data2 filter. denoising-stats-dada2.qzv (1.2 MB) . However, when exploring the taxonomic composition our samples, the majority of the OTU are unassigned taxonomy.qzv (1.6 MB)

Can you please help me understand the reason why when using only forward reads most of the OTUs are unassigned?

Also, based on the attached summary of our demultiplexed results, what would be your suggestions for data2 parameters to use?

Thank you very much!

Joao

yanxianl · August 31, 2020, 6:11pm

Hi Joao,

For typical sequence data generated from Illumina platforms, the forward reads quality should be much higher than that of the reverse reads (for illustration, check this post in the forum). But it's opposite in your interactive quality plot: the quality of forward reads crashed at the base position 185 whereas the quality of reverse reads crashed at the base position 229. This strongly suggests that something went wrong during the read demutiplexing. You may first want to check the code you used for demultiplexing the reads and see what could have possibly gone wrong.

JoaoGabrielMoraes · August 31, 2020, 8:18pm

Hi @yanxianl,

Thank you very much for your prompt response.

I am using the following command to import, demultiplex, and cut off adaptors:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest_DAY14_STUDY.txt\
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33V2

qiime demux summarize \
  --i-data paired-end-demux.qza\
  --o-visualization demux.qzv

qiime cutadapt trim-paired\
  --i-demultiplexed-sequences paired-end-demux.qza\
  --p-cores 10\
  --p-adapter-f ATTAGAWACCCBDGTAGTCC\
  --p-front-f GTGCCAGCMGCCGCGGTAA\
  --p-front-r GGACTACHVGGGTWTCTAAT\
  --p-adapter-r TTACCGCGGCKGCTGGCAC\
  --o-trimmed-sequences trimmed-paired-end-demux.qza

qiime demux summarize\
  --i-data trimmed-paired-end-demux.qza\
  --o-visualization trimmed-paired-end-demux.qzv

Also, I am not getting any errors in my slurm output. Any ideas on what could be going wrong?

Thank you, Joao

yanxianl · August 31, 2020, 9:42pm

I think the sequence data were already demultiplexed before the data import. Any chance that you could share the manifest file (manifest_DAY14_STUDY.txt) and the quality plot of the demultiplexed sequence before primer trimming (demux.qzv)?

JoaoGabrielMoraes · August 31, 2020, 9:59pm

Hi @yanxianl,

Please find attached the information you asked.

manifest_DAY14_STUDY.txt (20.9 KB)

demux.qzv (309.7 KB)

Thank you, Joao

yanxianl · September 1, 2020, 9:38am

Hi, the base quality of forward and reverse reads changed a lot after you trimmed the adaptors. Since the reads were already demultiplexed, I don't think they contain adaptor sequence. Maybe you can try trimming the primer sequence only and see if you get something different:

qiime cutadapt trim-paired\
  --i-demultiplexed-sequences paired-end-demux.qza\
  --p-cores 10\
  --p-front-f GTGCCAGCMGCCGCGGTAA\
  --p-front-r GGACTACHVGGGTWTCTAAT\
  --o-trimmed-sequences trimmed-paired-end-demux.qza

Alternatively, you can directly trim off the primer sequence using the --p-trim argument within dada2. Based on the quality score shown in demux.qzv, you can try running the following code and see what you get:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 19 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 240 \ #  you can change to what you think is fit for the purpose
  --p-trunc-len-r 240 \ # you change to what you think is fit for the purpose
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv

Check other posts in the forum (here and here) on primer trimming .

JoaoGabrielMoraes · September 1, 2020, 4:17pm

Hi @yanxianl,

Trimming only primer sequences makes things wrose. Please see trimmed_PRIMERONLY_-paired-end-demux.qzv (314.5 KB).

I am curious about why the quality scores of my forward sequences are worse than the ones from the reverse sequences. I wonder if that has to do with quality of the microbial 16S rRNA sequenced, and not with how the data was handled. The tissues been analyzed in this experiment are thought not to have a big microbiome, so I wonder if the microbial 16S ribosomal RNA sequenced are mostly from background noise, and perhaps the quality of the RNA sequenced was not very good. Do you think that's possible?

Hi @Mehrbod_Estaki, if all possible, can you please share with us your thoughts with this issue I am having? I know you have a lot of experience with this so I would really appreciate to also hear your thoughts.

Thank you so much!!

Joao

Mehrbod_Estaki · September 2, 2020, 7:41am

hi @JoaoGabrielMoraes,

I think @yanxianl's suggestions are right on the target here. I agree in that your forward reads do look rather odd in that they are tanking in quality well before we expect them, especially when the reverse seems to be in better shape, though they also look pretty odd once you trim with cutadapt.

Regarding these "Unassigned" or assigned to only "Kingdom" level ex. d_bacteria..can you tell us what percentage of your data do they make up? You can look up/share the taxa barplots for a quick sanity check. That can tell us if you're having a classification issue, or maybe host contamination. I have a hunch its the latter. You can try blasting some of those poorly assigned reads to see what they hit.

Another point of interest, your quality plots following cutadapt have a sudden drop at 185 not only in quality but also ~ 40% of your reads are no longer that long, which is odd to me, no pimer/adapters should be that far into your reads that cutadapt would remove them. This might actually be a good indicator that your cutadapt biasedly removed some group all together, though we aren't sure what those were, either target, or contamination.

Things can get a bit messy indeed when low biomass samples are being sequenced, especially if you have a higher host cell:bacterial cells ratios. However, the actual sequencing/quality scores shouldn't be affected by that. If you have untargeted sequences, then you'll get high quality untargeted sequences.

Ultimately my suggestion is to start at the top and discuss with your sequencing center, since I think the issue is not with handling the data, but the actual sequencing. And those folks would have a much better understanding of these quality scores. You can also ask them about whether the primers/adapters etc have been removed or not, trying to remove them twice may also be problematic. Also, I would first focus on your forward reads only, and once its all sorted then deal with the reverse reads.

Sorry couldn't be any more help.

JoaoGabrielMoraes · September 2, 2020, 5:46pm

Hi @Mehrbod_Estaki,

Thank you so much for your input!

Looking back on my demultiplexed results (before trimming off adaptors)[ (demux.qzv (309.7 KB)] good quality forward and reverse reads dropped to ~100 bp, but reverse reads are actually in worse shape than the forward reads.

I checked with the Core and the adaptors had not been removed from the fastq files that I am using.

The adaptors that I am removing with cutadapt have ~20 bp, and if you look in the forward reads of the trimmed file trimmed-paired-end-demux.qzv (315.2 KB), you will note that the good quality bp dropped from ~100 to ~80 bp which makes sense. However, it's odd that adaptor trimming somewhat improved the quality score of the reverse reads.

How can you tell that ~40% of my reads are no longer than 185 bp?

I think @yanxianl idea of removing the adaptors in dada2 based on adaptor position rather than sequence is a nice way for trying to sort this out. My question with this approach is how can I make sure we are also removing additional adaptors (e.g. Illumina sequencing adaptors) from our reads. My understanding is that removing sequencing primers (515F/806R) with cutadapt also also removes Illumina sequencing adaptors, as the later are located at end of the read.

I completely agree with you that my problem is probably related to working with low biomass samples. I don't have this same issue with another dataset with high biomass samples.

Regarding the taxonomy analysis, please see the taxa bar plots for both:

only forward reads truncated at 135 F_135_taxa-bar-plots.qzv (383.5 KB)
using both forward and reverse reads truncated at 135 taxa-bar-plots.qzv (1.6 MB) .

I think the reason why my taxonomic analysis is failing when only forward reads are included (truncated at 135) has to do with the length of the sequences. When both forward and reverse reads are used the mean sequence length is ~244 bp rep-seqs.qzv (516.0 KB) as opposed to only 135bp when including only forward reads F_ONLY_rep-seqs.qzv (593.6 KB).

Once again, I really appreciate your feedback!

Thank you, Joao

Mehrbod_Estaki · September 8, 2020, 6:42pm

Hi @JoaoGabrielMoraes,

This isn't technically what is happening here. The reason why the reverse reads look "better" is that you now have majority of your reverse reads that super short, so the reads that are not trimmed that much appear to be in better quality.

I'll use your new trimmed-paired-end-demux.qzv as an example here. Hover your mouse over the q-score plot. You'll see a line in the description that says:

The plot at position 23 was generated using a random sampling of 6060 out of [object Object] sequences without replacement[...]

Since by default these plots are using 10,000 random reads for demonstration, we know that about 40% of your reads following cutadapt are now never 23 bp long.

So something weird has happened during your cutadapt step, in particular to your reverse reads because we don't really see this issue with your forward reads.

These do not look good at all, with majority of your reads being unassigned. The forward reads are particularly problematic because there isn't even an option to view beyond level 1 in the taxonomy levels.

The taxonomy issues might be a) related to how you created your classifier or b) you truly have that much host contamination. This is worthy of starting its own issue with the PRESCRIPt folks once we sort out the feature-table stuff here to make sure the issue isn't just with the processing.

I strongly doubt that, we get much better classification with even 100 bp than what you're seeing here. This is a separate issue that is only worth trouble-shooting once we sort out the feature-table issue.

You have a few options here, this is what I would do

A good starting place is to reprocess with just your forward reads for now
Instead of removing adaptors, try using your primer sequence in cutadapt, this will look for your primer sequence and remove everything before it, including the adaptors and anything else. You an also set the --p-discard-untrimmed parameter and that does an extra level of filtering where any read that doesn't have the primers in it gets discarded.
Denoise, truncate to 100bp. If you use Deblur, it has a nice negative-filtering step in it that gets rid of crazy looking reads that are often host contamination. If you use DADA2, I would recommend doing a similar negative filtering step afterwards. I find this step to be super useful when I'm expecting a lot of host contaminants. This thread explains this process a bit more in detail (but keep in mind that the % identity and coverage I used in that thread were too high. I would use something like 60-65% instead now)
Re-run your feature-classifier and assign taxonomy.

This should be your standard bar, what you get here is essentially the maximum number of reads you should expect. Once you get a decent output from this step then you can go back and fine-tune further to get longer reads if you want. As you increase the length of your reads, the # of reads you retain will go down. I'm not sure with your current data if you'll be able to ultimately use the reverse and merge reads with a decent # of reads, but I will say that the increase in resolution that you get from say 100, 150, or 250 bp with regards to taxonomic classification is not that great. So ultimately, it's not that big of a deal if you can't merge them, or even get much longer reads.

From Wang et al. 2007:

Hope this helps!

JoaoGabrielMoraes · September 10, 2020, 4:47am

Hi @Mehrbod_Estaki,

Looks like we are on the right track here. Thank you very much for your help!

Attached are the results after using only the primer sequences and including the --p-discard-untrimmed parameter in cutadapt. Please see trimmed-paired-end-demux.qzv (316.8 KB). Quality scores really improved in both forward and reverse reads.

So my guess is that a lot of the low quality reads we were seeing before trimming were from untargeted sequences.

After seeing the results from cutadapt, I decided to use both F and R reads in DADA2. By truncating F and R at 162, ~96% of my reads passed DADA2 filter and ~95% of the forward and reverse reads merged. denoising-stats-dada2.qzv (1.2 MB). I won't lie, I got really excited with these results.

However, when doing the taxonomic analysis, the majority of my reads are still Unassigned. Please see taxa-bar-plots.qzv (1.7 MB) taxa-Day14-table.qzv (1.3 MB) taxonomy.qzv (1.5 MB)

In the taxonomy.qzv file, please note that taxa were assigned for over 1800 ASVs. However, the majority of the reads detected were unassigned - see "taxa-Day14-table.qzv" file.

Because we removed the reads without primers in cutadapt, the Unassigned reads shouldn't be from host contamination, so my guess is that something is not right with my taxonomic analysis.

I am using the following code in feature-classifier:

qiime feature-classifier classify-sklearn \
  --i-classifier silva-138-99-515-806-nb-classifier.qza\                                             
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

Any suggestions on how to move foward?

Thanks in advance!

Mehrbod_Estaki · September 10, 2020, 8:01am

Hi @JoaoGabrielMoraes,

That's great news! Glad to hear things are starting to smooth out.
I think at this point we can start a new thread to discuss the taxonomy issue separately as the initial inquiry seems to be resolved here. Also it will be easier to recruit taxonomy experts if needed to a fresh thread.

Not necessarily. You could still have untargetted amplification with your primers if they somehow found a similar target on the host.

Can you also try a few things in the meantime and report them in the new thread.

describe what the sampe-types are, if you expect more host than bacteria DNA etc.
Did you try doing a negative filter as I recommended earlier? It's possible that those unassigned reads are host contamination that someone have made it through. A simple negative filtering can easily remove them. Though they are the overwhelming majority of your reads, so that won't necessarily be a win either.
Can you also BLAST a few of those unassigned ASVs to see what they are hitting? You can do that using your rep-seqs.qzv, alongside with your taxonomy.qzv to find the unassigneds. Sharing that rep-seqs.qzv in the new thread will also help. If they are hitting real bacteria, then something is wrong with the classifier. If they are hitting uncultured or unkown host stuff then the issue is probably because you have low biomass samples with lots of untargetted reads.
Double check that the primers used in your run are the exact same that were used in the pre-trained classifier you are using.

That should be a good start.
Good luck!

JoaoGabrielMoraes · September 11, 2020, 7:11pm

Thanks again @Mehrbod_Estaki!

I will open a new thread as suggested and post my new findings there after going through your suggestions.

I own you at the least!

Mehrbod_Estaki · September 11, 2020, 8:49pm

Glad to be of help! You don't owe me anything, but I'm never one to say no to a drink with a friendly scientist. Here's to hoping

system · October 13, 2020, 2:49am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.