Interpretation of results after taxonomy analysis from decontaminated table

Hi everyone,

I have doubts about the interpretation of my taxonomy files, to be clear, I'll separate my pipeline into topics and add files that are relevant to guide me if I've had an error in the process.

I work with low biomass samples, mosquito midguts from different localities.

1 - I've demultiplexed sequences from Illumina Miseq, that I had denoised using dada2 and chose a single end approach due to poor quality in the extension of the reverse sequences.

2 - I have found reads in my two negative controls, a contamination, to avoid this in my final results I decided to use one approach that eliminate the cross contamination.
I decontaminate my table.qza using microDecon in R, and create a table-decon.qza (decontaminated table) to proceed with my analysis.

3 - I trained a classifier for the v3-v4, looking in the forum I realized that it's the 341-805. The primers are from the Illumina tutorial: you can find the primers here

And I use the sequence and taxonomy qza avaiable here: here

Here are my code to retrain the classifier:

qiime feature-classifier extract-reads \
 --i-sequences ./silva-138-SSURef-341f-805r-Seqs.qza \
 --p-f-primer CCTACGGGNGGCWGCAG \
 --p-r-primer GACTACHVGGGTATCTAATCC \
 --p-trunc-len 450 \
 --p-min-length 100 \
 --p-max-length 600 \
 --o-reads ./silva-138-SSURef-341f-805r.qza


qiime feature-classifier fit-classifier-naive-bayes \
 --i-reference-reads ./silva-138-SSURef-341f-805r.qza \
 --i-reference-taxonomy ./silva-138-341f-805r-consensus-taxonomy.qza \
 --o-classifier ./silva-138-341f-805r_classifer.qza 
  1. Here are my code to generate taxonomy file and taxonomy barplot for the decontaminated table and the "original" table:

    qiime feature-classifier classify-sklearn
    --i-classifier silva-138-341f-805r_classifer.qza
    --i-reads rep-seqs-single.qza
    --o-classification taxonomy.qza

    qiime metadata tabulate
    --m-input-file taxonomy.qza
    --o-visualization taxonomy.qzv

    qiime taxa barplot
    --i-table table-single-decon.qza
    --i-taxonomy taxonomy.qza
    --m-metadata-file /mnt/c/users/Joao/desktop/FASTQ_16S/Metadados.txt
    --o-visualization taxa-bar-plots_decon.qzv

    qiime taxa barplot
    --i-table table-single.qza
    --i-taxonomy taxonomy.qza
    --m-metadata-file /mnt/c/users/Joao/desktop/FASTQ_16S/Metadados.txt
    --o-visualization taxa-bar-plots.qzv

  2. Here are prints from my taxa barplot from the "decon" table and the "original" table. My question is why they are so similar, if I I decontaminated my table? It seems that all my samples have a homogeneous profile, which contradicts the literature that shows that samples from different locations have a heterogeneity of microbiota.
    Why do we have many taxa that are not even passing order or family in taxonomy level?

6 - I also compared this results with my alpha rarefaction and I saw that there is a difference in the diversity of the groups that does not correspond with this homogeneity

Here are my files in case you want to consult:

table-single.qzv (1.0 MB)
taxa-bar-plots.qzv (365.4 KB) taxonomy.qzv (1.6 MB)
rep-seqs-single.qzv (922.5 KB) alpha-rarefaction.qzv (461.4 KB) table-single-decon.qzv (831.3 KB)
taxa-bar-plots_decon.qzv (363.5 KB)
alpha-rarefaction-decon.qzv (461.2 KB)

@joaomiranda,

Overall your approach seems good to me, let's see if we can get you a bit better results.

Lets start here! This is often a result of an incomplete removal of all non-biological sequences before running DADA2. It would be worth checking to make sure that all primers, barcodes, or any other non-biological sequences have been removed. Beyond this being a previously encountered issue, it appears that the number of ASVs returned are ~5000, which is a pretty large number even for a rich sample.

Another thing that could be going on is that low biomass samples can end up having some untargeted, non-bacterial hits. This is especially the case when using V3-V4 primers. I believe that @Mehrbod_Estaki has run into this during a mouse intestinal study and has recommended doing a filter step after running DADA2. The goal of this step would be to remove any non-bacterial sequences that are in your samples. Check out this tutorial and this doc page for some ideas :page_facing_up:.

Also, this is not the source of your issue, but it looks like you are truncating at 450 during the extract-reads step but then trimming to 65/truncating at 240 for your DADA2 step. You could go ahead and use these same settings for your extract-reads step since you ended up using single end reads.

Have you run a PCoA plot? It is possible that some separation does exist but is hard to see on the barplots but might be more apparent with the PCoA plot.

3 Likes

Hi @Keegan-Evans ,

Thank you so much for your comments. Below, I'll summarize what I did according to your guidelines:

I've received demultiplexed sequences to initiate my analysis. To make sure that all non-biological sequences were removed, I've runned cutadapt. Analysing the resulting quality plots, I've noticed some differences in Qscore that changed my settings for trim and trunc in dada2.

Here are my code for cutadapt

qiime cutadapt trim-single \
 --i-demultiplexed-sequences single-end-demux.qza \
 --p-front CCTACGGGNGGCWGCAG \
 --o-trimmed-sequences trimmed-seq-single.qza \
 --p-error-rate 0 \
 --verbose

the visualization file: trimmed-seq.qzv (318.5 KB)

Based on the quality plot, I've decided to run dada2 in a single-end approach (as I was doing previously), but with new parameters for trim and trunc. Analyzing table-single2.qzv in comparison with the table of the approach I showed earlier, I noticed a drop in the amount of features and reads, even with parameters that I considered well adjusted according to the Qscore of the quality plot.

Here are my code for dada2 denoise-single

    qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demux.qza \
  --p-trim-left 40 \
  --p-trunc-len 240 \
  --o-representative-sequences rep-seqs-single2.qza \
  --o-table table-single2.qza \
  --o-denoising-stats stats-single2.qza

the visualization file: table-single2.qzv (781.7 KB) rep-seqs-single2.qzv (719.7 KB) stats-single2.qzv (1.2 MB)

I've trained a classifier using the settings around my trim and trunc parameters, as you said, and use this discussion to help me with the --p-min-length and --p-max-length.

Here are my code for extract reads

qiime feature-classifier extract-reads \
 --i-sequences ./silva-138-SSURef-341f-805r-Seqs.qza \
 --p-f-primer CCTACGGGNGGCWGCAG \
 --p-r-primer GACTACHVGGGTATCTAATCC \
 --p-trunc-len 200 \
 --p-min-length 0 \
 --p-max-length 0 \
 --o-reads ./silva-138-SSURef-341f-805r.qza

Finally, I've repeated my decontamination steps using microDecon in R, and from the decontaminated table I generated a taxa barplot. Some observations:
I realized that there is no longer a taxon designated as a chloroplast, as there was in the barplot taxa I showed earlier. The decontaminated table returns ~600.000 reads and ~3400 ASVs. Are these numbers are good to proceed with downstream analysis? Unfortunately, I still have taxons that don't pass a taxonomic level like domain or class.

My table decontaminated, the taxa barplot from the decon table and from the "contaminated table": table-single-decon2.qzv (686.9 KB)
taxa-bar-plots_decon2.qzv (363.2 KB)
taxa-bar-plots_new.qzv (364.8 KB)

I didn't run a PCoA plot yet, because I want to be more carefully with this data before proceed with diversity analysis (according with moving pictures tutorial, PCoA is generated when running beta diversity commands).

I think my main problem might be that the similarity between the taxa barplot from the original and the "decontaminated" table might be because the ASVs that were removed by microDecon are being assigned to taxons that are also being assigned by the ASVs that remained in the table.
But I think maybe this is an issue for another topic.

Hi @joaomiranda,

I don't see cutadapt being used in the provenance of your table-single2.qzv, are you sure you used the right input file there?

A few different things can be happening leading to you having even less reads in the second run. For one, is it possible you started with less reads after cutadapt? Check to see what your starting reads were between the 2 runs after you ran cutadapt.

Something I like to do with cutadapt if know for a fact that my primers are still intact is to set the --p-discard-untrimmed, this will discard any reads that didn't actually have those primers in them. I justify this by saying anything that is in there without the primers it not what I targeted. This will toss away a bunch of those potentially unannotated features you're seeing.

Next, in your first run you trimmed 65 nt from your 5' and truncate at 240 from your 3'. In the second run you actually trim less from the 5' with 40 and truncate to the same 240 spot. From looking at your quality plots I'm not surprised you're seeing such a massive drop at the initial filtering step of DADA2. Unfortunately your quality scores are quite low on your 5'. If I were you I would trim at least 51 nt to get you passed that initial dip in quality scores, I'd also try a second run with 80+ nt trimming to see you pass that second odd dip too. Luckily for you even if you were to trim that much from the 5' you should still capture a bunch of the V4 region and that should still give you decent resolution.

Now moving to your classification issues, again 2 things come to mind. First, did you see @Keegan-Evans 's recommendation about filtering out host reads?

There's a previous discussion on this if you want to see an example of what to do here. You can use much lower % identity and alignment than what I did a few years ago in that post. The idea is to just filter your reads to toss away anything that doesn't look like 16S, we're not looking for perfect matches at that step.

The second I noticed was that when you are extracting your reads for training the feature-classifier, you are truncating your reads at 200 with no trimming, however your dada2 parameters have you doing a 40/240 trim/truncate. What is essentially happening here is that the region you have extracted and trained your classifier is different than the ones where your ASVs represent. So, as @Keegan-Evans already mentioned:

I'm not sure if that is going to resolve all your classification issues, but it should certainly improve it! I'm of the opinion that you are getting a lot of reads that are coming from the host cells and are not true bacteria reads. One easy thing you can do to check this, is just find some of those ASV sequences that are not hitting past Domain/Phyla level and BLAST a few of them to see what they actually hit. The rep-seq visualizer already provides you with hyperlinks to BLASTn so is super easy to do.

Try these out and get back to us, hopefully we'll se some improvements!

2 Likes

Hi @Mehrbod_Estaki ,

Thanks a lot for your observations! I'll put it into practice and return later with better results.
But before that, I would like to clear some doubts:

I'm sorry :man_facepalming: this is the wrong input, I've runned dada2 without the output from cutadapt, I'll run correctly now.

I've runned cutadapt with --p-discard-untrimmed:

qiime cutadapt trim-single \
 --i-demultiplexed-sequences single-end-demux.qza \
 --p-front CCTACGGGNGGCWGCAG \
 --o-trimmed-sequences trimmed_seq_single.qza \
 --p-error-rate 0 \ 
 --p-discard-untrimmed \
 --verbose

qiime demux summarize \
  --i-data trimmed_seq_single.qza \
  --o-visualization trimmed_seq_single.qzv

Here you can visualize my quality plot, and I have some questions about the trim and trunc parameters after cutadapt: trimmed_seq_single.qzv (296.1 KB)

Based on the middle of the box in the positions of the quality plot at my 5', we can see a low Qscore in 43 and 44 nt but in the 79 nt, where are other odd dip, we can see a Qscore 38 looking for the middle of the box. So based in this observations and looking for the quality in the 3' I thought about running --p-trim-left 45 and --p-trunc-len 235 , is that correct?

Thank you so much! I'll take a look in this discussion and try to apply in my data

I don't understand my error here, I based my truncating in the extract-reads based on this discussion. I used the truncating parameter based on the length of my resulting amplicon (240-40 = 200) and set --p-min-length and --p-max-length to 0. How can I identify the region where my ASVs are correctly represented? I have to insert --p-trunc-len and --p-trim-left in my code with the same parameters as I used in dada2?

Thanks in advance!

1 Like

Hi @joaomiranda,
Sounds good, keep us posted on the new results.

Based on the quality plots I see my gut feeling is you're still going to lose a lot of reads during the filtering process. But I could we wrong, we'll see after the run is over. For runs like this I like to start with conservative trimming and then if the results are good, I can relax my parameters. In your case I would truncate at say at the 200 position, at least, before that big dip you see on the 3' tail. The "median" marker you may have read in the forum is a recommendation as a starting point, it's not a definitive rule. For trimming the 5', you may be right that your 45 trim parameter might be just fine but remember that your quality plot is based on only a 10,000 randomly subsample of your total reads. When I see that odd second dip, in my head I think there's probably a ton more reads that have that dip in quality, and it's possible all of those might get filtered out before they are denoised. So again, I would start with conservative parameters with DADA2, then relax them if you think your depth is sufficient.

Note that that discussion is about when working with paired-end reads. PE reads are somewhat different because after merging you can have variable length amplicons so we can't cut to a constant length. In your case because you are using single-end reads, you should just make sure you extract the same length as your DADA2 trim/truncate values. You can always compare the results to a "full length" classifier that is freely available in the data resources page to see how they compare.

You already have that information. When you use your primers in amplification, you focus a specific region. Then later on, you use the same sequences to extract reads from a reference database. So essentially you end up with a reference database that has reads containing the exact same region as your primers in real life. But since you are trimming and truncating your reads during DADA2 further, it makes sense to also trim and truncate your reference reads to the same length. Again, this isn't necessary per se, as long as your reference reads fully encompass your primer region (thus why you can use full length classifier), but extracting your reference reads to the same exact region as your primers has been shown to improve classification a bit.

Hope that makes sense?

Hello again @Mehrbod_Estaki,

I have some results and new questions.

After cutadapt I've runned dada2 denoise-single, and tested different parameters for trim and trunc. --p-trim-left 80 and --p-trunc-len 200 were the better parameters, with a good number of reads conserved and analysing the stats from denoising, a good percentage of input passed filter.

Check here: table-trim.qzv (706.3 KB) stats-trim.qzv (1.2 MB)

Next I've runned exclude-seqs, here is the code below:

qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs-trim.qza \
  --i-reference-sequences silva-138-SSURef-341f-805r-Seqs.qza \
  --p-method vsearch \
  --p-perc-identity 0.90 \
  --p-perc-query-aligned 0.90 \
  --p-threads 4 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza

qiime feature-table filter-features \
  --i-table table-trim.qza \
  --m-metadata-file misses.qza \
  --o-filtered-table filtered-table-trim.qza \
  --p-exclude-ids

And I retrained the classifier with the same trim and trunc parameters that I've chose above.

qiime feature-classifier extract-reads \
 --i-sequences ./silva-138-SSURef-341f-805r-Seqs.qza \
 --p-f-primer CCTACGGGNGGCWGCAG \
 --p-r-primer GACTACHVGGGTATCTAATCC \
 --p-trunc-len 200 \
 --p-trim-left 80 \
 --p-min-length 0 \
 --p-max-length 0 \
 --o-reads ./silva-138-SSURef-341f-805r.qza

Let's go to the results, I have two taxa barplots files, one using a table before the exclude-seqs and the other after apply this code, so in both there is a poor heterogenity in comparison with the other two runs that I posted above. It looks like that just one taxon was designed, and this taxa barplots are without the decontamination with microDecon.
This parameters of dada2 with cutadapt gave me the best table I ever had, unfortunately I don't understand this pattern in my taxonomy.

Here are the files, additionally there is a alpha rarefaction plot which indicates to me differences in the diversity of my groups, which still doesn't make sense in relation to these taxa barplot:
alpha-rarefaction-filtered.qzv (500.3 KB) taxa-bar-plot.qzv (353.7 KB)
taxa-bar-plot_filtered.qzv (406.6 KB)

1 Like

Hi @joaomiranda ,
Looks like things are progressing nicely and your commands all look good to me

A couple of minor comments, not sure if anyone of them would actually do much in your overall results.
How many reads and unique ASVs were retained after the negative filtering step you did? (exclude-seqs). You could try setting your alignment/identity parameters to .65 which would run much faster and probably have the same result and maybe include a few more features if they are not well defined in the reference database.

I'm surprised still to see some taxa not identified beyond Kingdom level, but, if this is a not well-characterized sample maybe they are real bacteria, just not present in your reference database. You can try BLASTing a few of those ASVs that came up with poor classification to see if they hit anything. Otherwise, you have 2 choices there, either a) keep them in your table as true "unknown" features, or, filter them out as contaminants.

I'm not sure what your expectations are here but alpha diversity and these taxa barplots are not necessarily connected. Your alpha diversity results are based on "ASVs", while the taxa barplot is operating on species collapsed (or higher) levels. Taxa barplots are also very hard to see visually when you have a dominating taxa such as in your case. This is why we recommended you looking at a PCoA plot of these results to see if there are any obvious group differences, those can be plotted with ASVs instead of collapsing.

At this point I think you can be comfortable with your results and start looking for the biology in the results.

Last comment, not relevant to you results here, but you may consider building your tree using the fragment-insertion tool as it has been shown to be more accurate than building one de novo when working with short target-gene sequences.

Update!

Hi @Mehrbod_Estaki ,
I think we have some good news, I runned the same commands that we had discussed here and set my alignment/identity parameters to .65.

But this time I used a taxonomy.qza from a pre-trained classifier Greengenes 13_8 99% OTUs full-length sequences. When I looked to the barplots I had some improve in the richness, heterogenity and found taxons that are common in the literature about microbiota from mosquitoes.

So this findings makes more sense to my research, now I'm feeling more comfortable to proceed with core diversity metrics and plotting the PCoA.

I'm relative new in qiime, so this step that you have mentioned it's to proceed with my table before the core diversity metrics?

Here are my table.qzv files before and after the decontamination using microDecon, I think I have a good number of reads and ASVs, also there are my taxa barplots before and after decontamination.

filtered-table-trim2.qzv (662.8 KB)
filtered-table-trim-gg-decon3.qzv (542.3 KB)
taxa-bar-plots-gg.qzv (1.3 MB)
taxa-bar-plots-gg-decon3.qzv (1.2 MB)


Thanks in advance

Great! Well I'm glad things have improved. I wonder if the difference was because of the 90% identity threshold being too strict or the use of Greengenes over Silva. :man_shrugging:

Yes, the tree you build here will be used when you are calculating diversity metrics that incorporate phylogeny, for example weighted or unweighted UniFrac (beta diversity) or Faith PD (alpha diversity). In your case if you use diversity core-metrics-phylogenetic, you'll need a tree as one of the inputs.

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.