Only one taxonimic level?

Hi there,

I have metagenomics SE data for an insect. I'm test running on 32 samples, which has ~18M reads. I ran DADA2, made phylogeny trees and then taxonomic analysis. I've tried all 4 databases below, but in all cases, the output only has Domain Taxonomic level, which is bacteria. Attached are silva outputs. I am not sure what could be the issue. In another post there was a similar issue, but by choosing full length classifier issue was resolved, that's why I tried all 4 databases: 1) gg-13-8-99-515-806-nb-classifier.qza, 2) gg-13-8-99-nb-classifier.qza, 3) silva-132-99-nb-classifier.qza, 4) silva-132-99-515-806-nb-classifier.qza

Looking forward to hearing back from you!

Good afternoon,

Could you tell us a little more about your data set? When you say SE data, do you mean single end? When you say metagenomics, do you mean untargeted shotgun reads, or PCR amplicons targeting a single region like 16S v4?

I ask because different data types need to be analyzed differently in order to get the best results. I want to make sure that denoising with dada2 and classifying with skbio are the best choices for your data.

Thanks!
Colin

1 Like

Hello Colin,

Yes I meant single end Illumina sequencing. It was amplicon experiment of 16S rRNA. So I used the amplicon processing steps.

Appreciate any help to resolve the issue.

Thanks.

Ah ok! DADA2 should be a good fit for 16S rRNA sequencing, for both paired and single end sequencing.

Iā€™m not sure why the taxonomy assignment is not working well. What script did you use for taxonomy assignment? Hopefully the qiime devs can provide some feedback to help you out!

Colin

2 Likes

Hi @nounou,

A couple of thoughts. First, just so you know the gg-13-8-99-515-806-nb-classifier.qza and silva-132-99-515-806-nb-classifier.qza classifies you used would only work if you used the same primers as was used to create those 2 classifiers, i.e. the 515F/806R primer set. If you are using another primer set then donā€™t even bother using those 2. Your best bet would be a) to either train your own classifier or use the pre-trained one with the full sequences (which you already have).

As per @colinbrislawnā€™s suggestion, providing us some more detailed information about the steps youā€™ve taken so far will greatly help with the troubleshooting. Are 100% of all your reads assigned to only Bacteria or do some reads get assigned to something else? From experience, most often the issue of non-specific assignments come down to either improper reference database usage or failure to remove non-biological sequences at the denoising step. Iā€™m going to guess in your case it is the latter.
Can you check to make sure that prior to DADA2 your barcodes, primers, or any other non-biological sequences have been removed from your reads? If any of those are kept then your features will fail to hit beyond domain level.
Let us know and weā€™ll go from there.

2 Likes

@Mehrbod_Estaki thank you for the reply.

Yes, Iā€™ve tried the whole databases as well, so choice of database shouldnā€™t be the issue. One important note is that these are custom primers and not all my ~18M reads has forward primers attached to them (counting the primers in the runā€™s fastq delivered files). Prior to running qiime2, the barcodes are trimmed, but not the primers. I assumed having primers still attached is OK, since Iā€™ve seen papers doing so, and Iā€™ve been able to run experiments with primers not trimmed. After the denoise by DADA2, I have ~27K rep seq, which ~11K still have primers attached. So, DADA2 was able to select rep seqs from both with primer and w/o primer reads. There are not that many rep seq selected, do you think this low number of selected seq is OK? also,

1- if only < 25% of reads had primers attached after sequencing, is it still possible to detect taxonomic levels?
2- among 27K reads after DADA2, shouldnā€™t at least some of them be assigned to lower levels, e.g. class and all the way to species?
3- not supplying a meta data file could cause an issue like this? (I did not supply meta data)

Here are my steps:

qiime tools import
ā€“type ā€˜SampleData[SequencesWithQuality]ā€™ \

qiime dada2 denoise-single \

qiime feature-table summarize \

qiime feature-table tabulate-seqs \

qiime alignment mafft
qiime alignment mask
qiime phylogeny fasttree
qiime phylogeny midpoint-root \

qiime feature-classifier classify-sklearn \

Thanks a lot in advance.

This is actually a problem for DADA2. Having non-biological sequence in your reads will trip up the denoising process.

Since your primers are only sometimes attached, you'll need to use something like q2-cutadapt to remove them when present.

Also I'm a little concerned that not all of your data was sequenced in the same way. Where the same primers always used? Where there multiple runs with different configurations? (Each run should be denoised independently, and then merged.)

Hello @ebolyen,

I will re-do it with cutting primers and then using DADA2 to see if that can solve the issue.

There are different runs, so there were technical reps available. The primers were custom primers and many reads didnā€™t have those attached after the sequencing run finished. I pooled the tech reps and then went through the pipeline, but didnā€™t trim primers mostly because those werenā€™t present in all reads and also the past experience. What surprises me is that dada2 can picked rep seq from reads with and without primers, but then classifiers cannot assign either kind of reads (with or without primers) to any taxonomic levels. So, this makes me think that there should be another issue there, rather than primers.

Hi @nounou,
As @ebolyen mentioned, the most likely culprit here is the non-biological portion of your primers intact. That is to say the the overhang portion of the primer that is used to bind to barcodes/spacers etc, the portion of the primer that is part of the actual read shouldnā€™t cause a problem.

You mentioned that you had 27K representative sequences, not sure what your expected community diversity is but that is pretty high! In fact, suspiciously highā€¦ Which again might just be reflected in those non-biological sequences. Remember that rep-seqs are just the number of unique features identified in your samples, it doesnā€™t hold any information regarding their abundances.

I think the difference in your previous experience and with DADA2 is in the nature of how OTU vs ASVs are selected.
DADA2 is first building an error model based on a portion of all your samples regardless of what you have trimmed or what you left in. Then it will denoise your reads based on that error model, again not caring what you left in it or not. So if you have some reads that are the same taxa but with different primers on the 5ā€™, or in your case without the primers; for ex.

no-primer-AAACCCGGG
primerA-AAACCCGGG
custom_primer-AAACCCGGG

These will end up as 3 separate unique features in your rep-seqs table. Even though they should be the same feature. This is because the non-biological primers you left in there are making them different than each other. This would be true even if a single bp was different between them. Thatā€™s why they are referred to as exact sequence variants.
In contrast, OTU picking might have clustered them together into 1 OTU if they were within lets say 97% similarity of each other.

When it comes to assigning taxonomy, since those non-biological sequences donā€™t exist in the reference databases, the classifier canā€™t assign them to any single species, instead just leaves them as Bacteria since that is the closest assignment it could get. With OTU picking, this wasnā€™t an issue since we would have been much more lenient since up to 3% of error was still good enough to assign it to something.
To be sure, the DADA2 (and other denoising methods) are much more accurate and should replace OTU methods in most cases, including yours.
So, as per @ebolyenā€™s suggestion, trim those non-biological sequences out, re-run dada2 and try assigning taxonomy again. You should see less ASVs in your rep-seqs and your taxonomy should be much more accurate. We hopeā€¦if not weā€™ll have to start troubleshooting elsewhere:P

The second issue of importance here is, (again as @ebolyen mentioned) is that these samples are not from the same run. DADA2ā€™s error model building step is specific to a single run, so you should run the samples from the same run together then merge them later. Now, the merging of samples from different primers is a whole other topic on its own which has been discussed on this forum quite a bit. I recommend you do some searching and reading on the forum on that topic before merging your samples to make sure it is in fact appropriate in your case.
Keep us posted!

1 Like

Hello @Mehrbod_Estaki,

Thank you for the through and useful reply. Regarding the primers, I think you and @ebolyen were talking about >= 170bp sequences that are the whole primer, and I was talking about the only < 20bp that is part of the read ("CGCACAAGCGGTGGAGCAT" custom primer). That long primer is already trimmed and what's left on the reads are just these short pieces. As you mentioned:

those should be OK and I think that's why in my past runs those weren't an issue.

As a test, I used only 2 input files, from the same lane and same experiemnt. I ran it 1) without trimming these short peices of primers, 2) with trimming the primers. I used the exact same steps. In these 2 runs, I supplied a meta data file as well. Both runs finished successfully and generated taxonomic levels :grinning: hooray! the only difference here was the meta data file and I'm surprised to see that it solved the issue. I started the same 32 file run now, with a meta data and no trim, and I'll post here how that will go.

In these 2 file test runs, I've noticed that trimming the primers makes the quality scores deep sharply at the end. The number of rep seqs are 644 for no-trim and 1900 with trim. One thing that makes this analysis difficult is that in the sequecing fastq files, there are 20-30% reads with that short primer where the lab expects them all to have that. I'm not sure, but that could be part of the underlying issues.

Thanks again and I'll keep you posted about the current run.

Hi @nounou,
I'm unclear as what you are referring to when you say long primer vs short pieces..
Have a look at this slide/image as to what we are referring to, which is the black boxes usually about 20 bp or so. The locus-specific portion of the primer is what binds to your target (black), while the overhang adaptor (green) which is usually used to combine your primer to barcodes is non-biological, meaning it has no biological information and is purely used for technical purposes.
So, all we are saying is make sure that the overhang adaptor (green), and anything else before it that may still be in your reads are trimmed/removed.

Again, I'm not sure what you mean by this. Just have a look through some of your raw fastq files, say using the head or tail command and just check to see if the non-biological sequences have been removed.

Glad you got your taxonomic assignments! Could you please elaborate what you mean by using the meta datafile?, this would help trouble-shoot future cases like this.

The representative sequences obtained from dada2/deblur doesn't depend on a metadata file, nor is one required for assigning taxonomy using a classifier, so I'm curious as what you mean by this was the difference..

This is a bit surprising to me, but without knowing exact details of your primers and reads and what your dada2 parameters were in each run it's hard to infer anything meaningful from that, at this point.
Let's see how your second run goes and we can go from there!

Hello @Mehrbod_Estaki,

Thank you for the slides, having the color codes helps me explain what I meant better. By long primers I meant the green box, which is already trimmed. I should have used ā€œadaptersā€ term, and not long primers, sorry, didnā€™t mean to make any confusions.

About the 20 or so bp primers (black box), those are still on my reads, which I think it should be OK. However, there are random number (<9 random bps) at the beginning of the reads right before the primer, which are the Heterogeneity Spacer (HS). I have not trimmed those. I think this must have caused the very large number of rep seqs. Now, I have 2 questions:

1- should I trim those HS bases only and leave the primers (black box) on my reads? or primers should be gone too?
2- if I should trim just those HS bps, what is your suggestion for doing so? adapter trimming tools normally trim a given sequence and keep the rest of the bases. In this cases, I want to find a primer and trim any bases before it, which makes reads shorter by 3-9 bases. One way to do it is to use dada2 trim-left option and cut 9 bases from all reads, but that will make the primer left on reads have variable length. Since dada2 seems sensitive to the primer, Iā€™m concerned about this option. Or I can write my own code to only cut the 3-9 bases before the primer.

On other note, I was also surprised to see that adding a metadata file solved the taxonomic level issue, since in dada2 and even any other commands before making summaries and classification that file is not used. But Iā€™m just glad that providing it solved the issue, since the data and commands were exactly the same, just this file was added. (this is the metadata that can be validated by keemei or qiime metadata validate).

Thank and hope to hear back from you soon about trimming the HS bases.

Hi @nounou,
The heterogeneity spacers are wonderful to have in your design as they greatly increase quality in sequencing but they are a bit of a hassle to deal with bioinformatically. You certainly need to remove those as well since those are also biologically meaningless and lead to false new ASV picking.

The loci-specific part of the primer can technically be left in but I know others claim higher accuracy with their taxonomy assignment when those are removed. Those reads are more or less the same across all reads and don't provide any additional biological information and in fact they may reduce the accuracy of alignments later.

Unfortunately qiime2 doesn't have an easy method to deal with variable primers, and this has been brought up in other posts. You may want to search those and see if there were any solutions posted. So you can either find a way to do this outside of qiime2 or as you suggested trim a constant number of bp from your 5' that is larger than your longest HS lengths. The downside in this method is you are losing some resolution but also you will still create some artificial ASVs since the same exact taxa can have a number of different ASVs just based on how many bp were trimmed from it (I wouldn't suggest this method).
You may also be able to hack something with q2-cutadapt by running your data a bunch of times with each time searching and cutting one HS spacer. I have no idea if this will actually work, I've never tried it, just throwing some ideas around:P

Lastly, I'm still very confused about the whole metadata inclusion part. Neither denoising step with dada2 or the feature-classifier actions even have an option to include a metadata file, so I'm guessing there is something else you did different between the runs. Regardless, you'll have to re-do them anyways with HS spacers/primers removed so let's see how the new run turns out and go from there.

Sorry I didn't have any perfect solutions for you. Keep us posted!

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.