Follow up in one taxonomic level - trimming program

Hello @Mehrbod_Estaki ,

Hope all is well! I had to start a new topic since our previous conversation about taxonomic levels is closed (more than 30 days no reply).

Regarding the heterogeneity spacers (HS) bases, there weren’t any good programmers out there to take care of them, which was VERY surprising to me! so, I wrote my own trimmer and it’s on Github if anyone interested. Here is the link:

I trimmed any HS bases before the custom primer, but kept the primer on the reads (any reads without primers is also discarded).

I tried 64 trimmed samples, and I noticed that the run finished faster that my exception, so I think DADA2 benefited from the trim. Classifications give me down to 7th taxonomic level, so it’s working well.

However, in the table output of DATA2, I still have high “Number of features”. Here is how it looks like:

Number of samples: 64
Number of features: 29,394
Total frequency: 33,368,369 (total reads)

Is this number too high? I think it is possible to have high bacterial presence in my species, but just wanted to make sure this is not an alarming high/irrelevant number.

Thanks for your help!

1 Like

Hi @nounou,
Thanks for following up on this from the previous discussion.

Couldn't agree more, tough to find available tools for this but thank you very much for sharing your code! I suspect spacers are going to become more popular and the need for this kind of tool will be greater than ever. If I may recommend one thing, if you get some time, it may be useful for newer users if there was an example workflow in the readme, this would certainly increase the reach of your tool, and may help detected any potential issues/bugs as well.

What is the reasoning behind discarding those reads, and I'm curious as why some of these reads already don't have primers? I can't recall from our previous discussion. Why not keep keep all the reads and trim the primers on reads that have them. Keeping universal primers in your reads doesn't really add resolution to the data and in fact people have reported that classification is more accurate without them.

Great! That is to be expected with shorter reads. And glad you got better classification!

That really depends on your samples. For human/mice fecal samples, that number certainly seems on the high end of the spectrum. But perhaps as you mentioned that is normal in your insect data? What is surprising though is that when you removed these HS compare to when you hadn't removed them before, you actually end up with even more rep-seqs (you mentioned 27K in the previous post). This doesn't sit right with me as I would expect a fold drop in your # of rep-seqs roughly equal to the number of different HS you had. So I'm not sure why this number actually went up. Do most of your reads get taxonomic assignments? What portion of your reads are unassigned, unknown, or only assigned to Kingdom/Domain level?

Lastly, have you confirmed your tool is working properly and as expected? For example, what happens if the expected HS sequence is off by a single nt etc.

1 Like

Hello @Mehrbod_Estaki,

About the program, that’s a good suggestion, I’ll make some changes on the ReadMe to make it easier to use. About the location, yes, it is tested. Basically, the program asks for the primer sequence from the user, searches for that within the read and trims off anything before it. So, it doesn’t matter where the primer is.

The % of reads with no primer is very low (<4%, but mostly 2-3%), and I decided to discard those since talking to the lab, the primers should be attached to healthy reads. The ones with no custom primer might be sequencing artifacts. Of course other people are welcome to change my code and keep those reads, but I think keeping reads that are as close as possible to each other and with intended primer makes more sense.

Regarding the rep seq, in that example I had ~18M reads with 27K rep seqs. In the new run, I had ~33M reads with 29K rep seq, so it actually dropped after trimming HS. I didn’t mean to cause confusion, but I these are two different number of input data. Looking at the % of rep seq, it is lower after trimming HS. The reads are mostly assigned taxonomic levels down to species level, which is great. There aren’t similar studies so I am not sure if 29K rep seq out of 33M reads for this insect is OK or not, but we do see some expected species there, so I think this high number should be OK.

The other point is about keeping primers or trimming those. As you mentioned, there are studies using either approach. Not sure if I should trim the primers since all the reads have those and I assume those make no difference in classification. My programs brings the primer to the 1st base of each read (by trimming anything before it), so it should make a uniform format and reducing the effect of primers on classification results. What do you think?

BTW, is there a straightforward way of filtering # of reads assigned to each taxonomic level? I downloaded the CSV files from the taxa-barplot-silva-132-99.qzv and counted the bacterial assignment in CSV files. It didn’t make sense because if was ~700K, but I only have 29K rep seq. So, not sure how to summarize each taxonomic level.

1 Like

Hi @nounou,

:+1: I was just curious what happens when a read does not have any HS associated with it, or if its off by a single nt. Is that read discarded or kept as is? Because even a single nt difference leads to a new ASV. But I think the second part of your script which discards reads without the primers within, deals with that nicely by removing those spurious ASVs anyways.
When you set up an example workflow, feel free to share it in the community contributions section of this forum!

Fair enough! and 2-3% is very low anyways so not an issue.

Without a reference study it is hard to say but at this point there's no reason to doubt it either. How many taxa do those 29k unique ASVs collapse when you collapse them at the species level?

I personally haven't done any testing on this and I don't have a reference in mind either that actually shows this benchmarking, though I know it exists somewhere out there. If I understood correctly your program trims the user defined primer sequences out completely? If so that is totally fine, and arguably the best approach. If not, it's still fine :stuck_out_tongue:

Not sure what you mean by filtering, but I think what you want is your feature table and not the rep-seqs file. Rep-seqs is simply a list of the unique features observed in your feature-table and actually holds no information about the abundance of those taxa. The feature-table on the other hand, has abundance information. The csv file you downloaded from the barplots is that feature-table. So you have 700k reads and 29k unique taxa. To get the per-sample abundance information beyond what the barplot shows you'll just have to manually calculate what you want from that feature-table.
Note, you can also collapse your feature table down to any taxonomic level using the collapse action instead of downloading it from the barplot visualization.
Good luck!

1 Like

Hi @Mehrbod_Estaki,

At this time, the program keeps the primer provided by the user, and discards anything before it (HS bases). At the results all the reads have these primer at the beginning. It looks for the exact primer, so even if one nt is off that read will be discarded.

Sure, I’ll add it to the community contribution section.

About the summary, I’m not sure if feature-table can make summary that I want. The “qiime feature-table summarize” commands uses the table generated by dada2, and has denoise and sequencing information. What I am looking for is filtering/summarizing the infromation after the classifications, e.g. the data presented in " taxa-bar-plots-gg-13-8-99.qzv". For example how many level 7 do I get for each group (provided in metadata file). I downloaded the CSV file from that taxta-bar-plot page hoping to have a clear answer, but those numbers don’t add up correctly (I think!). I couldn’t fine the “collapse” function that you metioned. I appreciate if you can help me learn how to summarize the taxonomic levels for each group.

Thank you!

1 Like

Just posted about the trimming program on Community Contribution section.

1 Like

Hi @nounou ,

Sorry about the delay in replies. I'm currently out of the country on holidays with not so great internet connection.

Thanks so much for sharing this! :tada: I'm sure this will be useful for others using these spacers.

So the feature table you get following DADA2 is ASVs and does not have any taxonomic assignments. If you use the collapse function I linked in my previous post, you can use that DADA2-derived ASV table in conjunction with a classifier (which you'll either have to make or use a pre-made one) to create a new feature-table that has taxonomy assignments at whatever level you want (i.e. 7/species) instead of just hashed IDs. You can use this new taxa-table in stead of the ASV-table for all downstream analysis, so you can use it as input in feature-table summarize etc.

Was there something wrong with the link I posted above? I just tried it and it takes you to the plugin page.

Can you clarify what you mean by they don't add up?
That table will have taxa counts (columns) by samples (row) and so you can use that information however you want. Note that these numbers are not relative abundances as the actual barplots.

This csv file is actually pretty similar to the table you would get using the 'collapse' function. Except the one you download from barplots will have extra columns holding metadata information, and its not ready to be used in other downstream qiime2 analyses since its in csv format.


I'm not sure what information exactly you are looking for, but if the above doesn't get you to what you need, can you provide an example of what you are trying to get out of your data and we'll try and get you there.

1 Like

Hi @Mehrbod_Estaki,

Hope you had a nice trip and Happy New Year.

I think it's easier to say what I need with a screenshot. In the included image, I want to summarize % of a certain species (level 7). In the GUI, I cannot see an option to summarize the populations of each species.

One more thing, the order of the samples that I provided in the metadata is changed on this plot. How can I sort those again back to how I provided them in the metadata file?

Also, if I have 2 sequencing run, after running DADA2 for each run separately, then I can sefely merge those outputs and run the rest of the analysis, correct?


1 Like

Hi @nounou,
Thanks and to you as well!

In the barplot if you hover the cursor over the barplots it will give you the % of that species within that sample. That is about the extend of summary you can get with that graph for each species. If you convert your taxonomy-assigned feature-table to relative frequency using the relative-frequency plugin, then you can extract a csv file of your data in relative frequency and use whatever tool you want to summarize them further based on your metadata categories. So if you want a statistics summary of the taxa, for example, mean +/-SE of a species across all control samples you'll have to do that yourself outside of Qiime2.

In the barplots graph, on the top right hand side there is a "Sort Samples by" drop down menu that you can use to re-sort your samples using whatever nomination you want. You can also click on the blue '+' icon and add extra sorting levels.

Only if you used the exact same primers and the same trim/truncating parameters when you were denoising. There are several other threads on the forum that discuss how to merge multiple DADA2 derived feature-tables and why the parameters need to be the same. The FMT tutorial also shows an example of how this should be done. Have a quick search through the forum and that tutorial and let us know if you have any questions!

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.