Rarefying when having multiple runs

Hey you all!

I have a considerable number of samples sequenced in different runs (each sample is sequenced once, I don't have duplicates). But now I have some doubts about how to apply the rarefaction step.

How should I rarefy the different runs? I mean, if I want to publish my results, do I have to explain that I used a different rarefaction depth for each run, or is it possible to do the rarefaction step on all the samples together?

Thank you in advance :slight_smile:

As I understand it, you have samples from different runs, but they were all amplified with the same primers. If that is right, then I would suggest:

  1. Denoise each run separately but with the same parameters. You will need to determine the dada2 parameters based on all runs, and then run the denoising for each run in a separate step.
  2. Merge the resulting feature tables and rep-seqs files into one.

Then you can use the merged table for all types of analysis, including rarefaction.
Please feel free to ask any additional questions (but if the topic is different, it is better to create a new question).
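Assuming single-end reads, those two steps might look roughly like the following sketch in the QIIME 2 CLI. All file names are hypothetical placeholders, and the trim/trunc values shown are only examples — they have to be chosen from your own quality plots:

```shell
# Step 1: denoise each run separately, with the same parameters
# (placeholder file names and placeholder trim/trunc values)
for run in run1 run2 run3; do
  qiime dada2 denoise-single \
    --i-demultiplexed-seqs demux-${run}.qza \
    --p-trim-left 15 \
    --p-trunc-len 150 \
    --o-table table-${run}.qza \
    --o-representative-sequences rep-seqs-${run}.qza \
    --o-denoising-stats stats-${run}.qza
done

# Step 2: merge the per-run feature tables and representative sequences
qiime feature-table merge \
  --i-tables table-run1.qza --i-tables table-run2.qza --i-tables table-run3.qza \
  --o-merged-table table-merged.qza
qiime feature-table merge-seqs \
  --i-data rep-seqs-run1.qza --i-data rep-seqs-run2.qza --i-data rep-seqs-run3.qza \
  --o-merged-data rep-seqs-merged.qza
```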


Hello Timur!

Thank you so much for your rapid response!

Yes, exactly! The samples were amplified with the same kit, but in different runs.

I have performed the dada2 step separately for each run, as you suggest. However, I did not use the same parameters (trim and trunc) for every run, since I chose the parameters by checking the demux quality plot of each one. I first thought of using the same parameters, as you mention, so every read would have the same length; however, one of my runs does not have really good quality reads, so if I truncate the reads from all the runs at the same length, I will lose a lot of really good quality reads in the other runs. I hope I am explaining myself well enough...

In one run the quality falls really early, so the truncation length used there would be shorter, and when applying that trunc value to the other runs, I would lose reads whose quality has not dropped yet.

Oh, ok! So I can merge the dada2 output files and do the rarefaction step on the merged table! Perfect!

Finally, should I work with a taxonomic classification (.csv file from the barplot) that has already been rarefied, or is rarefaction only a normalization step for computing alpha and beta diversity? I mean, the taxonomic tables I have to work with are the ones I get when applying my classification command to the rep-seqs files, right? Without prior rarefying...

Thank you!!

That's perfect!

Not necessarily, as long as the overlapping region is still big enough to merge the reads. You can check the output stats to see whether you are losing a lot of reads.

You do not need to rerun the classification separately for the rarefied table. You can use the one you created from the feature table before rarefaction.
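In other words, classification runs on the representative sequences, not on the (rarefied) table, so it only needs to happen once. A sketch with hypothetical file names, assuming a vsearch-based workflow (note that newer QIIME 2 releases may also require an `--o-search-results` output for this command):

```shell
# Classify the representative sequences once, on the unrarefied data
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs-merged.qza \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classification taxonomy.qza

# The taxonomy barplot then uses the unrarefied table plus that taxonomy
qiime taxa barplot \
  --i-table table-merged.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization taxa-barplot.qzv
```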

Okay! The thing is that when I check the stats files, I only retain 30% of the reads, but this could be because the samples are from ITS regions... does this make sense?

I'm so sorry, but I am really confused by the rarefaction step. I don't understand how rarefaction can come before taxonomic classification when the input table for the taxonomic step is not the output of the rarefaction step. I know that rarefying is done for normalization and that it helps us identify outlier samples, right? So, when I look at the table.qzv output from dada2 and choose the sampling depth I need to use (based on feature counts, as mentioned in the tutorials)... if I see that 2 samples would be left out with that parameter, should I keep them for the following taxonomic analysis, or just remove them?

And with alpha rarefaction... what happens if I see that some of my samples don't reach a plateau? Should I still analyze them?

So sorry.. but I'm really confused...

It is recommended not to truncate at all when processing the ITS region, due to its high length variability. Could you try running it without truncating to see if that improves your stats?
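Disabling truncation in dada2 is done by setting the truncation length to 0; a sketch with hypothetical file names (the trim value is just a placeholder):

```shell
# ITS reads vary a lot in length, so skip truncation entirely
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-run1.qza \
  --p-trim-left 15 \
  --p-trunc-len 0 \
  --o-table table-run1.qza \
  --o-representative-sequences rep-seqs-run1.qza \
  --o-denoising-stats stats-run1.qza
```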

Taxonomic classification can be performed before rarefaction. Rarefaction is a kind of data normalization applied before calculating diversity metrics. You do not need to use the rarefied table for all analyses unless it is specified. You need to choose the sampling depth as a compromise between the number of samples retained and the sequencing depth.
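To make concrete what rarefying actually does, here is a small Python illustration (not a QIIME 2 command, just a stdlib sketch): each sample's per-feature read counts are randomly subsampled without replacement down to the chosen depth, and samples with fewer total reads than that depth are dropped.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Subsample a list of per-feature read counts to a fixed depth, without replacement."""
    if sum(counts) < depth:
        return None  # sample would be dropped, as in QIIME 2 rarefaction
    rng = random.Random(seed)
    # expand to one entry per read, draw `depth` of them, and re-tally per feature
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    picked = rng.sample(reads, depth)
    tally = Counter(picked)
    return [tally.get(i, 0) for i in range(len(counts))]

sample = [500, 300, 200]      # read counts for three features in one sample
rarefied = rarefy(sample, depth=100)
print(sum(rarefied))          # prints 100: every kept sample ends at the same depth
```

This is why the depth is a compromise: a higher depth keeps more reads per sample but drops more low-depth samples entirely.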

You should base the decision on most of your samples. There is no need to remove samples that are not reaching a plateau.

Oh, I am still confused with most of my analyses :grimacing:

Hi @MiriamGorostidi
Unfortunately, we had some issues with the forum, and the support team restored it from a backup, so we lost your last answer. Luckily, I have it in my email box, so I will post it again.


Oh ok! I did not know anything about those issues! Thank you for posting it again :slight_smile:

Do you have any clue about what I should do regarding those parameters in my analysis?

Just got an excellent hint from @llenzi regarding your case. I will quote him for you.

One additional note: earlier you wrote that you lost about 70% of the reads by truncating the sequences. Did you try setting a lower truncation parameter? It can help you increase the number of sequences at the end, since all sequences shorter than this number are filtered out.

Thank you so much for your time!!

Let's say then that it is recommended to truncate the sequences, especially truncating all of them at the same length (so I will try the analysis again using the same trunc parameter for the 3 runs, trying to keep as many reads as I can in the lower-quality run).

Yes, using PGM sequences is a bit rare, but it is the sequencer we have available right now... I got the idea that trimming 15 bases was enough from this discussion in the forum:

I suppose that would be enough. However, I now have another analysis where I know that my primers are 20 nucleotides long, so should I trim 20 bases from those sequences?
Besides that, what is ITSxpress? I have not heard about it yet.

I have counted, for each sample, how many different taxa are identified (counts > 0) at the genus level, with and without truncation. (Maybe this is not the best method...)

About the classifier, I have tried different ones, but I finally decided to use feature-classifier classify-consensus-vsearch, based on a previous forum discussion. Should I give blast a try?

This makes the rarefaction step much clearer to me, so thank you so much for the explanation and the discussion!

Hi @MiriamGorostidi
Just to qiime in on a couple of points; @timanix, please add anything you feel is missing!

On ITSxpress, please see:

There is a q2-itsxpress plug in at:

Please note, you may need an older version of QIIME 2 to work with this; I am not sure.
There are already a few threads on this.
A possible full tutorial is:

It basically extracts the ITS region, removing any extra bases around it, which may be useful to clean up the sequences for the taxonomy step.

On the classifier, yes, I would give the blast classifier a try on your current rep-seqs. My understanding is that vsearch still uses a global alignment, in which query and target need to match from head to tail, while blast will focus on matching possible sub-sequences of your rep set against the sequences in the database.
Good luck, and let us know of any progress.


Hi @llenzi

Thank you for your explanation!

I spent some time researching the ITSxpress function, and it seems that the command is not prepared for Ion Torrent sequences.

I finally decided to analyze each run on its own and then merge all the taxonomic results and work with them together. I'm not totally sure about this, but I don't know how else to manage it.

What I am doubting the most is the rarefaction parameter. If I would like to publish the results, I should clarify how I rarefied the reads, but if I have done the analysis separately, how am I supposed to explain this?

Thank you so much in advance :slight_smile:

Hi @MiriamGorostidi,

great detective work so far! I did not know that ITSxpress is not meant for Ion Torrent sequences.

I think your plan should work, and more or less your pipeline will be the following:
I) Denoise each run with dada2 denoise-pyro, using the same parameters
II) Merge the obtained feature tables
III) Before merging the rep-seqs for each run, reorient the sequences using your database as a reference
This is important because sklearn expects all reads in the same orientation and can be confused if that is not the case. You can use the RESCRIPt plugin to do that (please look at 'orient sequence by alignment to reference' in the tutorial Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt)
IV) Plot the rarefaction curve using the merged feature table, and use it to define the rarefaction threshold comparing all the samples. That should avoid any question from the reviewer on your doubt ;).
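Steps II-IV might look roughly like the following sketch (hypothetical file names throughout; the orient-seqs output names and the max-depth value are assumptions to adapt to your own data and RESCRIPt version):

```shell
# II) merge the per-run feature tables
qiime feature-table merge \
  --i-tables table-run1.qza --i-tables table-run2.qza --i-tables table-run3.qza \
  --o-merged-table table-merged.qza

# III) reorient each run's rep-seqs against the reference, then merge them
qiime rescript orient-seqs \
  --i-sequences rep-seqs-run1.qza \
  --i-reference-sequences ref-seqs.qza \
  --o-oriented-seqs rep-seqs-run1-oriented.qza \
  --o-unmatched-seqs rep-seqs-run1-unmatched.qza
qiime feature-table merge-seqs \
  --i-data rep-seqs-run1-oriented.qza --i-data rep-seqs-run2-oriented.qza \
  --o-merged-data rep-seqs-merged.qza

# IV) rarefaction curve on the merged table, to pick one depth for all runs
qiime diversity alpha-rarefaction \
  --i-table table-merged.qza \
  --p-max-depth 10000 \
  --o-visualization alpha-rarefaction.qzv
```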

Oh, I forgot: all of the above comes after removing the PCR primers with cutadapt!
Let us know how it is going!



However, I am using the vsearch classification method, not sklearn. So, should I still use RESCRIPt?

Thank you!

I will let you know! :slight_smile:

Hi @MiriamGorostidi,
I suggest still using RESCRIPt, to keep things easier. With vsearch you are using an OTU-clustering-like approach, and I honestly don't remember whether vsearch looks for similarities in both directions. So, when in doubt, I'd say still use RESCRIPt to reorient the sequences.


Hi @MiriamGorostidi
I got a tip from @Nicholas_Bokulich: among the vsearch options there is one that specifies searching in both directions!
Specifically, the option is '--p-strand both'
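A classify-consensus-vsearch call with that flag might look like this (hypothetical file names; newer QIIME 2 releases may also require an `--o-search-results` output):

```shell
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs-merged.qza \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --p-strand both \
  --o-classification taxonomy.qza
```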


Perfect! Thank you, I will add the --p-strand both option to my code!
