Rarefying when having multiple runs

MiriamGorostidi · February 7, 2022, 9:11am

Hey you all!

I have a considerable number of samples sequenced in different runs (each sample is sequenced once, I don't have duplicates). But now I have some doubts about how to apply the rarefaction step.

How should I rarefy the different runs? If I want to publish my results, do I have to explain that I used a different rarefaction for each run, or is it possible to rarefy all the samples together?

Thank you in advance

timanix · February 7, 2022, 9:41am

Hello!
As I understand it, you have samples from different runs, but they were all amplified with the same primers. If that is right, I would suggest:

Denoise each run separately but with the same parameters. You will need to choose the dada2 parameters based on all runs, then run dada2 on each run as a separate step.
Merge the resulting feature tables into one table, and the rep-seqs files into one file.

You can then use the merged files for all types of analysis, including rarefaction.
Please feel free to ask any additional questions (but if the topic is different, it is better to create a new question).
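
The merge step described above could be sketched roughly as follows. This is only a sketch: it assumes three runs, and the per-run file names table.qza and rep-seqs.qza and the function name merge_runs are hypothetical. The two qiime actions themselves (feature-table merge and feature-table merge-seqs) are part of q2-feature-table.

```shell
# Merge the DADA2 outputs of three runs into one table and one rep-seqs file.
# $1, $2, $3 = per-run directories, each assumed to hold table.qza and
#              rep-seqs.qza (hypothetical names); $4 = existing output directory.
merge_runs() {
  qiime feature-table merge \
    --i-tables "$1/table.qza" \
    --i-tables "$2/table.qza" \
    --i-tables "$3/table.qza" \
    --o-merged-table "$4/table-merged.qza"

  qiime feature-table merge-seqs \
    --i-data "$1/rep-seqs.qza" \
    --i-data "$2/rep-seqs.qza" \
    --i-data "$3/rep-seqs.qza" \
    --o-merged-data "$4/rep-seqs-merged.qza"
}
```

It would be called as, for example, merge_runs PGM235 PGM239 PGM240 Merged. By default the table merge stops with an error if the same sample ID appears in more than one table, which is the behaviour you want when each sample was sequenced only once.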

MiriamGorostidi · February 7, 2022, 12:06pm

Hello Timur!

Thank you so much for your quick response!

Yes, exactly! The samples were amplified with the same kit, but in different runs.

I have performed the dada2 step separately for each run, as you suggest. However, I did not use the same parameters (trim and trunc) for every run, since I chose them by checking the demux quality plot of each run. I first thought of using the same parameters, as you mention, so that every read would have the same length. However, one of my runs does not have very good quality reads, so if I cut all the reads from all the runs to the same length, I will lose a lot of really good quality reads in the other runs. I hope I am explaining myself well enough...

In one run the quality falls really early, so the length used there would be shorter, and when applying that trunc value to the other runs, I would lose reads whose quality has not dropped yet.

Oh, ok! So I can merge the Dada2 output files and make the rarefaction step together! Perfect!

Finally, should I work with a taxonomic classification (.csv file from the barplot) that has already been rarefied, or is rarefaction only a normalization step for computing alpha and beta diversity? I mean, the taxonomic tables I have to work with are the ones I get when applying my classification command to the rep-seqs files, right? Without prior rarefying...

Thank you!!

timanix · February 7, 2022, 12:25pm

That's perfect!

Not necessarily, if the overlapping region is still big enough to merge the reads. You can check the output stats to see whether you are losing a lot of reads.

You do not need to rerun classification separately for the rarefied table. The taxonomy you create before rarefaction can also be used with the rarefied feature table.

MiriamGorostidi · February 7, 2022, 2:07pm

Okay! The thing is that when I check the stats files, I only retain 30% of the reads, but this could be because the samples are from ITS regions... Does this make sense?

I'm so sorry, but I am really confused by the rarefaction step. I don't understand how rarefaction has to be performed prior to taxonomic classification when the input table for the taxonomic step is not the output of the rarefaction step. I know that rarefying is done for normalization and that it would help us to identify outlier samples, right? So, when I look at the table.qzv output from Dada2 and choose the sampling depth I need to use (based on feature counts, as mentioned in the tutorials), if I see that 2 samples would be dropped with that parameter, should I keep them for the following taxonomic analysis? Or just remove them?

And with alpha rarefaction... What happens if I see that some of my samples don't reach a plateau? Should I analyze them?

So sorry.. but I'm really confused...

timanix · February 7, 2022, 3:20pm

It is recommended not to truncate at all when processing the ITS region, due to its high length variability. Could you try running it without truncating to see if that improves your stats?

Taxonomy classification can be performed before rarefaction. Rarefaction is a kind of data normalization applied before calculating diversity metrics. You do not need to use the rarefied table for every analysis unless it is specified. You need to choose the sampling depth as a compromise between the number of samples retained and the sequencing depth.

Base the decision on the majority of your samples. There is no need to remove samples that do not reach a plateau.
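
To make the split between rarefied and unrarefied data concrete, here is a rough sketch. The function names, file names and the depth in the usage note are hypothetical, so adapt them to your own data. Since ITS data usually has no phylogenetic tree, the sketch uses the non-phylogenetic diversity core-metrics action, which rarefies the table internally at the depth you give it.

```shell
# Rarefy a feature table to a fixed depth; samples with fewer reads are dropped.
# $1 = input table (.qza), $2 = sampling depth, $3 = output rarefied table (.qza)
rarefy_table() {
  qiime feature-table rarefy \
    --i-table "$1" \
    --p-sampling-depth "$2" \
    --o-rarefied-table "$3"
}

# Non-phylogenetic alpha and beta diversity, rarefied internally at the same depth.
# $1 = UNRAREFIED table (.qza), $2 = sampling depth,
# $3 = sample metadata (.tsv), $4 = output directory (created by QIIME 2)
core_diversity() {
  qiime diversity core-metrics \
    --i-table "$1" \
    --p-sampling-depth "$2" \
    --m-metadata-file "$3" \
    --output-dir "$4"
}
```

For example: core_diversity Merged/table-merged.qza 5000 samples-metadata.tsv Core-metrics, where 5000 is only a placeholder for the depth chosen from table.qzv. The taxonomy barplot would still be built from the unrarefied table.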

Oh, I am still confused with most of my analyses

timanix · February 9, 2022, 8:00am

Hi @MiriamGorostidi
Unfortunately, we had some issues with the forum and the support team restored it from a backup, so we lost your last answer. Luckily, I have it in my email inbox, so I will post it again.

MiriamGorostidi:

Yes, of course!

Here is my command, which I have applied to each of the 3 runs I told you about:

*I have used dada2 denoise-pyro, feature-classifier classify-consensus-vsearch, and UNITE as the database (since in one forum post those were the options recommended for Ion Torrent sequences):

#!/bin/bash
# Define the general directory path where we will do the analysis.
# An absolute path is used so that ${DIR}/... still resolves after the cd below.
DIR="/home/desktop/Mycobiota"

# Move to the directory path (stop if it does not exist):
cd "${DIR}" || exit 1

################################### DADA 2 ##############################################
mkdir -p Dada2_output
####### DADA 2 Pyro without Trunc Len  ###########

qiime dada2 denoise-pyro \
  --i-demultiplexed-seqs ${DIR}/samples.qza \
  --p-trunc-len 0 \
  --p-trim-left 15 \
  --p-n-threads 2 \
  --o-representative-sequences ${DIR}/Dada2_output/rep-seqs-pyro-noTrun.qza \
  --o-table ${DIR}/Dada2_output/table-pyro-noTrun.qza \
  --o-denoising-stats ${DIR}/Dada2_output/stats-dada2-pyro-noTrun.qza \
  --verbose

#Visualizing Resulting data...
#Converting DADA2 artifact .qza to .qzv
qiime feature-table summarize \
  --i-table ${DIR}/Dada2_output/table-pyro-noTrun.qza \
  --o-visualization ${DIR}/Dada2_output/table-pyro-noTrun.qzv \
  --m-sample-metadata-file ${DIR}/samples-metadata.tsv
qiime feature-table tabulate-seqs \
  --i-data ${DIR}/Dada2_output/rep-seqs-pyro-noTrun.qza \
  --o-visualization ${DIR}/Dada2_output/rep-seqs-pyro-noTrun.qzv
qiime metadata tabulate \
  --m-input-file ${DIR}/Dada2_output/stats-dada2-pyro-noTrun.qza \
  --o-visualization ${DIR}/Dada2_output/stats-dada2-pyro-noTrun.qzv

################################### IMPORTING DataBases and creating Classifiers ###################################
######## UNITE DB #################
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path ${DIR}/ITS_UNITEdatabase/sh_qiime_release_s_all_04.02.2020/sh_refs_qiime_ver8_dynamic_s_all_04.02.2020.fasta \
--output-path ${DIR}/ITS_UNITEdatabase/unite_dyn_refs.qza

qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path ${DIR}/ITS_UNITEdatabase/sh_qiime_release_s_all_04.02.2020/sh_taxonomy_qiime_ver8_dynamic_s_all_04.02.2020.txt \
--output-path ${DIR}/ITS_UNITEdatabase/unite_dyn_taxa.qza

# Optional: the vsearch classification below does not use this naive Bayes classifier.
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ${DIR}/ITS_UNITEdatabase/unite_dyn_refs.qza \
--i-reference-taxonomy ${DIR}/ITS_UNITEdatabase/unite_dyn_taxa.qza \
--o-classifier ${DIR}/ITS_UNITEdatabase/unite_classifier.qza

################################### TAXONOMIC CLASSIFICATION ##############################################
mkdir -p Taxonomic-Analysis-vsearch
############### feature-classifier classify-consensus-vsearch ######################
###############UNITE DADA 2 Pyro without Trunc Len ######################
qiime feature-classifier classify-consensus-vsearch \
 --i-query ${DIR}/Dada2_output/rep-seqs-pyro-noTrun.qza \
 --i-reference-reads ${DIR}/ITS_UNITEdatabase/unite_dyn_refs.qza \
 --i-reference-taxonomy ${DIR}/ITS_UNITEdatabase/unite_dyn_taxa.qza \
 --o-classification ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qza

qiime metadata tabulate \
  --m-input-file ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qza \
  --o-visualization ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qzv

qiime taxa barplot \
  --i-table ${DIR}/Dada2_output/table-pyro-noTrun.qza \
  --i-taxonomy ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qza  \
  --m-metadata-file ${DIR}/samples-metadata.tsv \
  --o-visualization ${DIR}/Taxonomic-Analysis-vsearch/taxa-bar-plot-pyro-noTrun-unite-vsearch.qzv

When I say that I have identified more taxonomic classifications, this is what I have done:

Once I have run the previous script for each run and obtained the taxonomic classifications of the 3 runs, I download the level-6.csv file from each taxa barplot (.qzv file) and calculate the relative values. Using that table, I count how many genera have a relative value >0 in each sample, and compare the truncated and untruncated results. Our main aim is to see which of the methods obtains more classifications. (I hope this approach makes sense)
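
The per-sample count described above could be scripted roughly like this. It is a sketch under a few assumptions: the first column of level-6.csv holds the sample ID, taxon columns are recognised by a header starting with k__ (as in UNITE taxonomy strings), and no field contains a quoted comma. The function name count_genera is hypothetical. A relative value above 0 is the same as a raw count above 0, so the raw counts are used directly.

```shell
# Print "sample,number_of_taxa_with_count_above_zero" for each row of a
# level-6.csv exported from the taxa barplot.
# $1 = path to the CSV file
count_genera() {
  awk -F',' '
    NR == 1 {
      # Remember which columns are taxa (header starts with k__);
      # metadata columns and the Unassigned column are skipped.
      for (i = 2; i <= NF; i++) if ($i ~ /^k__/) taxon[i] = 1
      next
    }
    {
      n = 0
      for (i in taxon) if ($i + 0 > 0) n++
      print $1 "," n
    }
  ' "$1"
}
```

For example: count_genera level-6-fungi-TRUN-PGM240.csv. Note that this counts every level-6 column with reads, including taxa whose genus is unidentified, so it measures the number of distinct level-6 labels rather than how well each read was classified.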

Is it necessary to upload the .qza files? This analysis is part of a project, and I am not sure whether it can be made public at the moment...

Instead, here are the .csv files I am working with:
level-6-fungi-TRUN-PGM240.csv (19.7 KB)
level-6-fungi-noTrun-PGM240.csv (17.7 KB)

level-6-fungi-TRUN-PGM239.csv (17.0 KB)
level-6-fungi-noTrun-PGM239.csv (14.0 KB)

level-6-fungi-noTrun-PGM235.csv (35.7 KB)
level-6-fungi-TRUN-PGM235.csv (32.2 KB)

I get the following results:
PGM 240 run: 15/22 samples identify more taxonomies with trunc and 7/22 more without trunc.
PGM 239 run: 21/23 samples identify more taxonomies with trunc and 2/23 more without trunc.
PGM 235 run: 15/30 samples identify more taxonomies with trunc and 15/30 more without trunc.

When merging the results for all the samples into one table, I find that 51 samples identify more taxonomies when truncating and 24 when not truncating (moreover, almost all the negative controls of the different runs are among these 24, and they are not actual samples).

I hope I have explained this well

MiriamGorostidi · February 9, 2022, 9:17am

Oh ok! I did not know anything about those issues! Thank you for posting it again

Do you have any suggestions about what I should do regarding those parameters in my analysis?

timanix · February 9, 2022, 9:23am

Just got an excellent hint from @llenzi regarding your case. I will quote him for you.

llenzi:

In this case the OP is using PGM sequences, which may vary in length. denoise-pyro is comparable to denoise-single, working on the full length, so it may be a good thing to truncate all the sequences at the same length. The usual suggestion not to truncate is for denoise-paired, where, given the variable length of ITS, you could lose all samples at the merging step.
For the low-quality run, I would probably use it to select the cut-off. Applying a shorter cut-off to the other runs should reduce the length of their sequences, not their number (at least that is my understanding), but of course they may lose resolution in taxonomy.
I have never used PGM sequences and I am not familiar with the library preparation. What I don't know is whether trimming 15 bases gets rid of all the artificial sequences in the reads. Also, I wonder if using ITSxpress would help to normalize the sequences going into taxonomy assignment to the same region across the study.
When they say 'identify more taxonomies', what do they mean? Fewer unclassified sequences, or more sequences classified at species level?
I would try the blast classifier instead of sklearn; a local alignment approach could give more coherent results across runs if they still have spurious/low-quality sequences in there.
On the rarefaction step, I agree with what you said: it is possible to perform the classification on the rep-seqs output from dada2, as the largest sequence dataset, and use it throughout the analysis. The different ASV tables will select the relevant sequences in each case. However, they may want to compare the runs after the rarefaction step, using it as a kind of normalization among runs, I think.
Looking at their files, they are looking at level 6 only; they should probably look at levels 4 and 5 too, to get the full picture. From these results I would gather that trimming the sequences reduces the low-quality sequences, with a beneficial effect on the taxonomic resolution at genus level (at least).

timanix · February 9, 2022, 10:45am

One additional point. Earlier you wrote that you lost about 70% of the reads by truncating the sequences. Did you try setting a lower truncation length? It can help you to increase the number of sequences at the end, since all sequences that are shorter than this length are filtered out.
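
To get a rough idea of how many reads a given truncation length would discard before running dada2, something like the following could help. It is only a sketch: the function name reads_kept is hypothetical, and it assumes plain 4-line FASTQ records on standard input (for example, the per-sample files obtained by exporting the demux artifact).

```shell
# Count reads at least as long as a candidate truncation length.
# Reads FASTQ on stdin; $1 = candidate trunc-len.
# Prints "<reads kept> <total reads>".
reads_kept() {
  awk -v len="$1" '
    NR % 4 == 2 {                    # line 2 of each 4-line record is the sequence
      total++
      if (length($0) >= len) kept++
    }
    END { printf "%d %d\n", kept, total }
  '
}
```

For example: gzip -dc sample.fastq.gz | reads_kept 200, where the file name and the value 200 are placeholders. This only looks at read length, so the quality filtering inside dada2 will remove further reads on top of this.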

MiriamGorostidi · February 11, 2022, 11:46am

Thank you so much for your time!!

Let's say then that it is advisable to truncate the sequences, and in particular to truncate all of them at the same length (so I will run the analysis again using the same trunc parameter for the 3 runs, trying to keep as many reads as I can in the lower-quality run).

Yes, using PGM sequences is a bit unusual, but it is the sequencer we have available right now... I concluded that trimming 15 bases was enough from this discussion in the forum:

I suppose it would be enough. However, I now have another analysis where I know that my primers are 20 nucleotides long, so should I trim 20 bases from the sequences?
Besides that, what is ITSxpress? I have not heard of it yet.

For each sample, I have counted how many different taxonomies are identified (counts >0) at genus level with and without truncation. (Maybe this is not the best method...)

About the classifier, I have tried different ones, but I finally decided to use feature-classifier classify-consensus-vsearch, based on the previous forum discussion. Should I give blast a try?

This makes the rarefaction step clearer to me, so thank you so much for the explanation and the discussion!
Best!!

llenzi · February 11, 2022, 1:29pm

Hi @MiriamGorostidi
Just to qiime in on a couple of points, @timanix please add anything you feel is missing!

On ITSxpress, please see:

There is a q2-itsxpress plug in at:
https://library.qiime2.org/plugins/q2-itsxpress/8/

Please note, you may need an older version of qiime2 to work with this, I am not sure.
There are a few threads on this already.
A possible full tutorial is:

It basically extracts the ITS region, removing any extra bases around it, which may be useful to clean up the sequences for the taxonomy step.

On the classifier, yes, I would give the blast classifier a try on your current rep-seqs. My understanding of vsearch is that it still uses a global alignment, in which query and target need to match from head to tail, while blast will focus on matching possible sub-sequences of your rep set to the sequences in the database.
Good luck and let us know any progress.
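
As a rough sketch of the blast suggestion above, the call could look like this. The function name classify_blast is hypothetical. The sketch passes --output-dir rather than naming each output, because the set of outputs differs between QIIME 2 releases; please check qiime feature-classifier classify-consensus-blast --help for your version.

```shell
# Classify rep-seqs with the BLAST+ consensus classifier (local alignment).
# $1 = query rep-seqs (.qza), $2 = reference reads (.qza),
# $3 = reference taxonomy (.qza), $4 = output directory (created by QIIME 2)
classify_blast() {
  qiime feature-classifier classify-consensus-blast \
    --i-query "$1" \
    --i-reference-reads "$2" \
    --i-reference-taxonomy "$3" \
    --output-dir "$4"
}
```

With the file names used earlier in this thread, that would be something like classify_blast Dada2_output/rep-seqs-pyro-noTrun.qza ITS_UNITEdatabase/unite_dyn_refs.qza ITS_UNITEdatabase/unite_dyn_taxa.qza Taxonomic-Analysis-blast, where the last directory name is made up for the example.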

Cheers
Luca

MiriamGorostidi · March 1, 2022, 8:48am

Hi @llenzi

Thank you for your explanation!

I spent some time researching the ITSxpress function, and it seems that the command is not designed for Ion Torrent sequences.

I finally decided to analyze each run on its own and then merge all the taxonomic results and work with them together. I'm not totally sure about this, but I don't know how else to manage it.

What I doubt the most is the rarefaction parameter. If I want to publish the results, I should state the depth at which I rarefied the reads, but if I have done the analysis separately, how am I supposed to explain this?

Thank you so much in advance

llenzi · March 1, 2022, 9:44am

Hi @MiriamGorostidi,

great detective work so far, I did not know that ITSxpress is not meant for Ion Torrent sequences.

I think your plan should work, and your pipeline will be more or less the following:
I) Denoise each run with dada2 denoise-pyro, using the same parameters
II) Merge the resulting feature tables
III) Before merging the rep-seqs of each run, reorient the sequences using your database as the reference
This is important because sklearn expects all reads to be in the same orientation and can be confused if that is not the case. You can use the RESCRIPt plugin to do that (please look at 'orient sequence by alignment to reference' in the tutorial Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt)
IV) Plot the rarefaction curve using the merged feature table, and use it to define the rarefaction threshold by comparing all the samples. That should avoid any questions from the reviewers about your doubt ;).

Oh, I forgot: all of the above comes after removing the PCR primers with cutadapt!
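
The steps above that have not appeared as commands in this thread yet (primer removal, reorienting, and the rarefaction curve) could be sketched as below. The function names are hypothetical, and the primer sequence and maximum depth are left as arguments because they are not given here. The sketch assumes single-end reads and that the q2-cutadapt and RESCRIPt plugins are installed.

```shell
# Remove a 5' PCR primer from single-end demultiplexed reads.
# $1 = demultiplexed reads (.qza), $2 = primer sequence, $3 = trimmed output (.qza)
trim_primer() {
  qiime cutadapt trim-single \
    --i-demultiplexed-sequences "$1" \
    --p-front "$2" \
    --o-trimmed-sequences "$3"
}

# Reorient rep-seqs against the reference database before merging runs.
# $1 = rep-seqs (.qza), $2 = reference reads (.qza),
# $3 = oriented output (.qza), $4 = output for sequences that did not match (.qza)
orient_rep_seqs() {
  qiime rescript orient-seqs \
    --i-sequences "$1" \
    --i-reference-sequences "$2" \
    --o-oriented-seqs "$3" \
    --o-unmatched-seqs "$4"
}

# Alpha rarefaction curves on the merged table, to choose one depth for all runs.
# $1 = merged table (.qza), $2 = max depth, $3 = metadata (.tsv), $4 = output (.qzv)
rarefaction_curve() {
  qiime diversity alpha-rarefaction \
    --i-table "$1" \
    --p-max-depth "$2" \
    --m-metadata-file "$3" \
    --o-visualization "$4"
}
```

Because the curve is drawn from the merged table, the sampling depth read off it applies to all runs at once, so a single value can be reported in the methods.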
Let us know how it is going!
Cheers
Luca

MiriamGorostidi · March 3, 2022, 1:30pm

Perfect!

However, I am using the vsearch classification method, not sklearn. So, should I still use RESCRIPt?

Thank you!

I will let you know!

llenzi · March 3, 2022, 2:38pm

Hi @MiriamGorostidi,
I suggest still using RESCRIPt, to make things easier. With vsearch you are using an OTU-clustering-like approach. I honestly don't remember whether vsearch looks for similarities in both directions. So, when in doubt, I'd say still use RESCRIPt to reorient the sequences.

Cheers
Luca

llenzi · March 3, 2022, 4:13pm

Hi @MiriamGorostidi
I got a tip from @Nicholas_Bokulich: among the vsearch options there is one that specifies searching in both directions!
Specifically, the option is '--p-strand both'
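
Added to the vsearch classification command, that option would look roughly like this. This is a sketch with a hypothetical function name; --output-dir is used instead of naming each output, since newer QIIME 2 releases return more than one output from this action.

```shell
# vsearch consensus classification, searching both strands of each query.
# $1 = query rep-seqs (.qza), $2 = reference reads (.qza),
# $3 = reference taxonomy (.qza), $4 = output directory (created by QIIME 2)
classify_vsearch_both_strands() {
  qiime feature-classifier classify-consensus-vsearch \
    --i-query "$1" \
    --i-reference-reads "$2" \
    --i-reference-taxonomy "$3" \
    --p-strand both \
    --output-dir "$4"
}
```

Searching both strands means that reads in either orientation can be matched to the reference, at the cost of a somewhat slower search.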

Best,
Luca

MiriamGorostidi · March 7, 2022, 9:05am

Perfect! Thank you, I will add the --p-strand both option to my code!