Next steps regarding ASVs or OTUs?

Thanks for the kind responses.
I found that I have 30 samples, each contained in its own directory.
So I gathered them into one directory and ran the import command
(shown after the manifest) with an extended manifest file:
sample-id,absolute-filepath,direction
91,$PWD/91/159-78_S76_L001_R1_001.fastq,forward
91,$PWD/91/159-78_S76_L001_R2_001.fastq,reverse
92,$PWD/91/159-79_S77_L001_R1_001.fastq,forward
92,$PWD/91/159-79_S77_L001_R2_001.fastq,reverse
....
119,$PWD/91/160-11_S10_L001_R1_001.fastq,forward
119,$PWD/91/160-11_S10_L001_R2_001.fastq,reverse
120,$PWD/91/160-12_S11_L001_R1_001.fastq,forward
120,$PWD/91/160-12_S11_L001_R2_001.fastq,reverse
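
For completeness, the import command I used was along these lines (manifest_A4.csv is just a placeholder for my manifest filename; older QIIME 2 releases spell the last flag --source-format instead of --input-format):

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest_A4.csv \
  --output-path demux_A4.qza \
  --input-format PairedEndFastqManifestPhred33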

Then I summarized it into a .qzv visualization (command shown below):
demux_A4.qzv (287.7 KB)
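
The summary itself came from something like:

qiime demux summarize \
  --i-data demux_A4.qza \
  --o-visualization demux_A4.qzv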

After this, I wanted to denoise the sequences to harvest OTUs (or a feature table?). So I followed the DADA2 procedure:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux_A4.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 250 \
  --o-table table_A4.qza \
  --o-representative-sequences rep-seqs_A4.qza \
  --o-denoising-stats denoising-stats_A4.qza

denoising-stats_A4.qzv (1.2 MB)
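
(For reference, I am assuming the stats visualization above came from tabulating the DADA2 stats output, roughly:

qiime metadata tabulate \
  --m-input-file denoising-stats_A4.qza \
  --o-visualization denoising-stats_A4.qzv
)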

It took a long time, but I think I made a mistake in choosing the parameters or in some other part,
because when I run the clustering command:
qiime vsearch cluster-features-open-reference \
  --i-table table_A4.qza \
  --i-sequences rep-seqs_A4.qza \
  --i-reference-sequences 85_otus.qza \
  --p-perc-identity 0.85 \
  --o-clustered-table table-or-85.qza \
  --o-clustered-sequences rep-seqs-or-85.qza \
  --o-new-reference-sequences new-ref-seqs-or-85.qza

It finishes within 3 minutes, but judging from the tutorial it was supposed to take much longer.

So, these are the things I have wrestled with so far.
I am still a dabbler in this area, so I might be using awkward wording here and there.
But if you could give me a small clue toward the right direction, I would be very grateful!

Hi @fblues,
No worries about the terminology; you’ll pick it up in no time! Also, thanks for providing your output files, those are very helpful!

There are several things we need to consider in your situation.
First, are these 30 samples from the same sequencing run? With DADA2 you want to denoise samples from the same run together and shouldn’t combine samples from multiple runs. This isn’t an issue with OTU-picking methods.
OTU picking is fundamentally different from ASV creation using DADA2/Deblur, so we can’t really compare them; they do very different things. I would recommend sticking with denoising methods unless you have a very specific reason to do OTU picking. That being said, I would expect DADA2 to take longer than open-reference OTU picking, but 3 minutes is pretty quick. This is because your OTU-picking command uses the 85% reference database. That is a much smaller database and shouldn’t be used; I’m guessing you used it because it appears in the tutorial, but see the blue note-box in the tutorial explaining why 85% should NOT be used with real-life data and was only included for training purposes. If you must use OTU picking, choose a higher-identity database such as 97% or 99%, as in the sketch below.
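
For example, the same open-reference command pointed at a 97% reference would look something like this (97_otus.qza here is a placeholder for whichever 97% reference artifact you import):

qiime vsearch cluster-features-open-reference \
  --i-table table_A4.qza \
  --i-sequences rep-seqs_A4.qza \
  --i-reference-sequences 97_otus.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-or-97.qza \
  --o-clustered-sequences rep-seqs-or-97.qza \
  --o-new-reference-sequences new-ref-seqs-or-97.qza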

The stats summary you provided from DADA2 shows the majority of your reads being filtered out before even being denoised. I’m guessing this is because the quality of your reverse reads is rather poor and is forcing whole read pairs to be dropped. My suggestion is to abandon your reverse reads and use just your forward reads, since they are in much better condition; you will end up with many more sequences per sample.
As for picking DADA2 parameters, this topic has been discussed exhaustively on the forum, so have a quick search and read through those threads to get a better sense of how to choose them. If you have any further questions on top of those, we’ll be happy to help!
Good luck and keep us posted.

5 Likes

Thank you for the input, @Mehrbod_Estaki!!

Regarding the question about the sequencing run,
I will have to ask the provider and make sure.

At this point, I am using QIIME 2 to create a sparse data table
containing microbial frequencies, because my colleagues use R for the analysis.
Once I can extract it, I am planning to analyze the data with QIIME 2 myself.

I also tried using "97_otus.qza", but almost nothing was obtained.
Before that, though, I am a bit confused,
because I thought my goal (the frequency table) was the OTUs.
So is DADA2 not directly related to my purpose?

I think I must try this.
It feels like I have too few features.

I appreciate all the comments again!
They are very helpful and give me a new direction to go in.
To some of them I might not reply properly due to my lack of background.
I am also watching some online videos about the field.
I hope to keep improving steadily.

1 Like

Hi,
just one thing I noticed in your command, though it may be just a typo.
Are you using the output from the dada2 step as input for the vsearch clustering?
I’m asking because you would normally use either dada2 (for a de novo approach) or vsearch (for a closed-reference approach).
The correct input for vsearch would then be the same one you used for dada2: demux_A4.qza.
That would explain why your vsearch clustering step is so quick: it is actually working on a pre-processed set of sequences.
Luca

1 Like

Thanks @llenzi! I just want to point out that there is probably nothing technically wrong with clustering features produced by denoising tools like dada2 and deblur; for example, closed-reference clustering of your post-dada2 reads might be appropriate in some cases. In fact, this is why the clustering commands in the q2-vsearch plugin accept FeatureData[Sequence] and FeatureTable[Frequency], rather than SampleData[SequencesWithQuality]. A sketch of what that could look like follows below.
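
For illustration only, a closed-reference run on your dada2 outputs might look something like this (the 97% reference name is a placeholder for an artifact you would import yourself):

qiime vsearch cluster-features-closed-reference \
  --i-table table_A4.qza \
  --i-sequences rep-seqs_A4.qza \
  --i-reference-sequences 97_otus.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-cr-97.qza \
  --o-clustered-sequences rep-seqs-cr-97.qza \
  --o-unmatched-sequences unmatched-cr-97.qza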

Hi @fblues,

The output of both OTU-picking methods and denoising methods (DADA2/Deblur) is a table of features × frequencies. The difference is that in OTU picking we call the features OTUs, while with denoisers we refer to them as amplicon sequence variants (ASVs), among other names. The steps that follow are for the most part the same, and your colleagues can still use the ASV table in R and analyze it the same way. I would recommend reading this paper discussing the difference between the two and why OTUs should not be used.

Also, just to reinforce @thermokarst’s comment: if you really need to use
OTU-picking methods for some reason, you can, and in fact I would recommend, feeding the output of DADA2 into vsearch. But @llenzi is right that using the denoised feature table as the vsearch input would explain why it takes so much less time, since most of your reads have already been filtered out by that point.

1 Like

Thank you for the kind comments, @llenzi, @thermokarst, and @Mehrbod_Estaki!!

I am realizing that ASVs and OTUs are comparable objects in some sense.
I will read the paper over the weekend to decide which way I should go.

Also, I really want to try the approach that @Mehrbod_Estaki mentioned in his earlier advice.
So, in order to use only the forward reads, I have to use the FASTQ files labeled with R1 in their names.
Also, the input format would now be “SingleEndFastqManifestPhred33” with a corresponding manifest file.
Is this the right direction?
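
I am guessing the manifest would then look something like this (paths illustrative, following my earlier file naming):

sample-id,absolute-filepath,direction
91,$PWD/91/159-78_S76_L001_R1_001.fastq,forward
92,$PWD/91/159-79_S77_L001_R1_001.fastq,forward
....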

I will run the procedure and post the result to seek further advice.

1 Like

Hi @fblues,
You’re right that they are comparable, and hopefully after you read that paper you’ll realize that ASVs are the way to go :stuck_out_tongue: Some rare cases can still benefit from OTUs, but for the most part we should be looking toward ASVs by default.

Instead of re-importing your forward reads only, just use your existing paired-end demultiplexed file and run dada2 denoise-single; it will simply ignore your reverse reads and work on the forward reads alone, which saves you some hassle. A sketch is below.
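
A minimal sketch, reusing your earlier trim and truncation values (the output names are just suggestions, and you should revisit --p-trunc-len against your forward-read quality plot):

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux_A4.qza \
  --p-trim-left 13 \
  --p-trunc-len 250 \
  --o-table table_A4_forward.qza \
  --o-representative-sequences rep-seqs_A4_forward.qza \
  --o-denoising-stats denoising-stats_A4_forward.qza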
Good luck!

1 Like

Thank you, @Mehrbod_Estaki!

I actually tried it and obtained the following output.
table_A4_forward.qzv (371.0 KB)
From my observation, the number of features and the total frequency increased about threefold compared with before.
I believe this is a favorable situation.

From these outputs, is there any way
I can safely export the information into a format (an ASV or OTU table) that I can use in R?
I have used the following command:

qiime tools export \
  --input-path table-or-97_forward.qza \
  --output-path exported-feature-table_forward

But it creates a directory containing a BIOM-format file.
This file seems very small (around 130 KB),
and it generates an error when I try to import it.

This might be a lazy question, because
I feel similar questions have been asked by someone else before.
I will also search further on this topic myself.

Thanks again and sorry for bothering you too much!

Hi @fblues,
Great, that is much better!
QIIME 2 (like QIIME 1) works with .biom formats in the background, so you would have to convert those to a format readable in R. For example, you can use the biom convert tool that is already installed alongside your QIIME 2; a sketch is below. Even easier, see this nifty tool for importing QIIME 2 artifacts into R. It makes life a whole lot easier.
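
A minimal sketch of the biom route, assuming the export directory from your earlier command (qiime tools export names the file feature-table.biom inside it):

biom convert \
  -i exported-feature-table_forward/feature-table.biom \
  -o exported-feature-table_forward/feature-table.tsv \
  --to-tsv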

1 Like

Ah, thanks Matthew for pointing this out; I had not realized that yet. My bad.
I learned something new today!

Luca

1 Like
