Questions about Ion Torrent Data Analysis

MiriamGorostidi · April 28, 2021, 10:33am

Hey you all!

I'm a PhD student trying to figure out how to analyze Ion Torrent data. I think I have almost all of my final pipeline defined. However, I still need to answer one last question, and I think your advice could be really helpful.

My main doubt is whether to truncate my sequences or keep their full lengths. As a first approach, I used trunc-len as explained in the QIIME 2 tutorials, truncating the sequences where the quality score drops. However, when I came across the post where an Ion Torrent data pipeline is described, I read the following comment:

Could you clarify? Are you using extract-reads on the reference sequences? If so, don’t use trunc-len on those reads (since your query seqs are in mixed orientations they can hit at either end of that read)

Does this mean that trunc-len should not be used on Ion Torrent data in general?

Another question I have is whether it is necessary to do closed-reference OTU clustering (as described in that previous post), or whether it is possible to work directly with the ASVs created in the DADA2 step.

Thank you so much in advance! I'm a bit lost and I would appreciate your help.

Best,

Miriam

Sreevatshan · April 28, 2021, 12:37pm

Hey @MiriamGorostidi ,

Let me share my experience using Ion Torrent sequences in DADA2. I tried the trunc-len option and the results were terrible: only a small percentage of reads passed the filter. I think it is better to go with trunc-len 0 along with trim-left 15. For further reference, please check this one.

Note: I used single-end sequences for the analysis. I am also not an expert on this, so it may be different for other Ion Torrent sequences.

I think it depends on the question you are asking, but you can try both methods, or you can generate ASVs with DADA2 and then use those sequences for the OTU clustering.

Best,

Sreevatshan.

jwdebelius · April 28, 2021, 1:11pm

Hi @MiriamGorostidi,

I think that post specifically refers to the vsearch classifier they were training for ASVs.

IDK if it helps or makes things more complicated but I was just involved in a benchmark of a set of methods to scaffold multiple regions. If you can get the primers somehow, you may want to look at the sidle plugin, since this performed best for multiple region scaffolding.

Best,
Justine

MiriamGorostidi · April 28, 2021, 2:31pm

Hello @Sreevatshan, and thank you for your quick answer.

I'm using denoise-pyro and trim-left 15 too. When I use trunc-len 111 on my sequences, around 70% of them pass the filter, which is not that bad... I think. However, it is true that with trunc-len 0 this increases to 85-89%.

The thing is that, after I do the taxonomic analysis against the database (with the classify-consensus-vsearch method), I obtain the following results:

|Metric|DADA2 pyro --p-trunc-len 111|DADA2 pyro --p-trunc-len 150|DADA2 pyro --p-trunc-len 0|
|---|---|---|---|
|Total|4706|6361|928|
|Unassigned|3738|5427|483|
|Tax_quality|968|934|445|

The Total corresponds to the total number of FeatureIDs that have been found and are contained in the Taxonomy.qzv after VSEARCH.

The Unassigned row, therefore, counts the FeatureIDs whose taxonomy is "Unassigned".

Finally, the Tax_quality parameter refers to those FeatureIDs that have a Confidence > 0.5. (The idea was to see how many of the total FeatureIDs were of good quality.)
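
As a minimal sketch of how such a tally could be done with awk: this assumes the taxonomy has been exported to a tab-separated file with the feature ID, the taxon string, and a score in columns 1 to 3, and that Tax_quality is counted only among assigned features. The file name `taxonomy_toy.tsv`, the function name `tally_taxonomy`, and the rows are all made up for illustration and are not the real data.

```shell
# Toy taxonomy table (hypothetical rows): feature ID, taxon, score.
{
  printf 'Feature ID\tTaxon\tScore\n'
  printf 'f1\tk__Fungi;p__Ascomycota\t0.98\n'
  printf 'f2\tUnassigned\t1.0\n'
  printf 'f3\tk__Fungi;p__Basidiomycota\t0.40\n'
  printf 'f4\tUnassigned\t0.75\n'
} > taxonomy_toy.tsv

# Count all features, the unassigned ones, and the assigned ones scoring above 0.5.
tally_taxonomy() {
  awk -F '\t' '
    NR == 1 || /^#/ { next }                  # skip the header and any comment rows
    { total++ }
    $2 == "Unassigned" { unassigned++; next } # assumption: unassigned rows never count as good
    $3 + 0 > 0.5 { good++ }
    END { printf "Total=%d Unassigned=%d Tax_quality=%d\n", total, unassigned, good }
  ' "$1"
}

tally_taxonomy taxonomy_toy.tsv
```

On the toy table this prints `Total=4 Unassigned=2 Tax_quality=1`.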

Taking these results into account, it seems that using trunc-len 111 would be the best option, since it is the setting that finds the highest number of "good" features. However... I don't understand how it is possible to obtain more features when the % of sequences that pass the DADA2 filter is smaller.

What would be the best option here?

Could you, @jwdebelius, take a look at this please?

Thank you to both of you

MiriamGorostidi · April 28, 2021, 2:32pm

Thank you @jwdebelius ! I have already downloaded the paper and will read it as soon as I can!!

jwdebelius · April 28, 2021, 5:06pm

Hi @MiriamGorostidi,

I'm actually concerned about the quality of your classifier if ~4x as many reads are classified as "unassigned". Could you post your classification command?

Thanks,
Justine

MiriamGorostidi · April 29, 2021, 7:40am

Hi @jwdebelius !

I think the reason for so many "Unassigned" classifications is that I am analyzing mycobiota, mapping my sequences against the UNITE database (ITS region). When I run the same pipeline (or a similar one) on my microbiota data, using the Greengenes database, the Unassigned % drops to 5-10%.

However, here I post my command:

qiime feature-classifier classify-consensus-vsearch \
  --i-query ${DIR}/Dada2_output/rep-seqs-pyro-noTrun.qza \
  --i-reference-reads ${DIR}/ITS_UNITEdatabase/unite_dyn_refs.qza \
  --i-reference-taxonomy ${DIR}/ITS_UNITEdatabase/unite_dyn_taxa.qza \
  --o-classification ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qza

qiime metadata tabulate \
  --m-input-file ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qza \
  --o-visualization ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qzv

qiime taxa barplot \
  --i-table ${DIR}/Dada2_output/table-pyro-noTrun.qza \
  --i-taxonomy ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qza \
  --m-metadata-file ${DIR}/samples-metadata.tsv \
  --o-visualization ${DIR}/Taxonomic-Analysis-vsearch/taxa-bar-plot-pyro-noTrun-unite-vsearch.qzv

And the same commands for the truncated (Trun) outputs.

Thank you so much!!

jwdebelius · April 29, 2021, 9:06pm

Hi @MiriamGorostidi,

The fact that it's lower with 16S makes me feel a little bit better! I think in that case, the trimming makes sense.

There are two things at play. With ASVs, trimming sooner gives you more high-quality sequences. So, regardless of the actual sequence, you may have more counts with trimming. But imagine I have three sequences:

CATCATCAT
CATCATCAG
CATATATAT

If I trim everything to 3 bp, then I would get 1 ASV: CAT. If I trim to 8, I would get 2: CATCATCA and CATATATA. If I leave the sequences untrimmed, I get 3. The longer the sequence, the more variation you're able to capture.
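
If you want to play with this yourself, here is a minimal sketch of the toy example using standard shell tools. It only counts exact unique strings after truncation, so it is not what DADA2 actually does when denoising; the function name `count_unique` is made up for illustration, and a length of 0 is treated as "no truncation" to mirror how trunc-len 0 is used in this thread.

```shell
# Count distinct sequences after truncating each of the three toy reads.
# A length of 0 means "do not truncate".
count_unique() {
  len="$1"
  if [ "$len" -eq 0 ]; then
    printf '%s\n' CATCATCAT CATCATCAG CATATATAT | sort -u | wc -l | tr -d ' '
  else
    printf '%s\n' CATCATCAT CATCATCAG CATATATAT | cut -c "1-$len" | sort -u | wc -l | tr -d ' '
  fi
}

for len in 3 8 0; do
  echo "truncate at $len: $(count_unique "$len") unique"
done
```

This reports 1, 2, and 3 unique sequences for lengths 3, 8, and 0, matching the counts above.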

Best,
Justine

MiriamGorostidi · April 30, 2021, 12:38pm

WOW @jwdebelius !! What a nice example to understand this "problem" !!

Thank you so much! I will discuss this same example with my supervisors and my lab mates and will let you know which approach we end up using.

Besides that, what do you mean here? Where does the trimming make sense: in the 16S data or in the ITS data?
"The fact that it's lower with 16S makes me feel a little bit better! I think in that case, the trimming makes sense."

Thank you so much

jwdebelius · April 30, 2021, 2:51pm

[quote="MiriamGorostidi, post:14, topic:19394"]
Besides that, what do you mean here? Where does the trimming make sense? In the 16S or in ITS?
The fact that it's lower with 16S makes me feel a little bit better! I think in that case, the trimming makes sense.
[/quote]

I think specifically in the ITS example you've shown, trimming makes sense. But likely in the 16S as well.

Best,
Justine

MiriamGorostidi · May 3, 2021, 7:15am

Okay then!!

Thank you sooooo muuuch Justine

Best!!

MiriamGorostidi · May 4, 2021, 12:02pm

I'm sorry @jwdebelius, but I come with some news (not really good). I repeated the comparison on a new set of samples, and I got the following:

If I set --p-trunc-len to 210, which is where the quality score drops:
DADA2: only 25-40% of the sequences pass the filter, and 45954 representative sequences are found.
When I do the taxonomic analysis against the UNITE database:

| |Total|Classified|K_L1|P_L2|C_L3|O_L4|F_L5|G_L6|S_L7|
|---|---|---|---|---|---|---|---|---|---|
|Total|45954|1604|22|17|7|44|367|852|294|
|Unassigned|44350|NA|NA|NA|NA|NA|NA|NA|NA|
|Taxon_Unique|210|NA|NA|NA|NA|NA|NA|NA|NA|

However, if I set --p-trunc-len to 0, therefore not truncating my sequences:
DADA2: 88-90% of the sequences pass the filter, and 68659 representative sequences are found.
When I do the taxonomic analysis against the UNITE database:

| |Total|Classified|K_L1|P_L2|C_L3|O_L4|F_L5|G_L6|S_L7|
|---|---|---|---|---|---|---|---|---|---|
|Total|68659|1166|22|9|9|35|258|629|203|
|Unassigned|67493|NA|NA|NA|NA|NA|NA|NA|NA|
|Taxon_Unique|204|NA|NA|NA|NA|NA|NA|NA|NA|

What do you think about this? Should I skip trimming, so that a high % of my sequences pass the filter? Or should I continue trimming?

jwdebelius · May 4, 2021, 3:50pm

Hi @MiriamGorostidi,

The DADA2 statistics are a really good guide as to where things are failing and why, so I might explore that if you're interested. I do less work in the ITS realm, so I'm less sure where trimming might have a big effect here.
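
As a rough sketch of what exploring those statistics could look like: the snippet below assumes the denoising stats have been exported to a tab-separated file whose columns 2 to 5 hold the input, filtered, denoised, and non-chimeric read counts. The file name `stats_toy.tsv`, the function name `step_losses`, the column order, and the numbers are hypothetical, so check them against your own export.

```shell
# Toy per-sample denoising stats (hypothetical numbers and column layout).
{
  printf 'sample-id\tinput\tfiltered\tdenoised\tnon-chimeric\n'
  printf 'toyA\t1000\t300\t280\t250\n'
  printf 'toyB\t1000\t900\t850\t400\n'
} > stats_toy.tsv

# Print, per sample, the share of input reads still present after each step.
step_losses() {
  awk -F '\t' '
    NR == 1 || /^#/ { next }   # skip the header and any comment rows
    {
      printf "%s filtered=%.0f%% denoised=%.0f%% non-chimeric=%.0f%%\n",
             $1, 100 * $3 / $2, 100 * $4 / $2, 100 * $5 / $2
    }
  ' "$1"
}

step_losses stats_toy.tsv
```

In this toy output, toyA loses most of its reads at the filtering step, while toyB loses them at chimera removal. A large drop at filtering is worth checking against the truncation length, since DADA2 discards reads shorter than the truncation length.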

In general, my approach tends to be "good enough": do what makes sense and report it because there's often a range of optimum answers, depending on what your goal is.

Best,
Justine
