Questions about Ion Torrent Data Analysis

Hey you all!

I’m a PhD student trying to figure out how to analyze Ion Torrent data. I think I already have almost all of my final pipeline defined. However, I still need to answer one last question, and I think your advice could be really helpful. :slight_smile:

My main doubt is whether to truncate my sequences or keep their full lengths. As a first approach, I used trunc-len as explained in the QIIME 2 tutorials, truncating the sequences where the quality score drops. However, when I came across the post where an Ion Torrent data pipeline is described, I read the following comment:

Could you clarify? Are you using extract-reads on the reference sequences? If so, don’t use trunc-len on those reads (since your query seqs are in mixed orientations they can hit at either end of that read)


Does this mean I should not use trunc-len on Ion Torrent data in general?


My other question is whether it is necessary to do closed-reference OTU clustering (as described in that previous post), or whether it is possible to work directly with the ASVs created in the DADA2 step.


Thank you so much in advance! I’m a bit lost and I would appreciate your help :smiley:

Bests,

Miriam

Hey @MiriamGorostidi ,

Let me share my experience with Ion Torrent sequences in DADA2. I tried the trunc-len option and the results were terrible: only a small % of the reads passed the filter. I think it is better to go with trunc-len 0 along with trim-left 15. For further reference, please check this one.

Note - I used single-end sequences for the analysis. I am also not an expert in this, so it may be different for other Ion Torrent sequences.
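
For reference, a call with those settings would look roughly like the sketch below. The function name and the input and output file names are placeholders made up for illustration, so adjust them to your own files; only the trim-left 15 and trunc-len 0 values come from what I described above.

```shell
# Sketch only: function name and file names are hypothetical placeholders.
# Wrapped in a function so nothing runs until you call it.
denoise_pyro_no_trunc() {
  demux="$1"    # demultiplexed single-end reads (.qza)
  outdir="$2"   # directory for the DADA2 outputs
  mkdir -p "$outdir"
  qiime dada2 denoise-pyro \
    --i-demultiplexed-seqs "$demux" \
    --p-trim-left 15 \
    --p-trunc-len 0 \
    --o-table "$outdir/table.qza" \
    --o-representative-sequences "$outdir/rep-seqs.qza" \
    --o-denoising-stats "$outdir/denoising-stats.qza"
}
```

You would then call it as `denoise_pyro_no_trunc demux.qza dada2_out`. Setting trunc-len to 0 disables truncation, so reads keep their full length after the first 15 bases are removed.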

  1. I think it depends on the question you are asking, but you can use either method, or you can generate ASVs with DADA2 and then use those sequences for OTU clustering.

Best,

Sreevatshan.

Hi @MiriamGorostidi,

I think that post specifically refers to the vsearch classifier they were training for ASVs.

IDK if it helps or makes things more complicated but I was just involved in a benchmark of a set of methods to scaffold multiple regions. If you can get the primers somehow, you may want to look at the sidle plugin, since this performed best for multiple region scaffolding.

Best,
Justine


Hello @Sreevatshan and thank you for your rapid answer :slight_smile:

I’m using denoise-pyro and trim-left 15 too. When I use trunc-len 111, around 70% of the sequences pass the filter, which is not that bad… I think. However, it is true that with trunc-len 0 this % increases to 85-89%.

The thing is that, after I do the taxonomic analysis against the database (with the classify-consensus-vsearch method), I obtain the following results:

| | DADA2 Pyro --trunc-len 111 | DADA2 Pyro --trunc-len 150 | DADA2 Pyro --trunc-len 0 |
|---|---|---|---|
| Total | 4706 | 6361 | 928 |
| Unassigned | 3738 | 5427 | 483 |
| Tax_quality | 968 | 934 | 445 |

Total is the total number of feature IDs that were found and are contained in the Taxonomy.qzv after VSEARCH.

Unassigned is the number of feature IDs whose taxonomy is Unassigned.

Finally, Tax_quality is the number of feature IDs that have a Confidence > 0.5 (the idea was to see how many of the total feature IDs were of good quality).

Taking these results into account, it seems that using trunc-len 111 would be the best option, since it is the setting that finds the highest number of “good” features. However… I don’t understand how it is possible to obtain more features when the % of sequences that pass the DADA2 filter is smaller.
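
To compare the three runs on the same scale, the share of “good” features (Tax_quality / Total) can also be computed from the counts above. This is only a quick awk sketch; the function name is made up and the numbers are copied from my results.

```shell
# Share of features with Confidence > 0.5 per run: Tax_quality / Total.
# Input columns: trunc-len, Total, Tax_quality (counts copied from the results above).
tax_quality_share() {
  printf '%s\n' "111 4706 968" "150 6361 934" "0 928 445" |
    awk '{ printf "trunc-len %s: %.1f%%\n", $1, 100 * $3 / $2 }'
}

tax_quality_share
# trunc-len 111: 20.6%
# trunc-len 150: 14.7%
# trunc-len 0: 48.0%
```

So the runs differ in the share of good features as well as in the absolute count.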

What would be the best option here?

Could you @jwdebelius take a look at this please? :sweat_smile:

Thank you to both of you :heart_eyes:

Thank you @jwdebelius ! I have already downloaded the paper and will read it as soon as I can!! :slight_smile:

Hi @MiriamGorostidi,

I’m actually concerned about the quality of your classifier if ~4x as many reads are classified as “unassigned”. Could you post your classification command?

Thanks,
Justine

Hi @jwdebelius !

I think the reason for so many “Unassigned” classifications is that I am analyzing mycobiota, mapping my sequences against the UNITE database (ITS region). When I run the same (or a similar) pipeline on my microbiota data, using the GreenGenes database, the Unassigned % drops to 5-10%.

In any case, here are my commands:

qiime feature-classifier classify-consensus-vsearch \
  --i-query ${DIR}/Dada2_output/rep-seqs-pyro-noTrun.qza \
  --i-reference-reads ${DIR}/ITS_UNITEdatabase/unite_dyn_refs.qza \
  --i-reference-taxonomy ${DIR}/ITS_UNITEdatabase/unite_dyn_taxa.qza \
  --o-classification ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qza

qiime metadata tabulate \
  --m-input-file ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qza \
  --o-visualization ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qzv

qiime taxa barplot \
  --i-table ${DIR}/Dada2_output/table-pyro-noTrun.qza \
  --i-taxonomy ${DIR}/Taxonomic-Analysis-vsearch/taxonomy-pyro-noTrun-unite-vsearch.qza \
  --m-metadata-file ${DIR}/samples-metadata.tsv \
  --o-visualization ${DIR}/Taxonomic-Analysis-vsearch/taxa-bar-plot-pyro-noTrun-unite-vsearch.qzv

(And the same for the truncated runs :slight_smile:)

Thank you so much!!

Hi @MiriamGorostidi,

The fact that it’s lower with 16S makes me feel a little bit better! I think in that case, the trimming makes sense.

There are two things at play. With ASVs, trimming sooner gives you more high-quality sequences. So, regardless of the actual sequence, you may have more counts with trimming. But, imagine I have 3 sequences:

CATCATCAT
CATCATCAG
CATATATAT

If I trim everything to 3 bp, then I would get 1 ASV: CAT. If I trim to 8, I would get 2: CATCATCA and CATATATA. If I leave the sequences untrimmed, I get 3. The longer the sequence, the more variation you’re able to capture.
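
If you want to play with this yourself, here is a toy shell sketch of the same example. It only counts exact unique strings after cutting each read to a given length, which is a simplification and not what DADA2 does internally; the function name is made up for illustration.

```shell
# Count distinct sequences after truncating each toy read to a given length.
count_unique_after_trunc() {
  len="$1"
  printf '%s\n' CATCATCAT CATCATCAG CATATATAT |
    cut -c "1-$len" | sort -u | wc -l | tr -d ' '
}

for n in 3 8 9; do
  echo "truncate to $n bp: $(count_unique_after_trunc "$n") unique"
done
# truncate to 3 bp: 1 unique
# truncate to 8 bp: 2 unique
# truncate to 9 bp: 3 unique
```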

Best,
Justine


WOW @jwdebelius !! What a nice example to understand this “problem” !!

Thank you so much! I will discuss this same example with my supervisors and my lab mates and will let you know which approach we end up using…

Besides that, what do you mean here? Where does the trimming make sense? In the 16S or in ITS?
The fact that it’s lower with 16S makes me feel a little bit better! I think in that case, the trimming makes sense.

Thank you so much :smiling_face_with_three_hearts:


[quote="MiriamGorostidi, post:14, topic:19394"]
Besides that, what do you mean here? Where does the trimming make sense? In the 16S or in ITS?
The fact that it’s lower with 16S makes me feel a little bit better! I think in that case, the trimming makes sense.
[/quote]

I think specifically in the ITS example you’ve shown, trimming makes sense. But likely in the 16S as well.

Best,
Justine

Okay then!!

Thank you sooooo muuuch Justine :smiling_face_with_three_hearts:

Best!!


I’m sorry @jwdebelius, but I come with some news (not really good). I repeated the comparison on a new set of samples, and I got the following:

  1. If I set --trunc-len to 210, which is where the quality score drops:
    DADA2: only 25-40% of the sequences pass the filter, and 45954 representative sequences are found.
    When I do the taxonomic analysis, mapping against the UNITE database:

| | Total | Classified | K_L1 | P_L2 | C_L3 | O_L4 | F_L5 | G_L6 | S_L7 |
|---|---|---|---|---|---|---|---|---|---|
| Total | 45954 | 1604 | 22 | 17 | 7 | 44 | 367 | 852 | 294 |
| Unassigned | 44350 | NA | NA | NA | NA | NA | NA | NA | NA |
| Taxon_Unique | 210 | NA | NA | NA | NA | NA | NA | NA | NA |

  2. However, if I set --trunc-len to 0, that is, without truncating my sequences:
    DADA2: 88-90% of the sequences pass the filter, and 68659 representative sequences are found.
    When I do the taxonomic analysis, mapping against the UNITE database:

| | Total | Classified | K_L1 | P_L2 | C_L3 | O_L4 | F_L5 | G_L6 | S_L7 |
|---|---|---|---|---|---|---|---|---|---|
| Total | 68659 | 1166 | 22 | 9 | 9 | 35 | 258 | 629 | 203 |
| Unassigned | 67493 | NA | NA | NA | NA | NA | NA | NA | NA |
| Taxon_Unique | 204 | NA | NA | NA | NA | NA | NA | NA | NA |

What do you think about this? Should I skip trimming, so that a higher % of my sequences pass the filter? Or should I keep trimming?

Hi @MiriamGorostidi,

The DADA2 statistics are a really good guide as to where things are failing and why, so I might explore that if you’re interested. I do less work in the ITS realm, so I’m less sure where trimming might have a big effect here.
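
In case it’s useful, the denoising stats artifact can be turned into a viewable table with `qiime metadata tabulate`. A minimal sketch follows; the function name and the default file name are placeholders I made up, so point it at whatever your DADA2 step wrote.

```shell
# Sketch only: function name and default file name are hypothetical placeholders.
view_denoising_stats() {
  stats="${1:-denoising-stats.qza}"   # stats artifact from the DADA2 step
  qiime metadata tabulate \
    --m-input-file "$stats" \
    --o-visualization "${stats%.qza}.qzv"
}
```

Calling `view_denoising_stats Dada2_output/denoising-stats.qza` would write a `.qzv` next to the input. The resulting table lists, per sample, how many reads remain after filtering, denoising, and chimera removal, which should show at which step the reads are being lost.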

In general, my approach tends to be “good enough”: do what makes sense and report it because there’s often a range of optimum answers, depending on what your goal is.

Best,
Justine