I'm a PhD student trying to figure out how to analyze Ion Torrent data. I think I have almost all of my final pipeline defined. However, I need to answer one last question for myself, and I think your advice could be really helpful.
My main doubt is whether to truncate my sequences or keep their full lengths. As a first approach, I applied trunc-len as explained in the QIIME 2 tutorials, truncating the sequences where the quality score drops. However, when I came across the post describing an Ion Torrent data pipeline, I read the following comment:
Could you clarify? Are you using extract-reads on the reference sequences? If so, don’t use trunc-len on those reads (since your query seqs are in mixed orientations they can hit at either end of that read)
Does this mean that trunc-len should not be used on Ion Torrent data in general?
Another question I have is whether it is necessary to do closed-reference OTU clustering (as described in that previous post) or whether it is possible to work directly with the ASVs created in the DADA2 step.
Thank you so much in advance! I'm a bit lost and I would appreciate your help.
Let me share my experience of using Ion Torrent sequences in DADA2. I tried the trunc-len option and the results were terrible: only a small percentage passed the filter. I think it is better to go with trunc-len 0 along with trim-left 15. For further reference, please check this one.
Note: I used single-end sequences for the analysis. I am also not an expert on this, so it may be different for other Ion Torrent sequences.
I think it depends on the question you are asking, but you can use either method, or you can generate ASVs with DADA2 and then use those sequences for OTU clustering.
I think that post specifically refers to the vsearch classifier they were training for ASVs.
IDK if it helps or makes things more complicated but I was just involved in a benchmark of a set of methods to scaffold multiple regions. If you can get the primers somehow, you may want to look at the sidle plugin, since this performed best for multiple region scaffolding.
Hello @Sreevatshan, and thank you for your rapid answer.
I'm using denoise-pyro and trim-left 15 too. When I apply trunc-len 111 to my sequences, around 70% of them pass the filter, which is not that bad... I think. However, it is true that with trunc-len 0 this increases to 85-89%.
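For anyone wanting to repeat this comparison, here is a minimal sketch that builds the two `qiime dada2 denoise-pyro` commands (trunc-len 0 and 111, both with trim-left 15). The input and output file names are hypothetical placeholders, not the ones used in this thread, and the helper function is only for illustration.

```python
import shlex


def build_denoise_pyro_cmd(trunc_len, trim_left=15,
                           demux="demux.qza", prefix="dada2"):
    """Return the argument list for one `qiime dada2 denoise-pyro` run.

    `demux` and `prefix` are hypothetical file names (assumptions).
    trunc_len=0 disables truncation.
    """
    return [
        "qiime", "dada2", "denoise-pyro",
        "--i-demultiplexed-seqs", demux,
        "--p-trim-left", str(trim_left),
        "--p-trunc-len", str(trunc_len),
        "--o-table", f"{prefix}-table-trunc{trunc_len}.qza",
        "--o-representative-sequences", f"{prefix}-rep-seqs-trunc{trunc_len}.qza",
        "--o-denoising-stats", f"{prefix}-stats-trunc{trunc_len}.qza",
    ]


# Print both commands so they can be inspected before running anything.
for trunc_len in (0, 111):
    print(shlex.join(build_denoise_pyro_cmd(trunc_len)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` in an environment where QIIME 2 is installed. Keeping the truncation length in the output names makes it harder to mix up the two runs later.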
The thing is that, after I do the Taxonomic Analysis against the database (with Classify-Consensus-VSEARCH classifier method), I obtain the following results:
The Total corresponds to the total FeatureID-s that have been found and are contained in the Taxonomy.qzv after VSEARCH.
The Unassigned, therefore, to the FeatureID-s whose Taxonomy is =Unassigned.
Finally, the Tax_quality parameter refers to those FeatureIDs that have a Confidence > 0.5 (the idea was to see how many of the total FeatureIDs were of good quality).
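A rough sketch of how these three counts could be computed from an exported taxonomy table is shown here. The column names (`Taxon`, `Confidence`) are assumptions and may differ in your export (the score column may have another name depending on the classifier), and the toy rows are invented purely for illustration.

```python
import csv
import io


def summarize_taxonomy(tsv_text, score_column="Confidence", threshold=0.5):
    """Count Total, Unassigned and Tax_quality features in a taxonomy TSV.

    Tax_quality follows the definition above: rows whose score is > threshold.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    unassigned = sum(1 for r in rows if r["Taxon"].strip() == "Unassigned")
    good = sum(1 for r in rows if float(r[score_column]) > threshold)
    return {"Total": len(rows), "Unassigned": unassigned, "Tax_quality": good}


# Toy rows invented for illustration only; they are not data from this thread.
toy_tsv = (
    "Feature ID\tTaxon\tConfidence\n"
    "f1\tk__Fungi;p__Ascomycota\t0.98\n"
    "f2\tUnassigned\t0.30\n"
    "f3\tk__Fungi\t0.40\n"
)
print(summarize_taxonomy(toy_tsv))
# {'Total': 3, 'Unassigned': 1, 'Tax_quality': 1}
```

Note that this counts every row above the threshold, so if Unassigned rows carry a high score in your output, you may want to exclude them before counting "good" features.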
Taking these results into account, it seems that using trunc-len 111 would be the best option, since it is the setting that finds the highest number of "good" features. However... I don't understand how it is possible to obtain more features when the % of sequences that pass the DADA2 filter is smaller.
I'm actually concerned about the quality of your classifier if ~4x as many reads are classified as "unassigned". Could you post your classification command?
I think the reason for so many "Unassigned" classifications is that I am analyzing Mycobiota, mapping my sequences against UNITE database (ITS gene). When I do the same Pipeline (or similar) with my Microbiota data, using GreenGenes DB, the Unassigned % drops to 5-10%.
The fact that it's lower with 16S makes me feel a little bit better! I think in that case, the trimming makes sense.
There are two things at play. With ASVs, trimming sooner gives you more high-quality sequences. So, regardless of the actual sequence, you may have more counts with trimming. But imagine I have three sequences:
CATCATCAT
CATCATCAG
CATATATAT
If I trim everything to 3 bp, then I would get 1 ASV: CAT. If I trim to 8, I would get 2: CATCATCA and CATATATA. If I leave the sequences untrimmed, I get 3. The longer the sequence, the more variation you're able to capture.
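The same toy example can be checked with a few lines of Python. This is only a sketch of the counting argument: it collapses exact duplicates after truncation and does not perform DADA2's error modelling. The function name is made up for illustration.

```python
def unique_after_truncation(reads, trunc_len=0):
    """Return the distinct sequences left after truncating to trunc_len bases.

    trunc_len=0 means no truncation. Reads shorter than trunc_len are
    dropped, as DADA2 does with reads shorter than the truncation length.
    """
    if trunc_len == 0:
        return sorted(set(reads))
    return sorted({r[:trunc_len] for r in reads if len(r) >= trunc_len})


reads = ["CATCATCAT", "CATCATCAG", "CATATATAT"]
for trunc_len in (3, 8, 0):
    seqs = unique_after_truncation(reads, trunc_len)
    print(trunc_len, len(seqs), seqs)
# 3 1 ['CAT']
# 8 2 ['CATATATA', 'CATCATCA']
# 0 3 ['CATATATAT', 'CATCATCAG', 'CATCATCAT']
```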
WOW @jwdebelius !! What a nice example to understand this "problem" !!
Thank you so much! I will discuss this same example with my superiors and my lab mates and will let you know which approach we end up using.
Besides that, what do you mean here? Where does the trimming make sense, in 16S or in ITS? "The fact that it's lower with 16S makes me feel a little bit better! I think in that case, the trimming makes sense."
[quote="MiriamGorostidi, post:14, topic:19394"]
Besides that, what do you mean here? Where does the trimming make sense, in 16S or in ITS? "The fact that it's lower with 16S makes me feel a little bit better! I think in that case, the trimming makes sense."
[/quote]
I think specifically in the ITS example you've shown, trimming makes sense. But likely in the 16S as well.
I'm sorry @jwdebelius, but I come with some news (not really good). I repeated the comparison on a new set of samples, and I got the following:
If I set --p-trunc-len to 210, which is where the quality score drops:
DADA2: only 25-40% of the sequences pass the filter, and 45954 representative sequences are found.
When I do the taxonomic analysis mapping against UNITE database:
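| | Total | Classified | K_L1 | P_L2 | C_L3 | O_L4 | F_L5 | G_L6 | S_L7 |
|---|---|---|---|---|---|---|---|---|---|
| Total | 45954 | 1604 | 22 | 17 | 7 | 44 | 367 | 852 | 294 |
| Unassigned | 44350 | NA | NA | NA | NA | NA | NA | NA | NA |
| Taxon_Unique | 210 | NA | NA | NA | NA | NA | NA | NA | NA |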
However, if I set --p-trunc-len to 0, and therefore do not truncate my sequences:
DADA2: 88-90% of the sequences pass the filter, and 68659 representative sequences are found.
When I do the taxonomic analysis mapping against UNITE database:
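| | Total | Classified | K_L1 | P_L2 | C_L3 | O_L4 | F_L5 | G_L6 | S_L7 |
|---|---|---|---|---|---|---|---|---|---|
| Total | 68659 | 1166 | 22 | 9 | 9 | 35 | 258 | 629 | 203 |
| Unassigned | 67493 | NA | NA | NA | NA | NA | NA | NA | NA |
| Taxon_Unique | 204 | NA | NA | NA | NA | NA | NA | NA | NA |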
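To put the two runs side by side, here is a small sketch that turns the counts reported above into classified fractions. It uses only the Total, Classified and Unassigned numbers from this post. The dictionary layout and function name are just for illustration.

```python
# Counts copied from the two runs reported above.
runs = {
    "trunc-len 210": {"total": 45954, "classified": 1604, "unassigned": 44350},
    "trunc-len 0": {"total": 68659, "classified": 1166, "unassigned": 67493},
}


def classified_fraction(run):
    """Return classified / total after checking the counts add up."""
    if run["classified"] + run["unassigned"] != run["total"]:
        raise ValueError("classified + unassigned does not equal total")
    return run["classified"] / run["total"]


for name, run in runs.items():
    print(f"{name}: {classified_fraction(run):.1%} of features classified")
# trunc-len 210: 3.5% of features classified
# trunc-len 0: 1.7% of features classified
```

In both runs more than 96% of the features are Unassigned, so the difference between the two truncation settings is small compared with the overall classification rate.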
What do you think about this? Should I skip trimming, so a high % of my sequences pass the filter? Or continue trimming?
The DADA2 statistics are a really good guide as to where things are failing and why, so I might explore that if you're interested. I do less work in the ITS realm, so I'm less sure where trimming might have a big effect here.
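As a sketch of what exploring those statistics could look like, the snippet below computes the fraction of reads lost at each step for one sample. The step names are assumed to match the columns of the exported denoising stats, and the counts are hypothetical, not taken from this thread.

```python
def step_losses(counts, steps=("input", "filtered", "denoised", "non-chimeric")):
    """Return the fraction of reads lost at each step, relative to the previous step."""
    losses = {}
    for prev, cur in zip(steps, steps[1:]):
        before, after = counts[prev], counts[cur]
        losses[cur] = (before - after) / before if before else 0.0
    return losses


# Hypothetical counts for a single sample (for illustration only).
sample = {"input": 10000, "filtered": 3000, "denoised": 2900, "non-chimeric": 2800}

losses = step_losses(sample)
worst = max(losses, key=losses.get)
for step, loss in losses.items():
    print(f"{step}: {loss:.1%} lost")
print("largest loss at:", worst)
```

If most reads disappear at the filtering step, the truncation length and quality filtering are the likely cause. Large losses at the chimera step would point somewhere else.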
In general, my approach tends to be "good enough": do what makes sense and report it because there's often a range of optimum answers, depending on what your goal is.