Any other option for building the feature table from PGM data without trimming?

Hi,
I would like to know whether it is possible to build the feature table with a method that does not require any trimming when the sequences have different lengths.

In addition, my PGM reads can be in either orientation (forward or reverse complement) because of the way the adaptors and barcodes are added after amplification of my amplicon.

Any ideas about my options?

thank you!
Best

You can disable read trimming with both dada2 and deblur. To do so, I believe you set the truncation length to 0 in dada2 and the trim length to -1 in deblur, but check the documentation to confirm.

You can also use OTU clustering with q2-vsearch, which should not (or would not need to) trim any of your reads.

We have also run into some irregular Illumina library preps that have this same issue. We do not yet have a good way of handling this in QIIME2 because it is unusual compared with 99% of the data we've run into.

I'm sure you are aware this will cause issues, e.g., with unique ASV/OTU calling, since exact reverse complements of the same sequence will become 2 unique features! And also issues with taxonomy assignment as we've discussed on separate threads.
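To make the "2 unique features" problem concrete, here is a minimal Python sketch (not part of QIIME2; the function names and the toy reads are made up for illustration). It compares plain dereplication with a strand-aware version that treats a read and its reverse complement as the same sequence:

```python
from collections import Counter

# Complement table for unambiguous bases plus N (assumption: uppercase DNA only).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def dereplicate(reads):
    """Count exact duplicates, ignoring strand (what happens by default)."""
    return Counter(reads)


def dereplicate_strand_aware(reads):
    """Count duplicates after mapping each read to a canonical strand.

    The canonical form here is simply the lexicographically smaller of the
    read and its reverse complement. This only illustrates the inflation;
    it does NOT put reads into the orientation of a reference database.
    """
    return Counter(min(r, reverse_complement(r)) for r in reads)


# Toy data: "GACTT" is the reverse complement of "AAGTC".
reads = ["AAGTC", "GACTT", "AAGTC", "CCGTA"]
plain = dereplicate(reads)
aware = dereplicate_strand_aware(reads)
print(len(plain), "unique sequences without strand handling")
print(len(aware), "unique sequences with strand handling")
```

The canonical-strand trick is only useful for counting; for taxonomy assignment the reads would still need to be in the same orientation as the reference sequences.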

For now I believe the best approach may be to try to re-orient these reads before they are passed into QIIME2 (though you could also export, reorient/merge features, and import back into QIIME2 after denoising, but it will be much messier). If you can easily recognize the read orientation, e.g., if primers are still in the raw sequences, there might be other tools out there that do this. But I do not know of a tool for this; other forum users might be able to help.
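If the primers are still at the start of the raw reads, the re-orientation idea could look roughly like the Python sketch below. This is only an illustration under several assumptions: the primer sequences are placeholders, primers are matched at the very start of the read by a simple mismatch count, and degenerate (IUPAC) primer bases are not handled.

```python
from typing import Optional

# Complement table for unambiguous bases plus N (assumption: uppercase DNA only).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def starts_with_primer(read: str, primer: str, max_mismatches: int = 0) -> bool:
    """True if the read begins with the primer, allowing a few mismatches."""
    if len(read) < len(primer):
        return False
    mismatches = sum(a != b for a, b in zip(read, primer))
    return mismatches <= max_mismatches


def orient_read(read: str, forward_primer: str, reverse_primer: str,
                max_mismatches: int = 0) -> Optional[str]:
    """Return the read in forward orientation, or None if it cannot be decided.

    A read starting with the forward primer is kept as-is; a read starting
    with the reverse primer is reverse-complemented.
    """
    if starts_with_primer(read, forward_primer, max_mismatches):
        return read
    if starts_with_primer(read, reverse_primer, max_mismatches):
        return reverse_complement(read)
    return None


# Hypothetical primers and amplicon, for illustration only.
FWD_PRIMER = "ACGT"
REV_PRIMER = "TTGA"
forward_read = FWD_PRIMER + "CCCAAA" + reverse_complement(REV_PRIMER)
flipped_read = reverse_complement(forward_read)

print(orient_read(forward_read, FWD_PRIMER, REV_PRIMER))
print(orient_read(flipped_read, FWD_PRIMER, REV_PRIMER))
```

Reads that return None (no recognizable primer) would need to be inspected or discarded before importing into QIIME2.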

I hope that helps!

Hi @Nicholas_Bokulich,
My first option was to disable trimming in dada2 and deblur, but they detect that my sequences have different lengths and give me an error about that, so that option is not possible for me.

Regarding the orientation, I know that alpha diversity is overestimated at the OTU level because the forward and reverse reads of the same sequence (variant) will be considered different variants when they are really the same. However, for the moment (until there is a better solution) I can work at the species level, and the results should be good.

However, regarding my taxonomic results, the orientation does not seem to be the problem. I analyzed my samples with QIIME1 (after filtering by quality and minimum length with Cutadapt). I used open-reference OTU picking (97% for both OTU similarity and taxonomic assignment) with the uclust method, and my results are good for my mock communities. But when I import the same samples (filtered by Cutadapt) into QIIME 2 and use dereplication, vsearch OTU picking (also at 97%), and the vsearch classifier (97%), trying to reproduce the QIIME1 workflow in QIIME2, I do not find my target species.
I do not know why; I do not have many settings to change...

Thank you very much for your help and your ideas!
Best,
MMC

Oops, sorry; processing PGM data is a learning process for me as well :grin:

Perfect!

I also just saw this post crop up on the forum:

They are using Illumina data, but their reorientation script might be useful for you; perhaps you should follow that thread and/or contact that user for help.

Interesting. There are obviously two different changes here between QIIME1 and QIIME2 that could both be causing this disparity:

  1. The OTU picking pipelines should be very similar, but do use different algorithms (uclust vs. vsearch).
  2. The taxonomy classifiers are different. I have found that vsearch performs similarly to or better than uclust on mock communities, but that was with 16S and fungal ITS... it could be a very different story for 18S reads.

The vsearch classifier does have a number of different parameters to change. Especially --p-perc-identity, --p-maxaccepts, and --p-min-consensus will alter behavior, and may improve classification. If you are not already, I would recommend using a reference database with sequences clustered at 99% rather than 97%.
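To give a feel for how these parameters interact, here is a simplified Python sketch of consensus taxonomy assignment. It is not the actual q2-feature-classifier code: the function name and the example lineages are hypothetical, and it assumes the input list already contains only hits that passed the identity threshold (--p-perc-identity), capped at the number of accepted hits (--p-maxaccepts). The consensus threshold plays the role of --p-min-consensus and is assumed to be above 0.5.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage shared by at least min_consensus of the hits.

    hit_taxonomies: list of semicolon-delimited taxonomy strings, one per
    accepted hit. Assumes min_consensus > 0.5 so that the winner at each
    rank always extends the winner at the previous rank.
    """
    if not hit_taxonomies:
        return "Unassigned"
    lineages = [tuple(rank.strip() for rank in t.split(";"))
                for t in hit_taxonomies]
    max_depth = min(len(lineage) for lineage in lineages)
    consensus = ()
    for depth in range(1, max_depth + 1):
        # Count how many hits share each lineage prefix down to this rank.
        counts = Counter(lineage[:depth] for lineage in lineages)
        best_prefix, n_hits = counts.most_common(1)[0]
        if n_hits / len(lineages) < min_consensus:
            break  # not enough agreement: stop at the previous rank
        consensus = best_prefix
    return "; ".join(consensus) if consensus else "Unassigned"


# Made-up hits: 6 of 10 agree at the deepest rank.
hits = (["Eukaryota; Metazoa; Nematoda; Species_A"] * 6
        + ["Eukaryota; Metazoa; Nematoda; Species_B"] * 4)

print(consensus_taxonomy(hits, min_consensus=0.51))
print(consensus_taxonomy(hits, min_consensus=0.70))
```

In this toy case a stricter consensus threshold makes the assignment stop one rank higher, which is one way a target species can disappear from the results even though its reference sequence is among the hits.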

Sorry I can't provide clearer answers; since your data type is currently constrained from using dada2/deblur and the sklearn classifier in QIIME2, there are limited options for trying alternative approaches. It is particularly troubling because those would be our preferred/recommended approaches for Illumina 16S data, so you are sort of "stuck" doing QIIME1-style analyses in QIIME2 right now. The fact that QIIME1 is doing better is a bit concerning... but it seems that there may be more optimization possible with the vsearch classifier for 18S data, which I hope may help.

Please let us know if that helps!

I changed those parameters, for example --p-maxaccepts and --p-min-consensus (e.g., 50 maxaccepts and 70% consensus; 10 maxaccepts (the default) with 65% or 61% consensus; ...), and in the end the best results (closest to the expected composition of our mocks and to the QIIME1 results) were obtained with the default parameters.
Regarding --p-perc-identity, I tried 97%, 99%, and 90%, and the best results were with 90% (very low compared with the 97% used in QIIME1, which gave good results).

One question: when you say that it could be better to use sequences clustered at 99%, do you mean my sequences or the sequences in the reference database (Silva provides databases clustered at 97% and 99%)? Currently I use the 97% Silva database because it gave good results in previous studies and in the current study with QIIME1, but maybe the 99% clustered database is better in QIIME 2?

Thank you for your much-appreciated help.
MMC

OK, I will try to contact that user to talk about the reorientation. In the meantime, if I use vsearch open-reference OTU picking, that method can search both strands through the --p-strand option, so I understand that in this case the program checks both orientations and the final diversity will be correct and not overestimated. Is this right?

Oh, good to hear. Did I misunderstand, or weren't you saying above that the defaults were not so good in QIIME2 for your mock community? Were those with non-default parameters?

That is low, but it is really going to depend on the reference database and marker gene; since you are doing 18S it is totally conceivable that different parameter settings would optimize this (the QIIME2 defaults are mostly based on 16S). Once again, it is great that you have a mock community to optimize this independently :grin:

I find that 99% is best; it is going to retain the highest amount of sequence information and potentially lead to more specific classifications (or so I find with 16S). Again, this will depend on many factors, though. Many of the QIIME1 recommendations were made in a different time, e.g., when the difference in memory and runtime requirements between 97% and 99% (both the OTU cluster % of the reference sequences and the percent identity for OTU clustering and sequence matching) was much more significant. It is less of a constraint on current systems.

Thanks for testing this out and sharing your findings!

Exactly; to date I do not have good results for my samples with QIIME2 (whether or not I change the default settings), but the results closest to reality and to the QIIME1 results for my mock communities were with the default parameters of the vsearch classifier and 90% identity. If I change the consensus or the maximum accepted hits, the results vary, obviously, but not in the right direction.

I will keep trying to use QIIME2, but for the moment the taxonomy assignment will be done with QIIME1. Then I can import my BIOM table into QIIME2 and use it for the downstream analyses.

Thank you very much for your much-appreciated help. If I have updates, I will share them here for everyone :slight_smile:
thanks again!
