Using unmerged PE data

Peter_Kos · June 20, 2023, 10:09am

Can I use PE sequences where the reads are shorter than half of the amplicon, so that they cannot be joined by overlap? Even with a gap of unknown length between them, the two ends still hold a lot of additional information for similarity-based taxonomy and extra depth for k-mer-based taxonomy, not to mention the resolution of the whole sequencing project.
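To make the arithmetic behind the question concrete, here is a minimal sketch. The function name and the example lengths are hypothetical and not taken from any dataset. The expected overlap is the sum of the two read lengths minus the amplicon length, and a negative value is the size of the gap. In practice the amplicon length varies between taxa, which is why the gap length is unknown.

```python
def expected_overlap(fwd_len: int, rev_len: int, amplicon_len: int) -> int:
    """Return the expected overlap in bases; a negative value means a gap of that size."""
    return fwd_len + rev_len - amplicon_len


# Hypothetical numbers: 2 x 150 nt reads on a 420 nt amplicon cannot overlap.
gap_case = expected_overlap(150, 150, 420)
# Hypothetical numbers: 2 x 250 nt reads on the same amplicon do overlap.
merge_case = expected_overlap(250, 250, 420)
print(gap_case, merge_case)
```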

Nicholas_Bokulich · June 20, 2023, 10:23am

Hi @Peter_Kos ,
This is a great question. This is something on the to-do list, but it is currently not possible in QIIME 2 or really any other package or method out there. The issue is that this is not implemented in the underlying methods, so getting these methods to handle such data would be a significant hurdle.

jwdebelius · June 20, 2023, 1:43pm

Hi @Nicholas_Bokulich and @Peter_Kos,

In theory, Sidle might work for this, but I haven't tried the application explicitly.

It will scaffold forward and reverse reads together, so that's not an issue. I think the key for decent resolution here might be to do a two-step database construction, where you extract the full region you plan to use, and then dereplicate your taxa with RESCRIPt and then prepare your database.

Your results are lower resolution than an ASV table, but you would still be able to combine the reads, and you won't have the taxonomic classification issues you might run into with a padded string of N's.
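As a rough illustration of the first two steps of the database construction described above (extract the region, then dereplicate), here is a conceptual Python sketch. It is not the RESCRIPt or Sidle implementation: it assumes exact, non-degenerate primer matches, and every name, primer, and sequence in it is a made-up toy value.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a plain ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_region(seq: str, fwd_primer: str, rev_primer: str):
    """Return the sequence between the two primers (primers excluded), or None."""
    start = seq.find(fwd_primer)
    if start == -1:
        return None
    start += len(fwd_primer)
    end = seq.find(reverse_complement(rev_primer), start)
    if end == -1:
        return None
    return seq[start:end]


def dereplicate(regions: dict, taxonomy: dict) -> dict:
    """Map each unique region to the sorted set of taxa that share it."""
    groups = defaultdict(set)
    for ref_id, region in regions.items():
        groups[region].add(taxonomy[ref_id])
    return {region: sorted(taxa) for region, taxa in groups.items()}


# Toy reference database (hypothetical sequences and labels).
references = {
    "ref1": "TTACGTAAAACCCCGGCCAATT",
    "ref2": "ACGTAAAACCCCGGCCAA",
    "ref3": "ACGTAAAATCCCGGCCAA",
    "ref4": "TTTTTTTT",  # primers absent, so this reference is dropped
}
taxonomy = {
    "ref1": "Genus_a species_x",
    "ref2": "Genus_a species_y",
    "ref3": "Genus_b species_z",
    "ref4": "Genus_c species_w",
}

# Step 1: extract the target region from every reference that has both primers.
extracted = {}
for ref_id, seq in references.items():
    region = extract_region(seq, "ACGT", "TTGGCC")
    if region is not None:
        extracted[ref_id] = region

# Step 2: collapse references that are identical over the extracted region.
database = dereplicate(extracted, taxonomy)
print(database)
```

In this toy case ref1 and ref2 are identical over the extracted region, so they collapse into one entry that carries both taxonomy labels. That collapse is the loss of resolution relative to an ASV table mentioned above.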

Best,
Justine

Peter_Kos · June 21, 2023, 2:49pm

Thanks a lot for the comments from both of you.

Peter

Peter_Kos · August 24, 2023, 3:38pm

Dear Justine,

Thanks for the suggestion. I have read your paper and have now finally found some time to look into this possibility properly.
The first question that arises is whether it is possible at all to create Sidle-compatible databases from PR2. Even in the realm of bacteria, you say that "Database selection should be considered carefully if you plan to reconstruct a phylogenetic tree. Currently, only greengenes 13_8 and Silva 128 are compatible with tree building. Other versions may fail or the tree may not be constructed correctly."
What can I possibly hope for with a different database that has 10 taxonomic levels and possibly further differences, if not even all Silva databases work?
I was hoping to be able to use this database, as the plugins I have used so far in this project handle PR2 nicely.
[The other possibility would perhaps be Silva, but my copy of SILVA_138_SSURef_NR99_tax_silva_trunc.fasta seems to have only 58563 Eukaryotes (I have not checked v128), whereas PR2 is a Eukaryote database with about 220k entries.]
In recent weeks I have calculated and compared the compositions of my environmental samples using 4 different variable regions, PE and single end, and it would be very exciting to see what these regions can say together.

Thx a lot for any comments, help, encouragement, etc.
All the best
Peter

jwdebelius · August 24, 2023, 4:12pm

Hi @Peter_Kos,

There are 3 layers of possible issues.

1. Direct tree reconstruction. Sidle's reconstruction builds on the q2-fragment-insertion plugin, and it requires that the tree you use to scaffold your data match the database you use to reconstruct your data in both version and number. If you want to build a PR2 fragment insertion tree, I'm sure the community would welcome it. That said, if you're interested in a phylogenetic hypothesis with your sequences directly, tree building on your short reads might be a better answer for you.
2. Using other databases with Sidle if you're not building a tree is not a problem from a sequence perspective. I think the preprint uses Optivag, which is a completely different database. It just doesn't allow phylogenetic reconstruction (see point 1).
3. That said, I think the code currently enforces a 7-level taxonomic hierarchy. This isn't required, per se, but was implemented to make taxonomy cleaning easier.
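One possible workaround for the 7-level constraint is to collapse a deeper taxonomy string down to seven ranks before building the database. The sketch below is only an illustration: the function name, the semicolon separator, the placeholder rank labels, and the choice of which ranks to keep are all assumptions, and I have not checked whether Sidle accepts the result.

```python
def to_seven_levels(taxon: str, keep: tuple, sep: str = ";") -> str:
    """Keep exactly seven ranks, selected by position, from a delimited taxonomy string."""
    ranks = [rank.strip() for rank in taxon.split(sep) if rank.strip()]
    if len(keep) != 7:
        raise ValueError("exactly seven rank positions must be selected")
    if max(keep) >= len(ranks):
        raise ValueError(f"taxonomy has only {len(ranks)} ranks: {taxon!r}")
    return sep.join(ranks[i] for i in keep)


# Placeholder nine-level string; real rank names depend on the database version.
nine_levels = "L1;L2;L3;L4;L5;L6;L7;L8;L9"
# Hypothetical choice: drop the 2nd and 4th ranks and keep the rest.
collapsed = to_seven_levels(nine_levels, keep=(0, 2, 4, 5, 6, 7, 8))
print(collapsed)
```

Which ranks to drop is a biological decision rather than a technical one, so the `keep` positions would need to be chosen for the specific database.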

Best,
Justine