Seeking advice for a Thesis project

nara · December 29, 2025, 5:29pm

Hi everyone,

I am currently an undergraduate Data Science student looking for advice on structuring a workflow for my upcoming thesis.

My Background: I have successfully completed the standard QIIME 2 tutorials (like Moving Pictures) and have a basic understanding of amplicon analysis. Since my major is Data Science, I also have some knowledge of Machine Learning and can program in Python and R.

The Goal: I am looking to define a scope for my thesis that goes deeper than the standard tutorial workflow. I want to demonstrate a more advanced application of microbiome analysis, but I am currently stuck on brainstorming a specific direction.

My Questions: Could anyone suggest ways to extend a standard 16S pipeline to make it suitable for a Data Science thesis?

Specifically, since I do not have my own samples, could you recommend high-quality public datasets related to human gut health or food microbiology?

Any guidance would be greatly appreciated!

Thank you so much for your time and help.

gregcaporaso · January 11, 2026, 6:13pm

Hi @nara,
Welcome to the QIIME 2 Forum!

This is a bit of a different direction than you're describing here, but a data set that you could consider would be the full gut-to-soil dataset, which we have archived under CC-BY license (meaning you can use it for whatever you want - you just need to cite the original paper) in Zenodo here. Probably the best way to learn about the data and study is via the gut-to-soil tutorial and the links therein. While this data isn't specifically focused on human gut health or food microbiology, my pitch is that it's related to both: the composting material can support the growth of high quality fiber-rich food (e.g., in areas where it is challenging to grow high quality foods due to poor soil quality), and that food in turn can support human gut health. In other words, the data can be used to understand how a material that we are used to thinking of as waste can be cycled to support human health, in the process supporting sanitary management of human excrement and environmental sustainability (e.g., through reduced reliance on fertilizers).

One problem, off the top of my head, that could be interesting for a data science PhD is how to align the samples in a timeseries based on characteristics of the samples (e.g., their phylogenetic composition) as opposed to strictly based on the time when the sample was collected. This could be relevant for studies such as this, where we're trying to understand a microbially driven process through replicated time series data, and the process that we're studying might take more or less time in different replicates. Here the process is composting, but the process could alternatively be fermentation, development of plant-supporting microbial communities in soil, etc, and this would make it easier to get insight into "microbial phases" that occur throughout the process, or to quantify or rank different rates at which the process occurs in different replicates.

For example, in this data you might align the 15 replicates (buckets) based on their phylogenetic composition, and then try to layer on when E. coli begins to disappear based on the paired culturing and qPCR data. That could tell you whether there are certain patterns in the composition of the samples when this important step (an indicator of the safety of the material) happens, and whether that pattern is disrupted or missing when E. coli doesn't disappear. This could inform optimization of the process.
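A minimal sketch of the alignment idea above, assuming dynamic time warping (DTW) as the alignment method and Bray-Curtis dissimilarity as a simple stand-in for a phylogenetic distance such as UniFrac. Neither choice comes from the gut-to-soil study, the two "buckets" are synthetic, and the function names are hypothetical. It only illustrates how two replicates that pass through the same community states at different speeds could be matched by composition rather than by collection time.

```python
import numpy as np


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    denom = u.sum() + v.sum()
    return 0.0 if denom == 0 else np.abs(u - v).sum() / denom


def dtw_align(series_a, series_b, dist=bray_curtis):
    """Align two sample time series (rows = time points, columns = features).

    Returns the total alignment cost and the warping path as a list of
    (index_in_a, index_in_b) pairs.
    """
    n, m = len(series_a), len(series_b)
    # cost[i, j] = cheapest alignment of the first i samples of a
    # with the first j samples of b.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(series_a[i - 1], series_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])

    # Backtrack from the end to recover which samples were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1, j - 1], i - 1, j - 1),  # advance both series
            (cost[i - 1, j], i - 1, j),          # advance series a only
            (cost[i, j - 1], i, j - 1),          # advance series b only
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n, m], path


# Synthetic data (an assumption for illustration): three community "phases",
# each dominated by a different feature.
phases = np.array([[8.0, 1.0, 1.0],
                   [1.0, 8.0, 1.0],
                   [1.0, 1.0, 8.0]])
fast_bucket = phases[[0, 1, 2]]           # one sample per phase
slow_bucket = phases[[0, 0, 1, 1, 2, 2]]  # same phases, twice as slow

total_cost, path = dtw_align(fast_bucket, slow_bucket)
print(total_cost)
print(path)
```

The warping path pairs each sample in the fast bucket with the compositionally matching samples in the slow bucket, so an event such as the disappearance of E. coli could then be compared across buckets on the aligned axis instead of the calendar axis. With real data you would likely swap in a phylogenetic distance and check the literature for alignment methods designed for microbiome time series.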

We are in the early stages of sequencing metagenomes and metatranscriptomes from a follow-up thermophilic HEC study here, and within about 12-18 months I expect that all of those data will be public as well. That data set will have lots of multi-omics data integration challenges which we won't solve in our initial publications, and questions related to quantifying and exploring functional dark matter (i.e., active genes of unknown function) throughout the process. At this stage we have collected all of the samples and are just about to start the sequencing, so the caveat here is that the data doesn't exist yet - so it is probably a little risky to plan a PhD around. Depending on how early you are in your PhD though, if you're interested in the system and the problems this could be something to look forward to.

Hope this helps a little, good luck!

Update: I edited this a little after posting to describe a possible data science project - the initial idea would have been more of a biology project. Note that pre-existing work probably exists in this area - as always, it's good to start a PhD project with a thorough literature review.

nara · January 12, 2026, 1:01am

Dear @caporaso,

Thank you so much for your warm welcome and for the detailed suggestions!

I also wanted to provide a small clarification: I am currently a final-year undergraduate student working on my thesis, rather than a PhD student. However, your insights are extremely helpful as I consider my next steps in data science and microbiology.

Thank you again for your time and for pointing me toward these valuable resources!

gregcaporaso · January 12, 2026, 1:20pm

Ah, I did misunderstand. Well, in any case I hope this can help a little in figuring out a direction for your work! There is a lot unexplored in that data, so I do still think it can be a useful one for you to work with. (I forgot to mention - I am a little biased, as it's data we generated in my lab and I'm interested in seeing other people work with it and work on the system.)

Good luck!