Has anyone used chloroplast "contaminants" to do host phylogeography?

bpscherer · January 18, 2021, 8:00pm

Hi community,

I have 16S data from ~270 plant samples, and most of them have a significant degree of chloroplast "contamination." One of my projects originally sought to combine this data with RADseq data from the host plants in order to assess the influence of host phylogeny and geography on endophytic bacterial communities.

Due to COVID and time/budget issues, it is looking like it won't be feasible for me to generate this RADseq data. I am a Ph.D. student trying to complete my dissertation, and cannot afford to delay graduation forever.

My advisor and I were talking, and he asked if I could do anything with the chloroplast sequences. In a sense, I already have host genetic data, though I'm not sure how useful these chloroplast 16S sequences will be for describing genetic variation among the hosts.

As part of my exploration of this topic, I thought I would post here and see if anyone has any ideas, or if any of you are familiar with people doing anything like this.

Thanks for any help or ideas!

Nicholas_Bokulich · January 20, 2021, 2:30pm

I have not heard of this being done (so check the lit before listening to me!), but just want to give an idea of how to test this question:

Use RESCRIPt to create a plastid 16S gene database, and use the same plugin to test taxonomic resolution. That would validate at which level you can reliably distinguish plants (family? species? strain?), so maybe enough to justify this approach if you are looking at broad phylogenetic groups.

But first you would also need to validate that you have adequate coverage of plastid 16S seqs using your same PCR primers...

good luck!

bpscherer · January 20, 2021, 6:40pm

Thanks for the idea! I'm interested in this approach, but I have a few follow up questions.

By validating that I have "adequate coverage of plastid 16S seqs," do you mean I just need to have an idea of how many plastid sequences I have and how they are distributed in my dataset?

Here is a screenshot of my chloroplast/mitochondria table:

I also have my rep-seqs file, and a taxonomy file I used for the entire dataset that I can use to determine taxonomy of specific features. I think I used the following commands to create/train my classifier.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138-99-seqs.qza \
  --i-reference-taxonomy silva-138-99-tax.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads demux-rep-seqs.qza \
  --o-classification custom-taxonomy.qza

From what I understand, RESCRIPt would be a different way to assign taxonomy to these sequences, and it would give me a more precise picture of the variation among the chloroplast and mitochondrial sequences in this dataset. Is that right?

Thanks again for your post, and for any additional insight or information! I haven't been able to find much in the literature, but it would be great if I could develop this into something.

Nicholas_Bokulich · January 20, 2021, 7:08pm

No, RESCRIPt would be a way to make a reference database specific for plastid 16S sequences (e.g., from GenBank with the right query terms), and then test the taxonomic resolution of that database via cross-validated classification of those same reference sequences. That is not something you can do with the data you have now, since your biological data presumably lack a ground truth (a true, known composition).
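To make the idea of "taxonomic resolution" concrete, here is a minimal Python sketch. It is not RESCRIPt and it is not cross-validated classification; it is a simpler stand-in that asks, for each rank, what fraction of reference sequences have an amplicon that is shared only with references carrying the same label at that rank. The sequences, IDs, and taxon names are all made-up placeholders.

```python
from collections import defaultdict

RANKS = ("family", "genus", "species")

def resolution_by_rank(amplicons, taxonomy):
    """Fraction of references that stay distinguishable at each rank.

    amplicons: reference id -> amplicon sequence
    taxonomy:  reference id -> tuple of labels, one per entry in RANKS
    A reference counts as resolved at a rank if every reference with an
    identical amplicon has the same lineage down to that rank.
    """
    groups = defaultdict(list)
    for ref_id, seq in amplicons.items():
        groups[seq].append(ref_id)

    result = {}
    for depth, rank in enumerate(RANKS, start=1):
        resolved = 0
        for ref_ids in groups.values():
            lineages = {taxonomy[r][:depth] for r in ref_ids}
            if len(lineages) == 1:
                resolved += len(ref_ids)
        result[rank] = resolved / len(amplicons)
    return result

# Toy placeholder data (hypothetical, not real plastid sequences).
toy_amplicons = {"ref1": "ACGT", "ref2": "ACGT", "ref3": "ACGA", "ref4": "TTTT"}
toy_taxonomy = {
    "ref1": ("FamilyA", "GenusA1", "SpeciesA1x"),
    "ref2": ("FamilyA", "GenusA1", "SpeciesA1y"),
    "ref3": ("FamilyA", "GenusA2", "SpeciesA2x"),
    "ref4": ("FamilyB", "GenusB1", "SpeciesB1x"),
}
resolution = resolution_by_rank(toy_amplicons, toy_taxonomy)
print(resolution)
```

In this toy case two species share an identical amplicon, so they can be told apart at genus level but not at species level. A real test would use a proper classifier and held-out sequences, as described above.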

No, I mean that you would need to run a simulated PCR on reference plastid 16S sequences to see how many are hit by your primers. If your primers only hit some plastid 16S sequences, then they could skew your results.
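As a rough illustration of what a simulated PCR does, here is a minimal Python sketch. It assumes exact primer matches (IUPAC degenerate bases allowed, no mismatches), which is stricter than dedicated in-silico PCR tools, and the primers and reference sequences are short toy placeholders, not real 16S primers or plastid sequences.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

def primer_regex(primer):
    return "".join(IUPAC[base] for base in primer.upper())

def amplify(seq, fwd, rev):
    """Return the region between the primers, or None if either primer
    has no exact (degeneracy-aware) match on the forward strand."""
    pattern = re.compile(
        primer_regex(fwd) + "(.*?)" + primer_regex(reverse_complement(rev))
    )
    match = pattern.search(seq.upper())
    return match.group(1) if match else None

def primer_coverage(references, fwd, rev):
    """Return (fraction of references amplified, id -> amplicon for hits)."""
    hits = {}
    for ref_id, seq in references.items():
        amplicon = amplify(seq, fwd, rev)
        if amplicon is not None:
            hits[ref_id] = amplicon
    fraction = len(hits) / len(references) if references else 0.0
    return fraction, hits

# Toy placeholder primers and references (hypothetical).
FWD, REV = "GTGYCAGC", "GGACTACN"
toy_refs = {
    "plastid_A": "AAGTGCCAGCTTTTGGGGAGTAGTCCAA",
    "plastid_B": "AAGTGTCAGCTTTTGGCGCGTAGTCCAA",
    "plastid_C": "AAGTGACAGCTTTTGGGGAGTAGTCCAA",  # mismatch in forward primer site
}
coverage, hits = primer_coverage(toy_refs, FWD, REV)
print(coverage, hits)
```

A reference whose primer site carries a mismatch simply drops out here, which is the kind of bias a coverage check is meant to reveal.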

bpscherer · January 21, 2021, 7:28pm

Okay, so to do this, could I use a tool like SnapGene? Where would I get the reference sequences, GenBank or similar?

Additional background info: the host plants I am interested in are all the same species (Rhizophora mangle), spread out across the state of Florida in the US.

Thanks again for your insight!

Nicholas_Bokulich · January 23, 2021, 7:59am

I am not familiar with this tool, but yes, an in-silico PCR is what I mean. For a QIIME 2 solution, qiime feature-classifier extract-reads would work too. Better yet, use something like PrimerProspector, which will report primer coverage for each clade.

Yes, see my initial post above — RESCRIPt will help with this.

Hm... I am not a plant biologist, but this additional information makes me think that this approach is unlikely to work; i.e., I am skeptical that you can use short reads of plastid 16S rRNA genes to get subspecies-level resolution.

I suppose you could always look in your data to see what the plastid sequences in your query seqs are being classified as, and/or if there are regional patterns (e.g., in beta diversity) based on plastid ASVs alone, but it might be challenging to validate and justify this approach.
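A minimal Python sketch of that exploratory check follows: keep only the features classified as chloroplast, convert to relative abundance, and compare samples with Bray-Curtis dissimilarity. The sample names, feature IDs, and counts are hypothetical placeholders, and the keyword filter assumes the taxonomy strings contain the word "Chloroplast" (as SILVA-style labels do).

```python
def plastid_only(table, taxonomy, keyword="chloroplast"):
    """Keep only features whose taxonomy string contains the keyword."""
    keep = {f for f, tax in taxonomy.items() if keyword in tax.lower()}
    return {
        sample: {f: c for f, c in counts.items() if f in keep}
        for sample, counts in table.items()
    }

def relative(counts):
    """Convert raw counts to relative abundances (empty if no reads)."""
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()} if total else {}

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    features = set(a) | set(b)
    num = sum(abs(a.get(f, 0) - b.get(f, 0)) for f in features)
    den = sum(a.get(f, 0) + b.get(f, 0) for f in features)
    return num / den if den else 0.0

# Toy placeholder data (hypothetical samples and counts).
toy_taxonomy = {
    "asv1": "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast",
    "asv2": "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast",
    "asv3": "d__Bacteria; p__Proteobacteria",
}
toy_table = {
    "north_1": {"asv1": 90, "asv2": 10, "asv3": 500},
    "north_2": {"asv1": 80, "asv2": 20, "asv3": 5},
    "south_1": {"asv1": 10, "asv2": 90, "asv3": 50},
}
profiles = {s: relative(c) for s, c in plastid_only(toy_table, toy_taxonomy).items()}
within_region = bray_curtis(profiles["north_1"], profiles["north_2"])
between_region = bray_curtis(profiles["north_1"], profiles["south_1"])
print(within_region, between_region)
```

If between-region distances are not consistently larger than within-region ones, the plastid ASVs probably carry little geographic signal. With real data you would also want a permutation-based test rather than eyeballing two numbers.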

If you give it a try, let us know what you find.

Good luck!

Nicholas_Bokulich · January 26, 2021, 9:18pm

A post was split to a new topic: RESCRIPt AttributeError: ‘DNAFASTAFormat’ object has no attribute ‘view’

bpscherer · January 28, 2021, 7:46pm

Looks like I was using 2019.7. My fault for not installing the newest version!

I'm trying to use RESCRIPt to create a reference database of 16S plastid sequences as described in my previous conversation here with @Nicholas_Bokulich.

I'm a little unsure of how to use this approach to only get a database of plastids. I feel like it would make sense to start by only downloading plastid sequences, but I'm having some difficulty using the --p-query option to get what I am looking for.

Thanks for any additional help!

Nicholas_Bokulich · January 31, 2021, 10:28am

Hi @bpscherer,

I moved your response back into this topic, as you are asking about search criteria, which is unrelated to the error message you encountered with RESCRIPt (that was solved in the separate topic).

Instead of adjusting the download criteria, you might be able to just filter what you already have. You downloaded all 16S refseqs using the command above. Provided those include enough chloroplast sequences, you can filter those refseqs to grab only chloroplast sequences using qiime taxa filter-seqs, see here for more details:
https://docs.qiime2.org/2020.11/tutorials/filtering/#filtering-sequences

But I am not sure offhand whether RefSeq contains any chloroplast seqs... you should inspect the results. If there are not enough chloroplast sequences, you can consult the NCBI Taxonomy database to determine an appropriate taxid and other criteria for adjusting your query.

But as I mentioned above, I think this might be a bit of a rabbit hole:

... it's worth checking out, but if things don't look promising early on I suggest you don't venture too far in!

Good luck!