MinION 16S data for diversity analysis?

Shruthi · April 11, 2019, 4:33am

Hi,

I have 16S data from MinION that is already demultiplexed and basecalled, and I have the taxonomic classification results for it as well. I now have the taxa id, read id, run id and accuracy for each sample. I was wondering whether it is possible to perform diversity analysis in qiime2 using these data? If so, what format should they be imported in?

Thanks!

Nicholas_Bokulich · April 11, 2019, 12:22pm

You will need to obtain a biom-format feature table and import as a FeatureTable[Frequency] artifact. It is not clear to me what the shape of your data is, so I recommend checking out the biom-format documentation to determine how to format your data correctly.

Sequences must be in fasta format. Taxonomy must be in tab-delimited format like this:

seq-id <tab> taxonomy

If that is unclear, I recommend exporting example data files from the tutorials to examine their contents and format your data accordingly.

You can import your fasta sequences like this:

qiime tools import \
  --input-path sequences.fna \
  --output-path sequences.qza \
  --type 'FeatureData[Sequence]'

and tab-delimited taxonomy like this:

qiime tools import \
  --input-path taxonomy.tsv \
  --output-path taxonomy.qza \
  --type 'FeatureData[Taxonomy]'
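
For anyone starting from per-read classifications rather than a finished table, here is a minimal Python sketch of collapsing reads into a taxon-by-sample count table plus a matching taxonomy file. The column names (`sample`, `taxid`, `accuracy`), the 80% cutoff and the example reads are assumptions for illustration, not the EPI2ME output format. The table layout is the classic tab-delimited one that I believe `biom convert` accepts; check the biom-format documentation before relying on it.

```python
import io
from collections import Counter, defaultdict


def build_feature_table(rows, min_accuracy=0.0):
    """Count reads per taxid and sample from per-read classification rows."""
    counts = defaultdict(Counter)  # taxid -> {sample: read count}
    samples = set()
    for row in rows:
        if float(row["accuracy"]) < min_accuracy:
            continue  # drop low-accuracy classifications
        counts[row["taxid"]][row["sample"]] += 1
        samples.add(row["sample"])
    return counts, sorted(samples)


def write_classic_table(counts, samples, handle):
    """Write a tab-delimited table: one row per taxid, one column per sample."""
    handle.write("#OTU ID\t" + "\t".join(samples) + "\n")
    for taxid in sorted(counts):
        row = [str(counts[taxid][s]) for s in samples]
        handle.write(taxid + "\t" + "\t".join(row) + "\n")


def write_taxonomy(lineages, handle):
    """Write a two-column taxonomy file keyed by the same feature IDs."""
    handle.write("Feature ID\tTaxon\n")
    for taxid in sorted(lineages):
        handle.write(f"{taxid}\t{lineages[taxid]}\n")


# Hypothetical per-read rows (made-up read IDs and accuracies).
reads = [
    {"read_id": "r1", "sample": "S1", "taxid": "562", "accuracy": "93.0"},
    {"read_id": "r2", "sample": "S1", "taxid": "562", "accuracy": "91.5"},
    {"read_id": "r3", "sample": "S1", "taxid": "1280", "accuracy": "88.0"},
    {"read_id": "r4", "sample": "S2", "taxid": "562", "accuracy": "95.2"},
    {"read_id": "r5", "sample": "S2", "taxid": "1280", "accuracy": "70.0"},
]

counts, samples = build_feature_table(reads, min_accuracy=80.0)
table = io.StringIO()
write_classic_table(counts, samples, table)
print(table.getvalue())
```

The feature IDs in the table (here NCBI taxids) have to match the IDs in the first column of the taxonomy file, otherwise the two artifacts will not line up after import.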

Shruthi · April 17, 2019, 7:08am

Hi,
Thanks for the response. The taxonomic classifications are working great!
However, for phylogenetic diversity analysis I need the sequences for each feature (FeatureData[Sequence]), which I do not have. I have my reads mapped to NCBI taxids. Is it possible to generate a sequence table from the NCBI taxids?

Thanks!

Nicholas_Bokulich · April 17, 2019, 11:29am

I think not, unless you are able to extract that information from NCBI.

One possibility would be to pull the sequence for each taxon's type strain, but that would be labor-intensive and inexact.

I'd recommend simply using non-phylogenetic methods for your analysis if you have no way to get the sequence information.
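
To make "non-phylogenetic" concrete: metrics such as Shannon diversity (alpha) and Bray-Curtis dissimilarity (beta) need only the count table, not a tree or sequences. Below is a small Python sketch of both, using made-up counts. In practice the q2-diversity plugin computes these for you; this is only meant to show what goes into the calculation.

```python
import math


def shannon(counts):
    """Shannon entropy (log base 2) of one sample's feature counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples' count vectors."""
    denominator = sum(u) + sum(v)
    if denominator == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / denominator


# Hypothetical counts for four taxa in two samples (same taxon order in both).
sample_a = [10, 10, 10, 10]
sample_b = [40, 0, 0, 0]

print(shannon(sample_a))                # perfectly even community
print(shannon(sample_b))                # a single taxon
print(bray_curtis(sample_a, sample_b))  # 0 = identical, 1 = nothing shared
```

Neither function looks at how closely related the taxa are, which is exactly why they work without sequence data.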

devonorourke · April 17, 2019, 11:50am

Are you talking about the 'sequencing_summary.txt' file output from Albacore?
How did you perform basecalling? Did you run Centrifuge for classification?

Post a few lines of the files you have. Assuming you have the demultiplexed fastq files, you can get a .biom file in a few steps.

Also, did you trim the adapters with Porechop (or equivalent)? Did you error correct with Nanopolish (or equivalent)? And if not, did you classify the raw reads? That's not going to be what you want to do if you're interested in diversity - there's way, way too much noise.

Shruthi · April 17, 2019, 11:47pm

@devonorourke

I performed the demultiplexing in Deepbinner and the basecalling in Guppy, then analysed the fastq files from Guppy in the EPI2ME platform (16S analysis). I took the csv files generated by the platform (both QC and taxa files), merged them, filtered out the reads that were below the threshold, and kept only the columns required for constructing a feature table. (I have a set of Illumina data for the same samples; I exported that feature table to biom and then to a tsv file to inspect it, and made my MinION read table look similar using pandas.) For the taxonomy table, I mapped each taxid to the NCBI database to pull the full taxonomic lineage and ordered it using Python; in the end, this produced a table similar to the feature (taxonomy) table that is generated in qiime2.

Hope it was understandable

devonorourke · April 18, 2019, 12:49am

Using any software from Ryan Wick is a good idea in my opinion!

The last time I ran EPI2ME was over 8 months ago, and the classifier under the hood was Centrifuge. A quick glance at the Nanopore Community forum doesn't suggest that anything's changed, but that was just a cursory search. It might be worth posting your questions on that forum too.

In my humble opinion, there are a few things to consider before trying to shoehorn the Nanopore EPI2ME workflow into QIIME. First, note that EPI2ME is set up for speed, and that's exactly why it's using Centrifuge. It's a short-read aligner that leverages a k-mer sketch of a database - it doesn't perform a global alignment the way something like VSEARCH does (see their paper for more details). This means you can rapidly classify loads of sequences; that's great for real-time sequencing when you are shooting for a 30,000-foot view. I'm not so sure it's what you want if you're going to calculate alpha or beta diversity though, especially if you haven't corrected your raw reads.

To further complicate matters, it's important to note that Centrifuge (and therefore EPI2ME) is not using the same database typical to most QIIME users - it's not Greengenes, it's NCBI. Does that matter to you? Could your resulting classifications be different in part because of a database that is perhaps less well curated? Note that you can run Centrifuge with your .fastq files directly without using EPI2ME at all, and you can build whatever database you want for Centrifuge to work with. It might be interesting for you to test how their default NCBI database compares with something like Greengenes. I'd certainly like to know.

One other thing to circle back to: EPI2ME is probably not correcting your reads prior to classification, and this is absolutely something to resolve if that's the case. It always seemed like the prepackaged workflows through EPI2ME were a few versions behind their standalone software, so I'd suspect that even though you ran the data through Guppy, you could probably improve your read and consensus accuracy with Nanopolish. I can't say for sure though, because it's not clear which version of Guppy you're running - if you can post the specific versions of the software you've used, that'll help. See Ryan's preprint about basecaller comparisons - you'll find that Guppy certainly is the way to go if you're using the most recent version, but the larger improvements in cleaning up the noisy reads can also come from training your classifier with your own data ahead of time.

Let's circle back to your original question:

The short answer is that of course you can perform the tests; the question you're going to wrestle with is whether the results are worth considering if you go about it the EPI2ME way using Centrifuge, or whether you want to take those fastq files and make more of a manual effort to classify things with a different approach. Both this dog study and this sludge paper use MinION 16S data, but both use something other than EPI2ME to get their data classified.

What you'd probably want to do - shout out to @Nicholas_Bokulich here - is to run a 16S experiment with one or several mock communities. Until you have a known community, you're just guessing at which method is better. One step in that direction is this paper which did this for the Zymogen mock community, but they did full metagenomic, not just 16S, so you can't really use it as a benchmark for what you're doing. But hey, that's good news - an opportunity for an experiment!

I think you want to check in on the Nanopore forums first before QIIME to get a sense of how to tackle that question. There are hundreds of Nanopore users doing 16S work - connecting with those folk might be your best bet to get help with workflows tackling questions of diversity. A few Twitter folks to consider following: Arwyn Edwards (@arwynedwards), Devin Drown (@ArcticBiology), Mads Albertsen (@MadsAlbertsen85)... there are many others.

Good luck

Shruthi · May 2, 2019, 1:34am

@devonorourke Thanks for the explanation. The workflow I used in EPI2ME was 16S analysis and I believe it used BLAST for alignment (WIMP workflow uses K-mers). I have corrected the reads (filtered based on quality score, base length, trimmed adapters and barcodes, removed unclassified or unsuccessful reads prior to importing into qiime2 - if that is what you meant).

I did do a quick comparison between 3 databases, using my Illumina data with the Greengenes and SILVA databases and my MinION 16S data with the NCBI database. And yes, there are some major differences between the databases (even for the same Illumina data).

I used Guppy basecalling software version 2.3.7+e041753. I also visualised the quality of the reads using PycoQC. Thanks for taking the time to answer my question and for the helpful feedback. I will try different approaches and see how it goes.

Thanks,
Shruthi

devonorourke · May 3, 2019, 10:49am

Just to clarify for anyone reading this thread in the future:
@Shruthi - your strategy of read quality control is what I'd call "read filtering", which is fundamentally distinct from what I was describing earlier with regards to read correction.
Filtering, in my mind, includes the parameters you've described: length, average Phred score, barcode ID, etc. That's all important stuff, so good job there!
Correction: This is where you use the raw signal from the fast5 files to correct your fastq base calls. While Guppy (and previously Albacore) use the signal itself to do base calling, there are orthogonal means by which the raw data can be used again to get a better estimate of the base assigned to a given position (in most but not all cases). Thus, using a program like Nanopolish is essential in my mind if you're inferring taxonomies in something like a 16S study. If your job was assembling a single genome, then these errors are likely going to be less problematic because of their (relatively) random distribution among your fragmented sequences: for any given base position you will have some depth of coverage, so the correct (i.e. higher Phred) bases are likely to win out in the consensus. However, 16S reads aren't all from one organism, so depth of coverage isn't necessarily a good parameter to filter data by (though it still might be valuable!).

Keep in mind that with Illumina data the error rates are pretty low so we tend to keep all reads of any abundance, but this may not be a good idea for uncorrected Nanopore reads. And even then, denoisers like DADA2 or Deblur are advised because we know that Illumina-based sequencing has enough noise to need adjustment of the raw data!
My guess is that without error correction via Nanopolish (or an equivalent), you're going to find that you have an extraordinarily large number of sequence variants. I'd strongly encourage at least exploring that option and seeing how the corrected and uncorrected data compare.
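
For future readers, here is a minimal Python sketch of the "read filtering" side of that distinction: keeping reads by length and mean Phred score. The thresholds and the tiny FASTQ are placeholders I made up; pick values that suit your amplicon. One detail worth knowing is that mean quality is usually computed by averaging error probabilities rather than the Phred scores themselves, so a few very bad bases pull the mean down sharply.

```python
import io
import math


def mean_phred(quality, offset=33):
    """Mean read quality, averaged over per-base error probabilities."""
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    # max() only avoids returning -0.0 when every base is Q0
    return max(0.0, -10 * math.log10(sum(probs) / len(probs)))


def parse_fastq(handle):
    """Yield (read_id, sequence, quality); assumes 4-line FASTQ records."""
    lines = [line.strip() for line in handle if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:].split()[0], seq, qual


def filter_reads(records, min_len, max_len, min_q):
    """Keep reads within the length window and at or above the quality cutoff."""
    for read_id, seq, qual in records:
        if min_len <= len(seq) <= max_len and mean_phred(qual) >= min_q:
            yield read_id, seq, qual


# Made-up reads: read2 is too short, read3 has very low quality.
fastq = io.StringIO(
    "@read1 runid=abc\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGT\n+\nIIII\n"
    "@read3\nACGTACGT\n+\n!!!!!!!!\n"
)
kept = [rid for rid, _, _ in filter_reads(parse_fastq(fastq), 6, 10, 10.0)]
print(kept)
```

Note that this only discards reads. It does not change any base calls, which is the difference from correction described above.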

Shruthi · May 6, 2019, 5:49am

Thanks @devonorourke.

I will try running the data through nanopolish to fix the reads before proceeding!

devonorourke · May 6, 2019, 3:40pm

@Shruthi - terribly sorry, but now that I'm looking into alternative 16S workflows I'm realizing that Nanopolish probably isn't built to work with 16S data. The program's modules are built to operate with two inputs: your draft fastq file, and your raw fast5 reads. You can certainly create a draft "metagenome" - that's just the dereplicated dataset you get after running any number of the various QIIME pipelines (via VSEARCH, DADA2, Deblur, etc.). Alternatively you could basecall as you've described, and once you have all your fastq files filtered however you want them, just run something like this (from the VSEARCH Wiki):

$VSEARCH --threads $THREADS \
        --derep_fulllength $s.filtered.fasta \
        --strand plus \
        --output $s.derep.fasta \
        --sizeout \
        --uc $s.derep.uc \
        --relabel $s. \
        --fasta_width 0
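
In case it helps to see what full-length dereplication is doing conceptually, here is a rough Python sketch: identical sequences are collapsed into one record and labelled with their abundance. This is not VSEARCH's implementation; the header format only imitates the `--relabel` and `--sizeout` style, the case-insensitive comparison is my assumption, and the reads are made up.

```python
from collections import Counter


def dereplicate(sequences):
    """Collapse identical sequences; return (sequence, count), most abundant first."""
    counts = Counter(seq.upper() for seq in sequences)
    # Sort by decreasing abundance, then by sequence so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_fasta(derep, prefix):
    """Format dereplicated records with a sample prefix and a size annotation."""
    lines = []
    for index, (seq, size) in enumerate(derep, start=1):
        lines.append(f">{prefix}{index};size={size}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Made-up reads: two are exact copies, one differs by a single base.
reads = ["ACGTACGT", "acgtacgt", "ACGTACGA"]
unique = dereplicate(reads)
print(to_fasta(unique, "sampleA."))
```

The single-base difference in the third read is enough to make it its own "unique" sequence, which is why uncorrected Nanopore reads tend to dereplicate into almost as many records as there are reads.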

The problem I'm worried about is that the assumptions the program has built internally are suited for correcting a single individual draft genome, and I would suppose those assumptions are quite different from those needed for 16S data. Nevertheless, their release notes suggest that they've since added features for dealing with direct RNA data (in addition to genomes), so perhaps they are working on incorporating additional types of input datasets. One thing to note is that there are plenty of folks who use Nanopolish to correct reads when they are sequencing just a single amplicon from a single individual (i.e. BRCA1 from one patient) - so amplicon data can work, but at the moment it looks like those amplicons all have to be from the same individual. Here's one example of a group using both 16S and COI markers with Nanopolish in their workflow; however, they are working with individual animals, not mixed communities!

If it turns out Nanopolish isn't going to be able to correct the reads, then perhaps filtering your data is the best we can hope for. To that end, I'd suggest checking out this paper that used metagenomic 16S data from a MinION. You'll notice that they relied on multiple pipelines to compare outputs, and I'd certainly advocate for their willingness to try different approaches. For example, filtering by retaining only reads that have some NCBI ID classified through Centrifuge is one way of tossing out "junk", but it also might bias real data - what if you have really interesting 16S data that is just "Unclassified"? Perhaps another way of sifting through your data is to explore a Bayesian classification approach with Naive Bayes classifiers.
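
To give a feel for the Naive Bayes idea, here is a toy Python sketch that scores a read against each reference taxon using k-mer counts with add-one smoothing and equal priors. It is only an illustration of the principle: the reference sequences and taxon names are invented, k=4 is far shorter than anything you would use on real 16S data, and the classifier in QIIME 2's feature-classifier plugin is a proper scikit-learn model rather than this code.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """All overlapping k-mers of a sequence, upper-cased."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(references, k=4):
    """Count k-mers per taxon from (taxon, sequence) pairs."""
    counts = defaultdict(Counter)
    for taxon, seq in references:
        counts[taxon].update(kmers(seq, k))
    vocab = set()
    for taxon_counts in counts.values():
        vocab.update(taxon_counts)
    return {"k": k, "counts": dict(counts), "vocab_size": len(vocab)}


def classify(read, model):
    """Return (best taxon, log-likelihood per taxon), assuming equal priors."""
    scores = {}
    for taxon, taxon_counts in model["counts"].items():
        # Add-one (Laplace) smoothing so unseen k-mers do not zero out a taxon.
        denom = sum(taxon_counts.values()) + model["vocab_size"]
        scores[taxon] = sum(
            math.log((taxon_counts[km] + 1) / denom)
            for km in kmers(read, model["k"])
        )
    return max(scores, key=scores.get), scores


# Invented reference sequences, far too short to be real 16S genes.
model = train([
    ("taxon_A", "ACGTACGTACGTACGT"),
    ("taxon_B", "TTTTGGGGCCCCAAAA"),
])
best, scores = classify("ACGTACGA", model)  # one mismatch relative to taxon_A
print(best, scores)
```

Because every read gets a score for every taxon, you can look at how decisive the call was instead of simply discarding whatever a hard cutoff labels "Unclassified".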

Sorry I don't have more information at the moment - it's not an area I've been actively involved with for more than a year now. I'm going to the London Calling event at the end of this month and will certainly ask around for what the best minds are up to with regards to dealing with 16S data.

I've started asking a few folks on Twitter who I know do good work with 16S Nanopore data, so hopefully following this thread will lead to something helpful.

Cheers