Doing taxonomy analysis and getting abundances with manifests

ebolyen · October 25, 2017, 12:40am

Sorry for the long post.

I see, so it is very unlikely you'll be able to get the actual quality scores to use DADA2.

Yup, for fastq there are two others, both of which assume Illumina/Casava filenames, so they probably won't help you much with this data.

And then for fasta, we support the seqs.fna format for legacy compatibility with QIIME 1. This tutorial explains how you would use that data in practice. This might be an option, since you don't actually have raw data.

These are some great questions!

To give some background, QIIME 2's types revolve around this notion of Sample and Feature. We have SampleData[...] which tells us things about your samples (usually raw sequence data, or something like alpha-diversity), and we have FeatureData[...] which tells us things about your features (the representative sequences fall into that as FeatureData[Sequence]).
The FeatureTable[...] combines these two with a contingency table, telling us how many times a feature was observed for a given sample.
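To make the contingency-table idea concrete, here is a minimal sketch in plain Python. It is not how QIIME 2 stores a FeatureTable internally; the sample IDs, feature IDs, and the `build_feature_table` helper are all made up for illustration.

```python
from collections import Counter

# Hypothetical observations: one (sample ID, feature ID) pair per observed read.
observations = [
    ("sample-1", "feature-a"),
    ("sample-1", "feature-a"),
    ("sample-1", "feature-b"),
    ("sample-2", "feature-b"),
]


def build_feature_table(observations):
    """Count how many times each feature was observed in each sample."""
    counts = Counter(observations)
    samples = sorted({sample for sample, _ in observations})
    features = sorted({feature for _, feature in observations})
    # Rows are samples, columns are features; unobserved pairs count as 0.
    return {s: {f: counts[(s, f)] for f in features} for s in samples}


table = build_feature_table(observations)
for sample, row in table.items():
    print(sample, row)
```

Each row answers "how many times did I see each feature in this sample?", which is all the table itself encodes. What a feature actually is gets decided elsewhere.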

All of that is well and good, but it doesn't actually define what a feature must be. This is actually pretty useful because it means that we don't need to have a strong definition of sequence similarity (or even that a feature has a sequence). But when features do have sequences associated with them, we call it a FeatureData[Sequence] and usually refer to it as our representative sequences. We expect this data to have been dereplicated and to be DNA (ideally of the same amplicon).

So when you use DADA2 or Deblur, you are getting what we call Amplicon Sequence Variants (ASVs) as your features. Since the ASVs are themselves sequences, we use those as the representative sequence for a given feature (easy!). The goal is that these ASVs are your "true" biological sequences, with all of the sequencing error denoised away. This is why you have so few sequences after this step: these tools treat most of the variation in your data as sequencing error rather than real sequence variation. (Now, it does sound like Deblur hasn't been validated to work with 454, so it's possible that Deblur is very wrong about the actual sequences in your case.) Because of the way these tools behave, the concept of sequence similarity doesn't mean very much to them.

On the other hand you have techniques like OTU clustering, where the concept of sequence similarity is very important. For example, if you used the vsearch plugin, then your resulting features would be standard OTUs.

Then I think you want to actually skip DADA2/Deblur and use qiime vsearch dereplicate-sequences. It will simply select every unique sequence to be its own feature (which you can then choose to cluster, or not).
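Conceptually, dereplication looks something like the sketch below. This is only an illustration of the idea in plain Python, not what vsearch or the plugin actually runs. The `dereplicate` function and the example reads are hypothetical, and using the sequence itself as the feature ID is an assumption made to keep the example short.

```python
from collections import defaultdict


def dereplicate(reads):
    """Collapse exactly identical sequences into one feature each.

    reads: iterable of (sample_id, sequence) pairs.
    Returns (rep_seqs, table), where rep_seqs maps feature ID -> sequence
    and table maps sample ID -> {feature ID: count}.
    """
    rep_seqs = {}
    table = defaultdict(lambda: defaultdict(int))
    for sample_id, sequence in reads:
        # Assumption for illustration: the sequence doubles as its own ID.
        feature_id = sequence
        rep_seqs[feature_id] = sequence
        table[sample_id][feature_id] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return rep_seqs, {s: dict(counts) for s, counts in table.items()}


# Hypothetical reads from two samples.
reads = [
    ("s1", "ACGT"),
    ("s1", "ACGT"),
    ("s1", "ACGA"),
    ("s2", "ACGT"),
]
rep_seqs, table = dereplicate(reads)
print(rep_seqs)
print(table)
```

Note that ACGT and ACGA stay separate features even though they differ by a single base. No similarity threshold is involved at this stage; that only comes in if you choose to cluster afterwards.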

An easy thing would be to just transfer your .qzv files and look at them in https://view.qiime2.org on your local computer's browser.

Using and interacting with the forum will automatically promote your user level over time (there are several levels). See here for the explanation. You are doing great so far!