I have fastq files with no barcodes, just reads with quality.
My task is to map those reads onto the taxonomy tree (Greengenes 13.8) and get abundances, as described in the Moving Pictures tutorial.
The problem is that importing fastq files with samples (but with no barcodes) is only possible through the manifest file format. It looks like doing taxonomy/abundance analysis with those imported files is not trivial. They have this kind of metadata:
At the same time rep-seqs.qza (which is then used for taxonomy/abundances) from the tutorial has the following metadata:
I’m a little uncertain what you mean: are you just trying to get taxonomic assignments (to be visualized with qiime taxa barplot or used as a feature table with qiime taxa collapse), or are you going for something else?
I think there’s a misunderstanding here. There are multiple steps to get to your taxonomy and abundance information:
You start with your SampleData[SequencesWithQuality] which are basically your demultiplexed sequences (no quality control has happened, and we haven’t really mapped out which sequences look interesting or counted them yet). This is usually what you want to import.
From there we perform a denoising step, qiime dada2 denoise-single or qiime deblur denoise-16S, which will manage errors by correcting or discarding any sequence that looks like a mistake from the sequencing instrument. That step will result in a number of “correct” sequences and will count them. This is basically our “feature selection” step. After we’ve denoised the original sequences and counted them, we have what are called Amplicon Sequence Variants (ASVs), which we’ll use as our features (these are analogous to 100% OTUs).
The “correct” sequences are the FeatureData[Sequence]. These no longer have sequencing abundance information, and instead are focused on just what the sequence is for each feature ID (it’s basically a fasta file without any replicated sequences or IDs).
The counts are in the FeatureTable[Frequency] which has the number of times each feature ID is observed in each sample ID.
Using the FeatureData[Sequence] we can do taxonomic classification using the feature-classifier plugin, giving us a new FeatureData[Taxonomy] artifact. We can use that new artifact with the FeatureTable[Frequency] one because they still have the same feature IDs.
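The way these artifacts relate by feature ID can be sketched like this (plain Python with made-up feature IDs and taxonomy strings; these are illustrative, not the actual QIIME 2 data structures):

```python
# FeatureData[Sequence]: feature ID -> representative sequence
rep_seqs = {
    "asv001": "ACGTACGTACGT",
    "asv002": "TTGGCCAATTGG",
}

# FeatureTable[Frequency]: sample ID -> {feature ID -> count}
feature_table = {
    "sampleA": {"asv001": 120, "asv002": 4},
    "sampleB": {"asv001": 7, "asv002": 310},
}

# FeatureData[Taxonomy]: classifications keyed by the SAME feature IDs,
# which is what lets the taxonomy artifact combine with the table.
taxonomy = {fid: "k__Bacteria; p__(made up)" for fid in rep_seqs}

# Joining the table to the taxonomy only works because the IDs match.
for sample, counts in feature_table.items():
    for fid, n in counts.items():
        print(sample, fid, taxonomy[fid], n)
```

The shared feature IDs are the whole trick: any artifact keyed by the same IDs can be combined downstream.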
Let me know if that clarifies things, or if I’ve misunderstood what you are going for!
Thanks. It looks like I’ve understood how it works. But again there are problems and questions…
First of all, I import two files of ~140 MB each that correspond to the forward and reverse reads of the demultiplexed fastq files.
qiime tools import
Yup! FASTQ compresses very well, we store the files as fastq.gz and then the .qza format is actually a zip file also, so getting a smaller size is pretty expected.
You are actually running denoise-single so overlap doesn’t matter as it’s only using your forward reads at the moment. (You can use denoise-paired to have them be merged). Looking at my last reply, it looks like I gave you the wrong method. Sorry!
As for why it is so small, I would suggest running feature-table summarize on your table to check on that. Generally we expect your table and rep-seqs to be much smaller than your raw sequence data. While a few Kb does sound unusually small, it’s not impossible (our tutorial data is ~50kb for instance, but it’s heavily subsampled to run quickly).
-rw-r--r-- 1 root root 56099896 Oct 4 15:46 paired-end-demux.qza
-rw-r--r-- 1 root root 13605 Oct 5 11:35 rep-seqs-dada2.qza
-rw-r--r-- 1 root root 9941 Oct 5 11:35 table-dada2.qza
The size of rep-seqs-dada2.qza is very small. I opened it as a zip archive and looked into the dna-sequences.fasta file. It contains only 200 lines, while the same file in paired-end-demux.qza contains 870,000 lines.
What would be the main reason for such reduction? Are my reads of bad quality or what?
The fasta file in your FeatureData[Sequence] artifact (rep-seqs-dada2.qza) contains representative sequences, which means the reads have already been dereplicated. So it’s not really fair to compare it to your demultiplexed sequences, as those are more or less just raw reads. You would expect to see many, many repeats of the same sequence variant. So what we do is tally up the number of times we see a given sequence variant in your feature table. This way we don’t have to store the same redundant sequences thousands of times.
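Dereplication in toy form looks like this (a minimal sketch with made-up four-base "reads"):

```python
from collections import Counter

# Toy demultiplexed reads: many raw reads collapse to few unique variants.
reads = ["ACGT", "ACGT", "ACGT", "TTAA", "TTAA", "GGCC"]

counts = Counter(reads)    # tally each unique sequence variant
rep_seqs = list(counts)    # each variant is stored exactly once

print(len(reads), "reads ->", len(rep_seqs), "representative sequences")
print(dict(counts))        # the per-variant tallies live in the feature table
```

The sequences go into the FeatureData[Sequence] fasta once each; the tallies go into the FeatureTable[Frequency].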
You should inspect the results of running qiime feature-table summarize on table-dada2.qza to see how many features were really observed (it sounds like you’ll have roughly 100 features though).
Without knowing more about your dataset, I can’t really say whether that feature count is low or not. But you should know that DADA2 expects sequences to be strictly biological, so you should ensure that your trim-left parameter is large enough to remove any primers/adapters that may be present, as these can confuse the error model.
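Conceptually, trim-left just drops the first N bases of every read before the error model sees it (a toy illustration; the "primer" below is hypothetical, not from your data):

```python
# Fake primer prefix followed by the biological sequence.
trim_left = 5
read = "AAAAA" + "ACGTACGTACGT"   # hypothetical 5-base primer + real bases
print(read[trim_left:])           # only the biological part remains
```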
Plugin error from dada2:
An error was encountered while running DADA2 in R (return code 1),
please inspect stdout and stderr to learn more.
Debug info has been saved to /tmp/qiime2-q2cli-err-hxvkb3s8.log
The contents of /tmp/qiime2-q2cli-err-hxvkb3s8.log: dada2.error.txt (2.8 KB)
(the results are the same if I try with 120 instead of 250)
2) Learning Error Rates
Initializing error rates to maximum possible estimate.
Error rates could not be estimated.
Error in err[c(1, 6, 11, 16), ] <- 1 :
incorrect number of subscripts on matrix
I’ve never seen this before. @benjjneb, do you know what causes this?
Looking at your sample fastq file, every quality score is I which is pretty unusual, it doesn’t look like you actually have any quality information for DADA2 to use (which is probably why we get that error).
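If you want to verify this yourself, here is a quick check (assuming a standard 4-line-per-record FASTQ layout; the records below are made up):

```python
# If every quality character is 'I', every base is Phred+33 Q40, i.e.
# the file carries no real quality information for DADA2's error model.

def quality_chars(fastq_lines):
    # quality strings are every 4th line, starting at index 3
    return set("".join(fastq_lines[3::4]))

fake = ["@read1", "ACGT", "+", "IIII", "@read2", "TTGG", "+", "IIII"]
print(quality_chars(fake))   # a single constant character -> fake scores
print(ord("I") - 33)         # the Phred+33 value encoded by 'I'
```

A real Illumina run would show a spread of characters here, not one constant value.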
Deblur doesn’t use the quality score to denoise the sequence data, so your quality scores all being written as the letter I doesn’t bother it.
The reason the files are so small is because we find the representative sequences and then count the number of times that each occurs in every sample in the feature-table. That way we don’t have to hold onto a bunch of reads that say the same thing (making it much smaller than your raw data).
I think both tools can make strong arguments for their use. Here are their respective papers:
If I’m reading the provided sample correctly, this appears to be single-end 454 data with fake quality scores (everything is Q=40). Is that correct?
The DADA2 plugin will fail on such data, as it is expecting a range of scores as in real data. While you can run DADA2 on 454 data, it needs the real quality scores. (Note to self: add flag to plugin to allow running w/o quality scores).
I am not a Deblur expert, but if this is 454 data you might want to double-check with @wasade on the right way to proceed, as Deblur might be intended for Illumina data?
Yes, that’s true, the score is fake. HMP datasets are in fasta format without scores, so I’ve put fake ones for QIIME to properly load the data. (Btw, is there another way to load demultiplexed fasta/fastq data into QIIME artifacts without a manifest?)
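For reference, the conversion I did is roughly equivalent to this sketch (record names are made up; it assumes one-line sequences per FASTA record):

```python
# Wrap FASTA records in FASTQ with a constant fake quality ('I' = Q40)
# so the importer accepts them.
def fasta_to_fake_fastq(fasta_lines, qual_char="I"):
    out = []
    for i in range(0, len(fasta_lines), 2):
        header, seq = fasta_lines[i], fasta_lines[i + 1]
        out += ["@" + header[1:], seq, "+", qual_char * len(seq)]
    return out

records = [">read.1", "ACGTACGT", ">read.2", "TTGGAACC"]
print("\n".join(fasta_to_fake_fastq(records)))
```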
Speaking about the representative sequences… I made a simple script to count distinct sequences in SRS011157.fsa. It turns out that 35230 of 37707 reads are unique. Why then do I get just hundreds with QIIME? What do you mean by “representative sequences”? How similar do sequences need to be in order to be considered representative by QIIME? What if I want the filtered file (after the dada2/deblur procedure) to keep all the uniques?
How do I check counts for representative sequences? It somehow should be done with
I see, so it is very unlikely you’ll be able to get the actual quality scores to use DADA2.
Yup, for fastq, there are two others which assume Illumina/Casava filenames, which probably won’t help you much for this data.
And then for fasta, we support the seqs.fna for legacy compatibility with QIIME 1. This tutorial explains how you would use that data in practice. This might be an option, since you don’t actually have raw-data.
These are some great questions!
To give some background, QIIME 2’s types revolve around this notion of Sample and Feature. We have SampleData[...] which tells us things about your samples (usually raw sequence data, or something like alpha-diversity), and we have FeatureData[...] which tells us things about your features (the representative sequences fall into that as FeatureData[Sequence]).
The FeatureTable[...] combines these two with a contingency table, telling us how many times a feature was observed for a given sample.
All of that is well and good, but it doesn’t actually define what a feature must be. This is actually pretty useful because it means that we don’t need to have a strong definition of sequence similarity (or even that a feature has a sequence). But when features do have sequences associated with them, we call it a FeatureData[Sequence] and usually refer to it as our representative sequences. We expect this data to have been dereplicated and to be DNA (ideally of the same amplicon).
So when you use DADA2 or Deblur, you are getting what we call Amplicon Sequence Variants (ASVs) as your features. Since the ASVs are themselves sequences, we use those as the representative sequence for a given feature (easy!). The goal is that these ASVs are your “true” biological sequences, where all of the sequencing error has been denoised. This is why you have so few sequences after this step: these tools believe that most of your data isn’t actually real sequence variation, but rather sequencing error. (Now, it does sound like Deblur hasn’t been validated to work with 454 data, so it’s possible that Deblur is very wrong about the actual sequences in your case.) Because of the way these tools behave, the concept of sequence similarity doesn’t mean very much to them.
On the other hand you have techniques like OTU clustering, where the concept of sequence similarity is very important. For example, if you used the vsearch plugin, then your resulting features would be standard OTUs.
Then I think you want to actually skip DADA2/Deblur and use qiime vsearch dereplicate-sequences. It will simply select every unique sequence to be its own feature (which you can then choose to cluster, or not).
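In toy form, dereplication does just this (illustrative Python with made-up samples and reads, not what vsearch actually runs):

```python
from collections import Counter

# Every unique sequence becomes its own feature: no denoising,
# no similarity clustering, just exact-match tallying per sample.
samples = {
    "sample1": ["ACGT", "ACGT", "TTAA"],
    "sample2": ["ACGT", "GGCC", "GGCC"],
}

table = {s: Counter(reads) for s, reads in samples.items()}
features = sorted(set().union(*table.values()))
print(features)   # every unique read is kept as a feature
```

So with your data, all ~35230 unique reads would survive as features, which you could then cluster (or not) afterwards.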
An easy thing would be to just transfer your .qzv files and look at them in https://view.qiime2.org on your local computer’s browser.
Using and interacting with the forum will automatically promote your user-level over time (there’s several levels). See here for the explanation. You are doing great so far!