Doing taxonomy analysis and getting abundancies with manifests

stanislav.iablokov · September 11, 2017, 4:57pm

Hello.

I have fastq files with no barcodes, just reads with quality.
My task is to map those reads onto the taxonomy tree (Greengenes 13.8) and get abundancies, as described in the Moving Pictures tutorial.

The problem is that importing fastq files with samples (but with no barcodes) is only possible through the manifest file format. It looks like doing taxonomy/abundancies with those imported files is not trivial. They have this kind of metadata:
type:"SampleData[SequencesWithQuality]"
format:"SingleLanePerSampleSingleEndFastqDirFmt"

At the same time rep-seqs.qza (which is then used in taxonomy/abundancies) from the tutorial has the following metadata:
type:"FeatureData[Sequence]"
format:"DNASequencesDirectoryFormat"

Any idea on how to deal with this problem?

ebolyen · September 11, 2017, 8:50pm

Hey @stanislav.iablokov,

I'm a little uncertain what you mean, are you just trying to get taxonomic assignments (to be visualized in qiime taxa barplot or used just as a feature-table with qiime taxa collapse), or are you going for something else?

I think there's a misunderstanding here. There's multiple steps to get to your taxonomy and abundance information:

You start with your SampleData[SequencesWithQuality] which are basically your demultiplexed sequences (no quality control has happened, and we haven't really mapped out which sequences look interesting or counted them yet). This is usually what you want to import.

From there we perform a denoising step qiime dada2 denoise-single or qiime deblur denoise-16s which will manage error (correcting or discarding) any sequence that looks like a mistake from the sequencing instrument. That step will result in a number of "correct" sequences and will count them. This is basically our "feature selection" step. After we've denoised the original sequences and counted them we have what are called Amplicon Sequence Variants (ASVs) which we'll use as our features (these are analogous to 100% OTUs).

The "correct" sequences are the FeatureData[Sequence]. These no longer have sequencing abundance information, and instead are focused on just what the sequence is for each feature ID (it's basically a fasta file without any replicated sequences or IDs).

The counts are in the FeatureTable[Frequency] which has the number of times each feature ID is observed in each sample ID.

Using the FeatureData[Sequence] we can do taxonomic classification using the feature-classifier plugin, giving us a new FeatureData[Taxonomy] artifact. We can use that new artifact with the FeatureTable[Frequency] one because they still have the same feature IDs.

Let me know if that clarifies things, or if I've misunderstood what you are going for!

stanislav.iablokov · September 19, 2017, 3:45pm

Thank you for your answer. Basically I'm trying to do the same thing as the following script does in QIIME 1.

#!/bin/bash

SEQ_DIR=/home/xxx/microbiom/qiime/V1/fastq
OUT_DIR=qiime_output
phred_offset=33

current_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd ${current_dir}

DB_PATH=/home/xxx/microbiom/qiime/GreenGenes_05_13_97
DB_TAXONOMY=$DB_PATH/97_otu_taxonomy.txt
DB_FASTA=$DB_PATH/97_otus.fasta
DB_TREE=$DB_PATH/97_otus.tree
num_jobs_pick=2


rm -rf ${OUT_DIR}; mkdir ${OUT_DIR}
# do prefiltering
echo "Preparing the files..."
mkdir ${OUT_DIR}/seqs
for file in `ls -1 ${SEQ_DIR}`; do
    sample=`echo ${file} | awk -F'.fastq' '{print $1;}'`
    echo ${file}
    echo ${sample}
    split_libraries_fastq.py -i ${SEQ_DIR}/${file} --sample_id ${sample} -o ${OUT_DIR}/seqs/${sample} --barcode_type 'not-barcoded' --phred_offset ${phred_offset} --phred_quality_threshold 19
done

# do merging
echo "Merge seqs..."
for seq_file in `find ${OUT_DIR}/seqs/ -name seqs.fna`; do
        echo $seq_file
        cat $seq_file >> ${OUT_DIR}/seqs/all_seqs.fna
        rm $seq_file
done

# do mapping
echo "Reads mapping..."
mkdir ${OUT_DIR}/map
start=$SECONDS
pick_closed_reference_otus.py -a -O $num_jobs_pick -i ${OUT_DIR}/seqs/all_seqs.fna -r $DB_FASTA -t $DB_TAXONOMY -p "cnf/pick_closed_reference_otus.properties" -o ${OUT_DIR}/otu
echo "done"
duration=$(( SECONDS - start ))
echo "mapping took, sec:"
echo $duration
rm -rf ${OUT_DIR}/map


# delete large files
for all_seqs_file in `ls ${OUT_DIR}/seqs | grep '\.fna'`; do
    rm ${OUT_DIR}/seqs/${all_seqs_file}
done


# convert biom to txt
echo "Converting BIOM..."
# original
biom convert -i ${OUT_DIR}/otu/otu_table.biom -o ${OUT_DIR}/otu/otu_table.txt --to-tsv --header-key taxonomy

So I want to load reads from a sample (a fastq or fasta file), get taxonomy (probably from GreenGenes 13.8 or another catalog) and abundancies in a format like:

Constructed from biom file
OTU ID	ERR186116	ERR186183	ERR186083	ERR186077	ERR186103	ERR186140	ERR186172	ERR186047	ERR186168	ERR186068	taxonomy
125270	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
312048	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Faecalibacterium; s__prausnitzii
851734	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Oribacterium; s__

What is the analog in QIIME2 and how to import/load GreenGenes catalog?

ebolyen · September 19, 2017, 11:20pm

Thanks for the script @stanislav.iablokov!

You should be able to follow the moving pictures tutorial. It goes over more analysis than you seem to need, however what you should pay attention to will be these steps:

qiime dada2 denoise-single (from the DADA2 section)
- creates the Amplicon Sequence Variants that we assign taxonomy to
qiime feature-classifier classify-sklearn (from the taxonomic analysis section)
- assigns the taxonomy to the ASVs (using Greengenes)
qiime taxa collapse (from the differential abundance section)
- re-labels the feature-table to use the taxonomy of FeatureData[Taxonomy]

After that, you should have a FeatureTable[Frequency] artifact that has the Greengenes taxonomic assignments as IDs, you can then use:

qiime tools export

to get a BIOM file from the artifact, and then your

biom convert -i ... -o ... --to-tsv --header-key taxonomy

commands from your script should do the job.

Let me know if you need more specific details or have other questions, but this should be what you need to get started!

stanislav.iablokov · October 4, 2017, 9:50pm

Thanks. It looks like I've understood how it works. But again there are problems and questions...
First of all, I import 2 files of size 140Mb each that correspond to the forward and reverse reads of the demultiplexed fastq files.
qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path manifest.csv
--output-path paired-end-demux.qza
--source-format PairedEndFastqManifestPhred33

manifest.csv:
sample-id,absolute-filepath,direction
sample-1,$PWD/../fastq/R1.fastq,forward
sample-1,$PWD/../fastq/R2.fastq,reverse

The resulting paired-end-demux.qza has size of 56Mb. Is that OK? As far as I could some kind of compression is applied, right?

Next, as it was told:
qiime dada2 denoise-single
--i-demultiplexed-seqs paired-end-demux.qza
--p-trim-left 0
--p-trunc-len 250
--o-representative-sequences rep-seqs-dada2.qza
--o-table table-dada2.qza

Those qza files has size of few Kb, which doesn't look good. Is that due to the low level of overlap between forward and reverse reads?

ebolyen · October 4, 2017, 10:11pm

Hey @stanislav.iablokov

Yup! FASTQ compresses very well, we store the files as fastq.gz and then the .qza format is actually a zip file also, so getting a smaller size is pretty expected.

You are actually running denoise-single so overlap doesn't matter as it's only using your forward reads at the moment. (You can use denoise-paired to have them be merged). Looking at my last reply, it looks like I gave you the wrong method. Sorry!

As for why it is so small, I would suggest running feature-table summarize on your table to check on that. Generally we expect your table and rep-seqs to be much smaller than your raw sequence data. While a few Kb does sound unusually small, it's not impossible (our tutorial data is ~50kb for instance, but it's heavily subsampled to run quickly).

You may get better results if you trim your reads a little bit more before denoising. Check out the moving pictures tutorial which provides some justifications for trimming its example data-set.

stanislav.iablokov · October 6, 2017, 4:54pm

I've done the same thing with denoise-paired:

-rw-r--r-- 1 root root 56099896 Oct  4 15:46 paired-end-demux.qza
-rw-r--r-- 1 root root    13605 Oct  5 11:35 rep-seqs-dada2.qza
-rw-r--r-- 1 root root     9941 Oct  5 11:35 table-dada2.qza

The size of rep-seqs-dada2.qza is very small. I opened it as a zip archive and looked into dna-sequences.fasta file. It contains only 200 lines while the same file in the paired-end-demux.qza containes 870000 lines.

What would be the main reason for such reduction? Are my reads of bad quality or what?

ebolyen · October 6, 2017, 10:02pm

@stanislav.iablokov,

The fasta file in your FeatureData[Sequence] artifact (rep-seqs-dadad2.qza) are representative sequences which means the reads have already been dereplicated. So it's not necessarily fair to compare it to your demultiplexed sequences, as those are more or less just raw reads. You would expect to see many many repeats of the same sequence variant. So what we do is just tally up the number of times we see a given sequence variant in your feature-table. This way we don't have to store the same redundant sequences thousands of times.

You should inspect the results of running qiime feature-table summarize on table-dada2.qza to see how many features were really observed (it sounds like you'll have roughly 100 features though).

Without knowing more about your dataset, I can't really say whether that feature count is low or not. But you should know that DADA2 expects sequences to be strictly biological. So you should ensure that your trim-left parameter is enough to remove any primers/adapters that may be present as these can confuse the error model.

stanislav.iablokov · October 23, 2017, 4:46pm

Let's go back to single end sequences.

I'm making tests on Human Microbiome Project (HMP) datasets.
My manifest.csv file is as follows:

sample-id,absolute-filepath,direction
sample-1,$PWD/../fastq/SRS011157.fastq,forward

Sample here: https://cloud.mail.ru/public/Efyn/bkBfTunjr

So I import it through:

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest.csv \
  --output-path single-end-demux.qza \
  --source-format SingleEndFastqManifestPhred33

Next I go with dada2:

qiime dada2 denoise-single
--i-demultiplexed-seqs single-end-demux.qza
--p-trim-left 0
--p-trunc-len 250
--o-representative-sequences rep-seqs-dada2.qza
--o-table table-dada2.qza

and get this error:

Plugin error from dada2:
An error was encountered while running DADA2 in R (return code 1),
please inspect stdout and stderr to learn more.
Debug info has been saved to /tmp/qiime2-q2cli-err-hxvkb3s8.log

The contents of /tmp/qiime2-q2cli-err-hxvkb3s8.log:
dada2.error.txt (2.8 KB)
(the results are the same if I try with 120 instead of 250)

If I try deblur:

qiime quality-filter q-score
--i-demux single-end-demux.qza
--o-filtered-sequences demux-filtered.qza
--o-filter-stats demux-filter-stats.qza

qiime deblur denoise-16S
--i-demultiplexed-seqs demux-filtered.qza
--p-trim-length 250
--o-representative-sequences rep-seqs-deblur.qza
--o-table table-deblur.qza
--o-stats deblur-stats.qza

then it all goes fine, but the resulting deblured files https://cloud.mail.ru/public/HpdX/tbS36yutq are pretty small. What's going on? Why are they so small?

Could someone please also clarify me about the difference of dada2 and deblur? What is best for the analyse of, let's say, human microbiota samples?

ebolyen · October 24, 2017, 12:04am

Hey @stanislav.iablokov!

Looking through your logs (thanks for those!!):

Snippet from that file:

2) Learning Error Rates
Initializing error rates to maximum possible estimate.
Error rates could not be estimated.
Error in err[c(1, 6, 11, 16), ] <- 1 : 
  incorrect number of subscripts on matrix
Calls: dada
Execution halted

I've never seen this before. @benjjneb, do you know what causes this?

Looking at your sample fastq file, every quality score is I which is pretty unusual, it doesn't look like you actually have any quality information for DADA2 to use (which is probably why we get that error).

Deblur doesn't use the quality score to denoise the sequence data, so your quality scores being all written as the letter I don't bother it.

The reason the files are so small is because we find the representative sequences and then count the number of times that each occurs in every sample in the feature-table. That way we don't have to hold onto a bunch of reads that say the same thing (making it much smaller than your raw data).

I think both tools can make strong arguments for their use. Here are their respective papers:

Let me know if that helps!

benjjneb · October 24, 2017, 1:00am

If I'm reading the provided sample correctly, this appears to be single-end 454 data with fake quality scores (everything is Q=40). Is that correct?

The DADA2 plugin will fail on such data, as it is expecting a range of scores as in real data. While you can run DADA2 on 454 data, it needs the real quality scores. (Note to self: add flag to plugin to allow running w/o quality scores).

I am not a Deblur expert, but if this is 454 data you might want to double-check with @wasade on the right way to proceed, as Deblur might be intended for Illumina data?

stanislav.iablokov · October 24, 2017, 11:36am

@ebolyen, @benjjneb

Yes, that's true, the score is fake. HMP datasets are in fasta format without scores, so I've put fake ones for QIIME to properly load data. (Btw, Is there the other way around to load demultiplexed fasta/fastq data to QIIME artifacts without manifests?)

Speaking about the representative sequences... I made a simple script to count different sequences in SRS011157.fsa. It turns out that 35230 of 37707 reads are unique. Why do I get with QIIME just hundreds then? What do you mean by "representative sequences"? How much should sequences be similar in order to be considered by QIIME as representative? What if I want to have in the filtered file (after dada2/deblur procedure) all the uniques?

How do I check counts for representative sequences? It somehow should be done with

qiime feature-table summarize
--i-table table-deblur.qza
--o-visualization table-deblur.qzv

but on my server I don't have graphical interface:

Plugin error from feature-table:
Invalid DISPLAY variable
Debug info has been saved to /tmp/qiime2-q2cli-err-z7z3pl4j.log

What should I do then?

P.S. Please promote my forum level/status, so that I don't wait every time for post approval

wasade · October 24, 2017, 11:21pm

Thanks, @benjjneb! Deblur is not defined for 454 data.

ebolyen · October 25, 2017, 12:40am

Hey @stanislav.iablokov!

Sorry for the long post.

I see, so it is very unlikely you'll be able to get the actual quality scores to use DADA2

Yup, for fastq, there are two others which assume Illumina/Casava filenames, which probably won't help you much for this data.

And then for fasta, we support the seqs.fna for legacy compatibility with QIIME 1. This tutorial explains how you would use that data in practice. This might be an option, since you don't actually have raw-data.

These are some great questions!

To give some background, QIIME 2's types revolve around this notion of Sample and Feature. We have SampleData[...] which tells us things about your samples (usually raw sequence data, or something like alpha-diversity), and we have FeatureData[...] which tells us things about your features (the representative sequences fall into that as FeatureData[Sequence]).
The FeatureTable[...] combines these two with a contingency table, telling us how many times a feature was observed for a given sample.

All of that is well and good, but it doesn't actually define what a feature must be. This is actually pretty useful because it means that we don't need to have a strong definition of sequence similarity (or even that a feature has a sequence). But when features do have sequences associated with them, we call it a FeatureData[Sequence] and usually refer to it as our representative sequences. We expect this data to have been dereplicated and to be DNA (ideally of the same amplicon).

So when you use DADA2 or Deblur, you are getting what we call Amplicon Sequence Variants (ASVs) as your features. Since the ASVs are themselves sequences, so we use those as the representative sequence for a given feature (easy!). The goal is that these ASVs are your "true" biological sequences, where all of the sequencing error has been denoised. This is why you have so few sequences after this step, because they believe that most of your data isn't actually real sequence variation, but rather sequencing error. (Now it does sound like Deblur hasn't been validated to work with 454, so it's possible that Deblur is very wrong about the actual sequences in your case). Because of the way these tools behave, the concept of sequence similarity doesn't mean very much to them.

On the other hand you have techniques like OTU clustering, where the concept of sequence similarity is very important. For example, if you used the vsearch plugin, then your resulting features would be standard OTUs.

Then I think you want to actually skip DADA2/Deblur and use qiime vsearch dereplicate-sequences. It will simply select every unique sequence to be its own feature (which you can then choose to cluster, or not).

An easy thing would be to just transfer your .qzv files and look at them in https://view.qiime2.org on your local computer's browser.

Using and interacting with the forum will automatically promote your user-level over time (there's several levels). See here for the explanation. You are doing great so far!