Clustering returns almost no features

mat.lesche · November 26, 2018, 6:06pm

Dear All,

We are trying to establish a 16s library prep in our group and have used ZymoBIOMICS™ Microbial Community DNA Standard (ZymoBIOMICS Microbial Community Standards | ZYMO RESEARCH) for the test library prep. We try to target the V4 region and we are using this kit, so we know which species are to expect.
I'm following the guidelines for the general workflow (Overview of QIIME 2 Plugin Workflows — QIIME 2 2018.11.0 documentation) and I'm using SingleEnd data because the overlap of the PairedEnd data had an overlap which was too short (<= 6bp). I have already applied the following steps

quality filtering (resulting in fragments with a size of 140-150 bp)
denoising with dada2 denoise-single

We created 5 libraries with about 500,000 fragment each. The denoise step reduces my sequences to 147 unique features over these 5 libraries. That was a bummer but to be expected due to the small region which was amplified.
Next I want to do the clustering with
qiime vsearch cluster-features-closed-reference
and I used the 'both' option for strand parameter

My input are the files from the denoise step and the reference from the ZymoBIOMICS kit (https://s3.amazonaws.com/zymo-files/BioPool/ZymoBIOMICS.STD.refseq.v2.zip). I created as FeatureData|Sequence artefact as described here Importing data — QIIME 2 2018.11.0 documentation

However after running the clustering step only 2-3 features on these references are identified and I can't find out why that is the case. As a control, I aligned these 147 features to the genome with the gsnap aligner and got about 30 unique and perfect matches over the 8 reference genomes. The other had 9-15 soft clips. So these 147 sequences exist but vsearch cluster-features-closed-reference can't seem to find them. So I started to look in to the files, the forum and the tutorial sides and found this site: Training feature classifiers with q2-feature-classifier — QIIME 2 2018.11.0 documentation

I created a shorter reference by using
qiime feature-classifier extract-reads --i-sequences Zymo_v2.qza --p-f-primer GTGYCAGCMGCCGCGGTAA --p-r-primer GGACTACNVGGGTWTCTAAT --p-min-length 100 --p-max-length 400 --o-reads Zymo_v2.extract.qza

and again only 2-3 features are identified. I have to add that only 2 references only remained in my reference, namely Pseudomonas aeruginosa and Sacharomyces cerevisiae. And now I'm running out of ideas and have the following questions

Am I missing something in the clustering step?
Is the import of the reference correct
Is applying qiime feature-classifier extract-reads correct and if so should I use --p-trunc-len option with 150 as value? Even so the tutorial does not recommend it.
How would I proceed with creating a taxonomy classifier from this reference?
How would I create a X% reference-sequence similarity dataset? (as mention in the tutorials)
What is the general sequencing depth of a 16s library? (as shown above 500,000 fragments seems way to much)

Thanks for you help
mathias

jwdebelius · November 27, 2018, 9:57am

Hi @mat.lesche

As I'm reading through your post there are a couple of things that concern me. So, I want to start by suggesting a pipeline, talking you through some issues that came up, answering your questions, and then asking a few of my own.
This is a rather long post, and I'm not sure if Im sorry or just recommend coffee. Maybe both?

Recommended Pipeline

Okay, so first, I think your pipeline is a little bit extreme. I'd suggest one of the following:

ASV DADA2 : pass your single fragment reads into DADA2 according to the moving pictures tutorial. Then, do taxonomic classification on the reference seqs using your favorite classifier and build your tree with either fast tree or fragment insertion.
ASV with Deblur : Do quality filtering, set a trim length, and run Deblur according to the tutorial. Then, do taxonomic classification on the reference seqs using your favorite classifier and build your tree with either fast tree or fragment insertion.
OTU Picking : Use vsearch according to the tutorial pipeline

Most importantly, don't mix and match between the three. Each has a recommended flow, and doing multiple things doesn't buy you much other than additional compute.

The red flag

Okay, so red flag: if you're running a 16s experiment, you should not see S. cerevisiae or Cryptococcus neoformans. Both are fungi and shouldnt be picked up by your 16s V4 primers. If you want to do fungi, you need either ITS or 18S primers.

Looking at the rest of your mock community, it looks like you should see about 12 organism. You might have lower resolution on the E. coli and maybe Salmonella because taxonomy and phylogeny rarely match and its a source of perpetual frustration for many people.

Okay, my red flag warning addressed, lets look at your questions...

Possibly, but more conceptual and less bioinformatic. I'm wondering why you're trying to cluster ASVs. Most pipelines either use ASVs or OTUs, as they're essentially addressing the same issue: overcoming sequencing noise/error. The whole history of OTUs vs ASVs and when you should use what is its own topic.

So, my recommendation here would be to either cluster closed reference or use the ASV results. I tend to go with ASVs myself, and have switched over almost entirely. (Unless Im running a meta analysis on different hypervariable regions. Then I always used closed reference because nothing else gives comparable results.)

This is a second step that's a bit weird to me. AFAIK, Dada2 has its own quality filtering built in, so passing quality filtered reads doesn't buy you anything. If you're running Deblur, you need to do quality filtering, and trimming. But, that's another algorithm.

It looks like your import is fine.

The length of your reference for clustering doesn't matter as much. A lot of people cluster against full length greengenes or silva sequences. However, clustering against sequences that are shorter is going to result in clusters that don't match. Fragment length matters in terms of clustering and denosising patterns, so that may explain some of your failure to cluster. So, for your 140-150 bp reads, I would definately build my classifier at 150 bp.

Once you have the reference, I'd follow the tutorial for creating classifiers

I would recommend not doing this, and just using your ASVs.

This is a function of sequencing 5 samples on your Miseq(?) run. Ive seen anywhere between 100 and 600 samples run on a 16s run, which obviously affects your depth. My personal opinion is that you get more bang for your buck with more samples at shallower depths. Sure, if you only have 10,000 seqs/sample in 100 samples, you don't have power to detect something that's present at 1/100,000. But, if you have 10 samples at 100,000 you don't have power to detect it either. However, if I have 10,000 samples and I have something that's present in 1/10,000 and it shows up in 20 samples, I can do some statistics with it. Where as if I have something thats present at 1/10,000 in my study with 100,000 seqs/sample in 10 samples, then I dont have statistical power to do something with 2 samples. Particularly after I pay my FDR penalty.

Sorry for the info dump, and hopefully some of this is helpful.

Best,
Justine

mat.lesche · November 27, 2018, 4:25pm

Dear Justine,

Thanks a lot for taking the time to write such an extensive reply. It was really helpful for me and you pointed me to some conceptual mis. It's the first time that I run the software and 16s in general. I misunderstood the tutorial and especially the Denoise/Cluster step. For some reason, I did not grasp that the overview/steps for the clustering comes after vsearch dereplicate-sequences and that one should not mix denoise with clustering. I assumed the clustering example is referring to the alignment step which is followed by phylogeny ...
For now I stick to denoise with dada2 and will now try to do create the classifier and the phylogeny step with the representive sequences. I have just a question regarding the classifier step.
The tutorial for creating classifiers mentioned a taxonomy file (e.g. 85_otu_taxonomy.txt). I looked a the file and understand the structure but I'm wondering how I can do this with reference genomes. Is there a page that helps one build or query for a taxonomy of a organism?

Why do refer to 12 organism? The mock community mentions only 10 and 2, as explained by you, of them should not come up (Sachharomyces cerevisiae and Cryptococcus neoformans).

I completely agreed with that. This was a test for us and the depth was most certainly to high. I'm pretty sure we'll end up with around 25,000-50,000 fragments per sample in the future.

Thanks
Mathias

jwdebelius · November 27, 2018, 4:43pm

Hi @mat.lesche,

Welcome to 16S and the forum, then!

So, its entirely possible that I hadn't had enough coffee this morning, and miscounted. I retract my previous statement: you should see 8 organisms in your mock community, not 12.

With regard to denoising, that's not actually done on dereplicated sequences. You want your demultiplexed, but not dereplicated sequences. So, if you're using DADA2, your pipeline should come out more like this:

Demultiplex or load demultiplexed sequences into QIIME
Run DADA2 on single end sequences
Build your classifier
Build your tree
Enjoy your analysis! :qiime2:

Best,
Justine

mat.lesche · November 28, 2018, 10:02am

Hi Justine,

In a way you were correct. I checked the fasta files from the reference and 2 of them have more than one reference, namely Staphylococcus_aureus (1 chromosome reference plus 3 plasmid) and Escherichia_coli (one chromosome and one plasmid reference)

That's the part I'm struggling a bit at the moment. I'm following Feature Classifier Tutorial and I'm at the ** Extract reference reads** and my code is the following

qiime feature-classifier extract-reads --i-sequences Zymo_v2.qza --p-f-primer GTGYCAGCMGCCGCGGTAA --p-r-primer GGACTACNVGGGTWTCTAAT --p-min-length 100 --p-max-length 400 --p-trunc-len 120 --o-reads Zymo_v2.extract.qza

I checked the fasta file inside Zymo_v2.extract.qza artefact and only a single sequence is left, namely the part on the sequence of the Pseudomonas Aeruginosa reference. Now I know that these primers exists on the other references as well because I did "grep" for the forward primer and a "grep" with the reverse complement of the reverse primer

grep -oP "ATTAGA[A|T]ACCC[C|G|T][A|C|G|T]GTAGTCC" *fasta
but somehow qiime feature-classifier extract-reads can extract them.

jwdebelius · November 28, 2018, 10:11am

Hi @mat.lesche,

Would you be willing to share your --i-sequences Zymo_v2.qza artefact to help with trouble shooting.

Did you merge your reference files or reference together in some way? That might explain your lack of communities.

In the mean time, have you tried running classification off a different reference? With something like greengenes or Silva, you should still get a "good enough" approximation of hte community structure. The names may not be identical, but you should still get a good sense of your results. (For example, your E. coli will stop mapping at f. Enterobacteraceae, for instance, because taxonomy and phylogeny don't always relate well.) However, in terms of things that may be more useful long term, a classifier based on greengenes or Silva may be of more use to you in future analyses with real samples.

Best,
Justine

mat.lesche · November 28, 2018, 3:19pm

Thanks for having a look at this

Sure. No problem. I dowloaded it from this link: https://s3.amazonaws.com/zymo- files/BioPool/ZymoBIOMICS.STD.refseq.v2.zip

Here is the qza file: https://sharing.biotec.tu-dresden.de/index.php/s/LK61Pad1OVCAgA6

I used this file and the following code for extracting the V4 16s regions

qiime feature-classifier extract-reads --i-sequences Zymo_v2.qza --p-f-primer GTGYCAGCMGCCGCGGTAA --p-r-primer GGACTACNVGGGTWTCTAAT --p-min-length 100 --p-max-length 400 --o-reads Zymo_v2.extract.qza

After this step only this one remained:

">"Pseudomonas_aeruginosa_complete_genome
TACGAAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCAGCAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTACTGAGCTAGAGTACGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACAGG

If I grep for GTG[C|T]CAGC[A|C]GCCGCGGTAA which is the forward primer and ATTAGA[A|T]ACCC[C|G|T][A|C|G|T]GTAGTCC which is the reverse complement primer of the reverse primer, then I find the sequence too.
But I also find these primer site in the other 7 references. The only difference is that the other 7 reference genomes have more than 1 primer site. For example Bacillus_subtilis_complete_genome has 9 sites for this V4 region. Does feature-classifier extract-reads has an issue when there is more than one site in the reference which matches the primer.

Here are just two examples out of the 9 sites of Bacillus_subtilis_complete_genome

GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGCAGGCGGTTCCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCC

GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGCAGGCGGTTCCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCC

The reference files were created as follows:

download the reference (each genome is a fasta file)
removed the non 16s references (2 were removed)
rename the reference names
do concantenation

cat *.fasta > Zymo_v2.fa

change acgtn to ACGTN in the sequences with a custom script
removed the plasmid references (4 were removed
at the end 8 references remain
run qiime tools import on the Zymo_v2.fa

No not yet. But I can give that a try.

jwdebelius · November 28, 2018, 3:30pm

Hi @mat.lesche,

Did you, by any chance, use the sequences from the genomes folder? Could you please try the steps again using the fasta files from the ssRNA folder?

Best,
Justine

mat.lesche · November 28, 2018, 3:35pm

I did use the genomes and I can try it with the ssRNA folder. But the ssRNA folder contains for the single organism multiple 16s entries. For example Staphylococcus_aureus_16S has 6 entries. If I go through the steps for creating a reference, then I have Staphylococcus_aureus 6 times in it. I though one should start with the genome for the building the reference

jwdebelius · November 28, 2018, 3:56pm

It's common for references to have multiple 16s rRNA sequences that map to the same organism with small variants. So, that you have 6 reference sequences for S. aureus isn't entirely unexpected, nor is it a bad thing.

...I've also just realised that the formatting on the sequence names may not be correct. I tend to work with greengenes which are formatted as

k__kingdom;p__phylum;c__class;o__order;f__family;g__genus;s__species

(i.e. k__Bacteria;p__Firmictues;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;s__aureus)

It may be that the classifier works better with this information, Ive never tried building a classifier where the data wasn't formatted with way.

Best,
Justine

mat.lesche · November 28, 2018, 4:15pm

That worked: Zymo16s.extract.qza (10.5 KB)

But it doesn't really explain why the program wasn't able to extract the same sequences from the genome because they are in there too. My taxonomy looks like this

Staphylococcus_aureus_chromosome k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Staphylococcaceae; g__Staphylococcus; s__aureus

jwdebelius · November 28, 2018, 4:30pm

I think you can try to drop the Staphylococcus_aureus_chromosome header, or just classify your taxonomy and see how your chart looks. If it comes out with weird descriptions, I'd go back and pull those off and re-extract. Because you shouldn't have a lost of ASVs, classification should (I think) be relatively quick.

Best,
Justine

mat.lesche · November 28, 2018, 5:10pm

Hi Justine

Thanks so much. You were such a great help!

I finally got a taxonomy which actually shows the organisms from the Zymo Kit. It's not in the provided theoretical composition but all eight organism are there.
ref-taxonomy.qzv (1.2 MB)
bfx1037-taxa-bar-plots.qzv (327.5 KB)

Using only the 16s sequences for constructing the references did the trick. That was a great suggestion. Before I tried to create the taxonomy based on the genome reference and only a single organism came up. This part of qiime (building the reference sequences and taxonomy) seems to have an issue when working with the whole reference, instead of the short V4 regions.
Now I much more excited to work on this tomorrow. Of course this is just a mock up test but it already shows, that the prep works and the much more interesting part will be real experiment and using references like green genes.

Thanks again and I'm pretty sure I'll have more questions

system · December 29, 2018, 11:10pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.