Dear All,
We are trying to establish a 16S library prep in our group and have used the ZymoBIOMICS™ Microbial Community DNA Standard (ZymoBIOMICS Microbial Community Standards | ZYMO RESEARCH) for the test library prep. We are targeting the V4 region, and we chose this kit because we know which species to expect.
I'm following the guidelines for the general workflow (Overview of QIIME 2 Plugin Workflows — QIIME 2 2018.11.0 documentation) and I'm using single-end data because the overlap of the paired-end reads was too short (<= 6 bp). I have already applied the following steps:
- quality filtering (resulting in fragments with a size of 140-150 bp)
- denoising with dada2 denoise-single
We created 5 libraries with about 500,000 fragments each. The denoising step reduces my sequences to 147 unique features across these 5 libraries. That was a bummer, but to be expected given the small region that was amplified.
Next I want to do the clustering with
qiime vsearch cluster-features-closed-reference
and I used the 'both' option for the strand parameter.
My inputs are the files from the denoising step and the reference from the ZymoBIOMICS kit (https://s3.amazonaws.com/zymo-files/BioPool/ZymoBIOMICS.STD.refseq.v2.zip). I imported the reference as a FeatureData[Sequence] artifact as described here: Importing data — QIIME 2 2018.11.0 documentation
However, after running the clustering step, only 2-3 features are matched to these references, and I can't find out why. As a control, I aligned these 147 features to the genomes with the gsnap aligner and got about 30 unique and perfect matches across the 8 reference genomes. The others had 9-15 soft clips. So these 147 sequences exist, but vsearch cluster-features-closed-reference can't seem to find them. So I started to look into the files, the forum, and the tutorial pages and found this page: Training feature classifiers with q2-feature-classifier — QIIME 2 2018.11.0 documentation
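For context on the 'both' strand setting: the idea is that each feature is compared to the references in both forward and reverse-complement orientation. The sketch below illustrates that idea with an exact substring search on toy data. It is not how vsearch works internally (vsearch searches by sequence identity, not exact matches), and all sequences, IDs, and function names are made up for illustration.

```python
# Toy illustration of matching on both strands.
# ASSUMPTION: exact substring search stands in for vsearch's identity-based search.
# All sequences below are made-up placeholders, not the real Zymo references.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a plain A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_exact_hits(features, references):
    """Return {feature_id: [(reference_id, strand), ...]} for exact substring hits."""
    hits = {}
    for fid, fseq in features.items():
        fseq = fseq.upper()
        rc = reverse_complement(fseq)
        matched = []
        for rid, rseq in references.items():
            rseq = rseq.upper()
            if fseq in rseq:
                matched.append((rid, "+"))
            elif rc in rseq:
                matched.append((rid, "-"))
        hits[fid] = matched
    return hits


references = {
    "ref_A": "TTGACCGTAGGCTAACGGATCC",
    "ref_B": "GGGCATTTACGCGATATCCA",
}
features = {
    "feat_1": "CGTAGGCTAACG",  # forward-strand substring of ref_A
    "feat_2": "ATCGCGTAAATG",  # reverse complement of a ref_B substring
    "feat_3": "AAAAAAAAAAAA",  # matches nothing
}

hits = find_exact_hits(features, references)
for fid, matched in hits.items():
    print(fid, matched or "no exact hit")
```

With a forward-only search, feat_2 would be reported as unmatched even though it is present in ref_B on the opposite strand.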
I created a shorter reference by using
qiime feature-classifier extract-reads --i-sequences Zymo_v2.qza --p-f-primer GTGYCAGCMGCCGCGGTAA --p-r-primer GGACTACNVGGGTWTCTAAT --p-min-length 100 --p-max-length 400 --o-reads Zymo_v2.extract.qza
and again only 2-3 features are identified. I should add that only 2 references remained in my extracted reference, namely Pseudomonas aeruginosa and Saccharomyces cerevisiae. Now I'm running out of ideas and have the following questions:
- Am I missing something in the clustering step?
- Is the import of the reference correct?
- Is applying qiime feature-classifier extract-reads correct, and if so, should I use the --p-trunc-len option with a value of 150, even though the tutorial does not recommend it?
- How would I proceed with creating a taxonomy classifier from this reference?
- How would I create an X% reference-sequence similarity dataset (as mentioned in the tutorials)?
- What is the typical sequencing depth of a 16S library? (As shown above, 500,000 fragments seems like way too much.)
Thanks for your help
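Regarding the extract-reads question above, one check that could help is to test whether each reference actually contains both primer sites, and on which strand. The following is a minimal sketch of such a check, under stated assumptions: it does exact IUPAC matching only (stricter than a tool that tolerates mismatches), the function names are hypothetical, and the reference below is a toy sequence built around the two primers from the command above, not a real Zymo sequence.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate codes.
COMP = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement, valid for IUPAC-coded sequences."""
    return seq.upper().translate(COMP)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regex that matches it exactly."""
    return "".join(IUPAC[base] for base in primer.upper())


def locate_amplicon(reference, f_primer, r_primer):
    """Return (strand, insert_length) if both primer sites are found, else None.

    insert_length excludes the primers themselves. Exact matching only.
    """
    pattern = re.compile(
        primer_regex(f_primer) + "([ACGT]*?)" + primer_regex(revcomp(r_primer))
    )
    for strand, seq in (("+", reference.upper()), ("-", revcomp(reference))):
        match = pattern.search(seq)
        if match:
            return strand, len(match.group(1))
    return None


F_PRIMER = "GTGYCAGCMGCCGCGGTAA"
R_PRIMER = "GGACTACNVGGGTWTCTAAT"

# Toy reference: flank + one concrete forward primer + insert
# + one concrete reverse-complemented reverse primer + flank.
toy_insert = "ACGT" * 5
toy_ref = "TTTT" + "GTGCCAGCAGCCGCGGTAA" + toy_insert + "ATTAGATACCCTGGTAGTCC" + "AAAA"

print(locate_amplicon(toy_ref, F_PRIMER, R_PRIMER))
print(locate_amplicon(revcomp(toy_ref), F_PRIMER, R_PRIMER))
print(locate_amplicon("ACGT" * 20, F_PRIMER, R_PRIMER))
```

If most references only report a hit on the "-" strand, or report an insert length outside the 100-400 window, that could explain why so few of them survive the extraction.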
mathias