Hi all,
Sorry in advance for the long post.
I wanted to show you my workflow with a mock community for:
- Testing a classifier
- Community assessment and quality control
And ask a few questions along the way.
First question before even starting:
Should mock community quality control be done separately from the rest of the dataset? As in subset all the controls and work only on those.
For this workflow I worked only on the controls. My control dataset consists of:
2 PCR ready community standards from Zymo
2 extraction positives (Zymo community standards used to test extraction methods)
4 environmental controls
4 extraction blanks
4 PCR negatives
I started by testing my classifier (previously built with RESCRIPt). Zymo provides the sequences of the microbial standards, one can just download the fasta files.
# Merge mock sequences into a single fasta file
cat *.fasta > mock-seqs.fasta
# Import mock fasta sequences
qiime tools import --input-path mock-seqs.fasta --output-path mock-seqs.qza --type 'FeatureData[Sequence]'
# extract primer region reads
qiime feature-classifier extract-reads --i-sequences mock-seqs.qza --p-f-primer GTGYCAGCMGCCGCGGTAA --p-r-primer GGACTACNVGGGTWTCTAAT --p-n-jobs 2 --p-read-orientation 'forward' --o-reads mock-seqs-515F-806R.q
# Taxonomy assignment
qiime feature-classifier classify-sklearn --i-classifier 515F-806R-classifier.qza --i-reads mock-seqs-515F-806R.qza --o-classification mock-taxonomy.qza
# Taxonomy visualisation
qiime metadata tabulate --m-input-file mock-taxonomy.qza --o-visualization mock-taxonomy.qzv
mock-taxonomy.qzv (1.2 MB)
So I am able to identify all taxa present in the mock community but only 3 of them at species level.
I have tried all the taxonomy classification plugins available in qiime and all have returned similar results. I even tried to disable --p-confidence this way I am able to recover species level classification but all are wrongly classified.
I guess is normal not to have confidence at species level?
Then I proceeded to analyse my control samples.
I usually run cutadapt on R (personal preference) so you will see the script "jumping" in and out of qiime.
Same happens for decontam.
I can share the R scripts if anyone wants it.
# import sequences
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path seqs --input-format CasavaOneEightSingleLanePerSampleDirFmt --output-path demux-paired-end.qza
# visualyse quality plots
qiime demux summarize --i-data demux-paired-end.qza --o-visualization demux-quality-plots.qzv
# exit qiime
conda deactivate
# call R execute cutadapt
Rscript cutadapt.R
# activate qiime again
conda activate qiime2-2022.2
# Import primer trimmed reads
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path seqs/cutadapt --input-format CasavaOneEightSingleLanePerSampleDirFmt --output-path primer-trimmed-demux.qza
# Generate quality plots (2nd time)
qiime demux summarize --i-data primer-trimmed-demux.qza --o-visualization primer-trimed-demux-quality-plots.qzv
# denoise
qiime dada2 denoise-paired --i-demultiplexed-seqs primer-trimmed-demux.qza --p-trunc-len-f 253 --p-trunc-len-r 185 --p-trunc-q 2 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza
# Taxonomy assignment
qiime feature-classifier classify-sklearn --i-classifier 515F-806R-classifier.qza --i-reads rep-seqs.qza --o-classification taxonomy.qza
# exit qiime
conda deactivate
# call R execute decontam
Rscript decontam.R
# activate qiime
conda activate qiime2-2022.2
#Import biom table from R to qiime
qiime tools import --input-path table-nocontam.biom --type 'FeatureTable[Frequency]' --input-format BIOMV100Format --output-path table-nocontam.qza
# Taxonomy based filtering
qiime taxa filter-table --i-table table-nocontam.qza --i-taxonomy taxonomy.qza --p-exclude mitochondria,chloroplast,Unassigned,Vertebrata,Eukaryota --p-include p_ --o-filtered-table table-taxa-filter.qza
# Filter unique features
qiime feature-table filter-features --i-table table-taxa-filter.qza --p-min-samples 2 --o-filtered-table table-taxa-filter-no_singles.qza
# Abundance and prevalence filtering
qiime feature-table filter-features-conditionally --i-table table-taxa-filter-no_singles.qza --p-abundance 0.0001 --p-prevalence 0.05 --o-filtered-table filtered-table.qza
I have also tested this without running cutadapt and using qiime dada2 denoise-paired --p-trim-left-r 20 --p-trim left-f 20.
Next I did a quality control of the resulting community standard controls using:
qiime quality-control evaluate-composition
#Import expected taxonomic composition of mock samples
biom convert -i mock-expected.tsv -o mock-expected.biom --table-type="OTU table" --to-hdf5
qiime tools import --input-path mock-expected.biom --type 'FeatureTable[RelativeFrequency]' --input-format BIOMV210Format --output-path mock-expected.qza
#Get the observed taxonomic composition of mock samples (subset community standard samples)
qiime feature-table filter-samples --i-table filtered-table.qza --m-metadata-file meta.tsv --p-where "type='positive'" --p-no-exclude-ids --o-filtered-table mock-observed.qza
# Inspect the taxonomic composition of mock samples at ASV level (extract bar plots only for positive controls)
qiime taxa barplot --i-table mock-observed.qza --i-taxonomy taxonomy.qza --m-metadata-file buzzard_meta.tsv --o-visualization mock-observed-bar-plot.qzv
# Agglomerate taxa at species level
qiime taxa collapse --i-table mock-observed.qza --i-taxonomy taxonomy.qza --p-level 7 --o-collapsed-table mock-observed-l7.qza
# Convert sequence counts into relative abundances
qiime feature-table relative-frequency --i-table mock-observed-l7.qza --o-relative-frequency-table mock-observed-l7-rel.qza
# Compare observed and expected taxonomic composition of mock samples
qiime quality-control evaluate-composition --i-expected-features mock-expected.qza --i-observed-features mock-observed-l7-rel.qza --o-visualization mock-comparison.qzv
mock-expected.tsv I think should be formated as follows (composition is provided by Zymo):
mock_expected.tsv (1.2 KB)
Result as follows:
mock-comparison-cutadapt.qzv (468.9 KB)
mock-comparison-no-cutadapt.qzv (468.9 KB)
A couple of questions now:
1. Is this a sound workflow? Should I do anything more to improve it?
2. Looking at the results am I right to assume that I am able to capture most of the taxa (not at species level) but the composition seems a bit lower than expected?
3. 3-4 false positives is this ok, is it a lot? Not sure how to interpret these results. Are the numbers in front of the false positives abundances?
4. Not sure what to think of this "false positives underclassifications" its almost exactly the same as the false negatives
5. I am guessing that the number of false negatives is because I could not classify the samples at species level and the algorithm takes in consideration the full classification (I read this somewhere in the forum, cannot find the post
6. In this case using --p-trim-left
instead of using cutadapt seems to yield better results. Not only on the mock-comparison but also in terms of final amount of reads left. Based on these results can I say that I should use one instead of the other?
Sorry again for the long post.
Cheers,
Hugo