Hi,
I am analysing a yeast amplicon-type experiment and was hoping that I would be able to use QIIME2 for a substantial part (if not all) of the process. I have multiple samples taken over a time course and multiple replicates for each (24 FastQ files in total) that have already been demultiplexed by our sequencing service.
However, the data are not 16S/18S derived. The libraries were constructed by PCR amplification of a unique barcode on a vector that identifies each yeast strain, and I want to assess the presence of each strain in every treatment condition and time point. The FastQ reads have a structure like .*[PCR PRIMER][BARCODE][VECTOR]. All of the barcodes are 20 bp, but some reads have a few bp before the PCR primer sequence, so the barcodes are not in exactly the same position in every read.
My first step was to remove the PCR primer and vector sequence from the FastQ files using the FastX toolkit, leaving me with FastQ files where every sequence is 20 bp. I suspect this may not have been the best choice, since the error correction in the dada2 step might have been able to use that information. Should I instead remove these parts of the reads within QIIME2, perhaps with the cutadapt plugin?
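To make the read structure concrete, here is a minimal sketch of the kind of trimming described above, written with plain awk rather than the FastX commands actually used. PRIMER and VECTOR are placeholder sequences (assumptions, not the real ones), the function name is hypothetical, and only exact matches are handled, so reads with sequencing errors in the primer or vector are dropped.

```shell
#!/usr/bin/env bash
# Sketch: keep only the 20 bp barcode that lies between a primer and a
# vector sequence in each FASTQ read. PRIMER and VECTOR are placeholders,
# not the real sequences from this experiment.
PRIMER="ACGTACGT"   # hypothetical PCR primer
VECTOR="TTGGCCAA"   # hypothetical start of the vector sequence

extract_barcodes() {
  # Reads 4-line FASTQ records on stdin, writes trimmed records to stdout.
  awk -v primer="$PRIMER" -v vector="$VECTOR" '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 0 {
      qual = $0
      p = index(seq, primer)         # primer may sit a few bp into the read
      if (p == 0) next               # no exact primer match: drop the read
      start = p + length(primer)     # 1-based position of the barcode
      if (substr(seq, start + 20, length(vector)) != vector) next
      print header
      print substr(seq, start, 20)   # the 20 bp barcode
      print "+"
      print substr(qual, start, 20)  # quality scores for the same bases
    }'
}
```

It would be called as `extract_barcodes < sample.fastq > sample.barcodes.fastq`, where both file names are hypothetical.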
Using a subset (the first 25,000 reads) of the pre-trimmed data, I have run QIIME2 and obtained what I think/hope are correct results, using the .qzv files to check for sensible output. The stage I am stuck on is the classification step: I need to train my own classifier, and that step seems to be failing. I have a FastA file with an entry for each barcode and generated a taxonomy table in which the species label for each barcode is the name of the strain it identifies.
>30405
ATACTGACAGCACGCATGGC
>30402
TATGGCACGGCAGACATTCC
>30403
AGGCATACTACACAGATTCC
30405 k__Fungi; p__Ascomycota; c__Saccharomycetes; o__Saccharomycetales; f__Saccharomycetaceae; g__Saccharomyces; s__YAL002W
30402 k__Fungi; p__Ascomycota; c__Saccharomycetes; o__Saccharomycetales; f__Saccharomycetaceae; g__Saccharomyces; s__YAL004W
30403 k__Fungi; p__Ascomycota; c__Saccharomycetes; o__Saccharomycetales; f__Saccharomycetaceae; g__Saccharomyces; s__YAL005C
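A table like this can be generated from a two-column, tab-separated list of barcode IDs and strain names. The sketch below assumes such a list exists. The function names and file layout are hypothetical, and the second helper simply reports FastA IDs that have no taxonomy row, which is a quick sanity check before importing.

```shell
#!/usr/bin/env bash
# Sketch: build a headerless taxonomy table (ID<TAB>lineage) from a
# tab-separated list of "barcode ID<TAB>strain name" read on stdin.
LINEAGE="k__Fungi; p__Ascomycota; c__Saccharomycetes; o__Saccharomycetales; f__Saccharomycetaceae; g__Saccharomyces"

make_taxonomy() {
  awk -F'\t' -v lineage="$LINEAGE" '
    NF >= 2 { printf "%s\t%s; s__%s\n", $1, lineage, $2 }'
}

missing_ids() {
  # $1 = FASTA file, $2 = taxonomy table.
  # Prints every FASTA record ID that has no row in the taxonomy table.
  awk -F'\t' '
    FILENAME == ARGV[1] { seen[$1] = 1; next }   # first file: taxonomy IDs
    /^>/ {
      id = substr($1, 2)          # drop the leading ">"
      sub(/[ \t].*/, "", id)      # drop any description after the ID
      if (!(id in seen)) print id
    }' "$2" "$1"
}
```

For example, `make_taxonomy < strains.tsv > taxonomy.txt` followed by `missing_ids barcodes.fa taxonomy.txt` should print nothing if every barcode has a label (strains.tsv is a hypothetical file name).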
I can generate the classifier, but it fails when I test it:
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads dada2_denoise-single/representative_sequences.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads dada2_denoise-single/representative_sequences.qza \
  --o-classification taxonomy.qza

Plugin error from feature-classifier:
this classifier does not support confidence values
Debug info has been saved to /tmp/qiime2-q2cli-err-dp5zk1s1.log
I am using the conda environment qiime2-2018.11 to run QIIME2 and the commands that I used up until this point are below. I have followed the Moving Pictures tutorial with the example data and am now trying to use that as a framework with my data. The visualisation steps seem to run OK and samples cluster somewhat (though I have not run the analysis on my complete FastQ files) - I have excluded these from below for clarity.
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path sample-manifest.csv \
  --output-path demux.qza \
  --input-format SingleEndFastqManifestPhred33

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 0 \
  --p-n-threads 24 \
  --p-no-hashed-feature-ids \
  --output-dir dada2_denoise-single
qiime metadata tabulate \
  --m-input-file dada2_denoise-single/denoising_stats.qza \
  --o-visualization dada2_denoise-single/denoising_stats.qzv

# Make visual summaries of the count data
qiime feature-table summarize \
  --i-table dada2_denoise-single/table.qza \
  --o-visualization dada2_denoise-single/table.qzv \
  --m-sample-metadata-file sample-metadata.tsv
qiime feature-table tabulate-seqs \
  --i-data dada2_denoise-single/representative_sequences.qza \
  --o-visualization dada2_denoise-single/representative_sequences.qzv

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path barcodes.fa \
  --output-path barcodes.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path taxonomy.txt \
  --output-path ref-taxonomy.qza
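For completeness, the read subsets and the sample-manifest.csv used in the import step could be produced along these lines. This is a sketch under assumptions: uncompressed files named <sample>.fastq, sample IDs taken from the file names, and the CSV manifest layout (sample-id, absolute-filepath, direction) used by QIIME 2 releases of this period. The function name and directory arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: copy the first N reads of every FASTQ file in a directory and
# print a single-end manifest that lists the subset files.
make_subset_manifest() {
  in_dir=$1    # directory holding the full <sample>.fastq files
  out_dir=$2   # directory that will receive the subset files
  n_reads=$3   # number of reads to keep per sample
  mkdir -p "$out_dir"
  echo "sample-id,absolute-filepath,direction"
  for fq in "$in_dir"/*.fastq; do
    [ -e "$fq" ] || continue                      # no matching files
    sample=$(basename "$fq" .fastq)               # sample ID from file name
    # Each FASTQ record is four lines, so N reads = 4 * N lines.
    head -n $(( n_reads * 4 )) "$fq" > "$out_dir/$sample.fastq"
    echo "$sample,$(cd "$out_dir" && pwd)/$sample.fastq,forward"
  done
}
```

With these assumptions, `make_subset_manifest raw subset 25000 > sample-manifest.csv` would keep the first 25,000 reads (100,000 lines) of each file.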
I am obviously new to using QIIME2 and believe that QIIME2 would be suitable for this type of analysis - please correct me if I am wrong. If there is any additional information that would be helpful, please let me know. Assuming that I have used the import/dada2 steps correctly, please can you help me understand how to create a classifier for my analysis?
Thanks,
Chris