"classifier does not support confidence values" with non-16S/18S data

Chris_Barrington · January 24, 2019, 1:07pm

Hi,

I am analysing a yeast amplicon-type experiment and was hoping that I would be able to use QIIME2 for a substantial part (if not all) of the process. I have multiple samples taken over a time course and multiple replicates for each (24 FastQ files in total) that have already been demultiplexed by our sequencing service.

However, the data are not 16S/18S derived. The libraries were constructed following PCR amplification of a unique barcode on a vector to identify yeast strains - the presence of which I want to assess in each treatment condition and time point. The FastQ reads have a structure like: .*[PCR PRIMER][BARCODE][VECTOR]. All of the barcodes are 20bp but some of the reads have a few bp before the PCR primer sequence; the barcodes are not in exactly the same position for all reads.

My first step was to remove the PCR primer and vector sequence in the FastQ using FastX toolkit, leaving me with a FastQ file where every sequence is 20bp. I suspect this may not have been the best choice since the error correction in the dada2 step would be able to use that information? I suspect I should remove these parts of the reads in QIIME2 perhaps with the cutadapt plugin?

Using a subset (first 25,000 reads) of the pretrimmed data I have used QIIME2 to get what I think/hope are correct results, using the .qzv files to check for sensible output. The stage I am stuck on it the classification step because I need to train my own classifier and that step seems to be failing. I have a FastA file for each barcode and generated a taxonomy table for each barcode where the species label was the name of the strain identified by the barcode.

30405
ATACTGACAGCACGCATGGC
30402
TATGGCACGGCAGACATTCC
30403
AGGCATACTACACAGATTCC

30405 k__Fungi; p__Ascomycota; c__Saccharomycetes; o__Saccharomycetales; f__Saccharomycetaceae; g__Saccharomyces; s__YAL002W
30402 k__Fungi; p__Ascomycota; c__Saccharomycetes; o__Saccharomycetales; f__Saccharomycetaceae; g__Saccharomyces; s__YAL004W
30403 k__Fungi; p__Ascomycota; c__Saccharomycetes; o__Saccharomycetales; f__Saccharomycetaceae; g__Saccharomyces; s__YAL005C

And I generate the classifier, but the classifier fails when I test it:

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads dada2_denoise-single/representative_sequences.qza
--i-reference-taxonomy ref-taxonomy.qza
--o-classifier classifier.qza

qiime feature-classifier classify-sklearn
--i-classifier classifier.qza
--i-reads dada2_denoise-single/representative_sequences.qza
--o-classification taxonomy.qza

Plugin error from feature-classifier:

this classifier does not support confidence values

Debug info has been saved to /tmp/qiime2-q2cli-err-dp5zk1s1.log

I am using the conda environment qiime2-2018.11 to run QIIME2 and the commands that I used up until this point are below. I have followed the Moving Pictures tutorial with the example data and am now trying to use that as a framework with my data. The visualisation steps seem to run OK and samples cluster somewhat (though I have not run the analysis on my complete FastQ files) - I have excluded these from below for clarity.

qiime tools import
--type 'SampleData[SequencesWithQuality]'
--input-path sample-manifest.csv
--output-path demux.qza
--input-format SingleEndFastqManifestPhred33

qiime dada2 denoise-single
--i-demultiplexed-seqs demux.qza
--p-trim-left 0
--p-trunc-len 0
--p-n-threads 24
--p-no-hashed-feature-ids
--output-dir dada2_denoise-single
qiime metadata tabulate
--m-input-file dada2_denoise-single/denoising_stats.qza
--o-visualization dada2_denoise-single/denoising_stats.qzv

make visual summaries of the count data

qiime feature-table summarize
--i-table dada2_denoise-single/table.qza
--o-visualization dada2_denoise-single/table.qzv
--m-sample-metadata-file sample-metadata.tsv
qiime feature-table tabulate-seqs
--i-data dada2_denoise-single/representative_sequences.qza
--o-visualization dada2_denoise-single/representative_sequences.qzv

qiime tools import
--type 'FeatureData[Sequence]'
--input-path barcodes.fa
--output-path barcodes.qza

qiime tools import
--type 'FeatureData[Taxonomy]'
--input-format HeaderlessTSVTaxonomyFormat
--input-path taxonomy.txt
--output-path ref-taxonomy.qza

I am obviously new to using QIIME2 and believe that QIIME2 would be suitable for this type of analysis - please correct me if I am wrong. If there is any additional information that would be helpful, please let me know. Assuming that I have used the import/dada2 steps correctly, please can you help me understand how to create a classifier for my analysis?

Thanks,

Chris

Nicholas_Bokulich · January 24, 2019, 2:21pm

Hi @Chris_Barrington,
This is not an issue specific to non-16S/18S data — it is almost always a formatting issue (we only encounter it when someone is compiling their own custom reference database). What marker gene are you looking at, by the way?

Most common issue: the taxonomies do not have equal numbers of ranks. You should make sure that each entry has the same number of ranks; if anything is missing species level, for example, just add an empty rank ('s__'). Punctuation errors (e.g., missing semicolons) will also cause this issue.

Let me know if that helps you spot the problem! If not, we can help dig a little further...

Chris_Barrington · January 24, 2019, 2:33pm

OK thanks, I suspected it was to do with the taxonomy since the rest (seemingly) worked without issues - just parameter changes. I will double check that table.

It isn't a marker gene that we are looking for; the barcodes are artificial 20-mer sequences that are part of a vector that is transformed into yeast. So rather than looking for species using gene sequence I want to track prevalence of the vector in the population by the unique barcode.

I'll be sure to check back if I have further problems.

Nicholas_Bokulich · January 24, 2019, 2:37pm

Oh very cool. You probably want to find exact matches, then (unless if you expect that the barcodes are distant enough from one another that you can trace back any barcodes with errors). In that case, you may want to just do one of two things:

Use the vsearch-LCA classifier (classify-consensus-vsearch). Set maxaccepts to 1, and set perc-identity to 99 or 100.
Use classify-sklearn but disable the confidence scores (set --p-confidence -1). I would not normally recommend this — it turns this into a top-match-finding algorithm — but it makes sense in your case and would actually bypass the need to fix the taxonomy.

Chris_Barrington · January 25, 2019, 2:00pm

The problem was my extracted reads FeatureData[Sequence]. I ran extract-reads on the artefact of my imported barcodes FastA file. When I ran extract-reads I had included primer sequences, but these were not in the barcodes FastA which (I guess) caused the FeatureData artefact to be empty. So I was trying to build a database against an empty feature set. When I realised that the rest of the analyses worked. I am now rerunning the analysis on a larger subset of my data to get a better idea of the output before analysing the full dataset.

In my taxonomy table, I had every entry classified to the same (k/p/c/o/f/g) without missing values. In the view.qiime2 server I can group by the 7 taxa levels. Can I include an 8th level so that I can describe strain in my taxonomy classification? At the moment, I am including the strain information rather than species. (this only just occurred to me as I was writing).

The classifier seems to do a good job of matching the barcodes to strains - they are not always exact matches but my first test the barcodes with 90% likelihoods look to be the correct matches based on manual inspection. At the moment I am leaning towards using the classifier rather than exact matches, the barcodes seem different enough and the sequenced tags do sometimes have miscalled bases or indels which the classifier seems to handle nicely (on my initial test subset).

I do still have a few more questions, not related to this thread though, so may be posting more questions soon but this issue seems to be resolved. Thanks for your help and quick responses, it is very much appreciated.

Chris

Nicholas_Bokulich · January 25, 2019, 2:07pm

Thanks for confirming! Definitely makes sense.

Yes! You can include as many levels as you like, just so long as they have an equal number of levels.

Definitely a good argument for using the naive Bayes classifier rather than exact matching with vsearch-LCA. You could also do an inexact match + consensus assignment with vsearch-LCA, but it the naive Bayes is working well you might as well stick with it.

Sounds like a very interesting analysis. If your methodology is something that you think others in your field would find useful, please feel free to post a "community tutorial" on the forum at some point. We are always keen to get experts in different areas to write tutorials describing diverse uses of QIIME 2.

Chris_Barrington · January 29, 2019, 11:44am

Hi @Nicholas_Bokulich

My initial test subset of data was identifying ~10 or so features following dada2 denoise-single. I did not think this was too much of a problem since I was only looking at a few thousand reads. However, when I ran the dada2 step on the complete dataset, the representative-sequences artefact contains only 33 features - we expect to see thousands of strains/barcodes/features identified in some, if not all, samples.

My first suspicion was that the reads were being grouped together despite being distinct. In the paper, there is an OMEGA_A parameter that splits partitions - perhaps this threshold is set too low for this type of data and my barcodes are being kept in partitions when they should be split. I could not find a parameter in the dada2 denoise-single --help documentation that could tune this parameter, however.

When I saw this problem I was reminded of your previous recommendation to find exact matches only. I tried this using qiime feature-classifier classify-consensus-vsearch --p-maxaccepts 1 --p-perc-identity 0.99 --i-reference-reads barcodes.qza --i-reference-taxonomy barcodes_taxonomy.qza --i-query dada2_denoise-single/representative_sequences.qza but every sequence/feature is unassigned with 1.0 confidence Maybe this also points to the representative-sequences being representative of too-diverse a partition and no longer being a match to a reference barcode.

I am currently running the deblur denoise-other method to see if that yields a more suitable feature table.

So I think my issue is that the denoise step (using dada2) is grouping my data into too few partitions and not representing the expected diversity in the data. If there are any other suggestions as to the cause or further diagnostics that may help or general advice, I would be very appreciative.

Thanks,

Chris

Nicholas_Bokulich · January 29, 2019, 1:44pm

Hi @Chris_Barrington,

I agree. If you are expecting thousands of unique barcodes to be present but dada2 yields only 33 features, then you've got a problem with your denoiser.

That parameter must not be exposed in the q2-dada2 pipeline. You could run dada2 directly in R, where you will have access to the full parameter suite, but will need to rebuild the pipeline yourself (not a big deal, the R vignette/tutorial documentation is very good).

The "confidence" scores reported by that classifier are a misnomer... they are actually the fraction of hits that match the reported classification. So 1.0 (==100%) is what you would expect when maxaccepts==1. So nothing wrong with that classifier, it is working as expected.

What do you mean ~10 features? You say you expected thousands of strains/barcodes/features, so I am not sure what you mean when you say 10 features here.

Chris_Barrington · January 29, 2019, 4:24pm

Sorry, I meant that I was not concerned that I had very few features after denoising when I was using my test subset reads. When I ran the analysis on the full dataset and still only saw 10s of features I realised there was a problem.

OK. So the classifier is saying that there are no exact matches in my reference barcodes qza to the representative sequences from dada2. But if I grep the reference barcodes fasta that is imported into the qza I can see some representative sequences are exact matches.

I wonder whether the denoising algorithm is not performing as expected because the data are not 16S-type data. Maybe the short sequences do not lend themselves to this approach. Perhaps using the basic quality-score-based filtering steps hinted in Moving Pictures. This might be worth trying anyway, just to get a better idea of the diversity I can expect in my data.

Thanks for the pointer to the R package. I'll take a look at that and see if I can get more partitions that way. I notice that the table.qzv from dada2 shows that the top features look to be disproportionately represented. If I have interpreted the table correctly, this shows that the partition represented by ATACTGCTGTCGATTCGATA (which is not an exact barcode match) contains 17m reads. Naively, this suggests to me that the partitions are grouping too-dissimilar reads together. The next most frequent feature is represented by a recognised barcode - but I think this only means that that sequence is most frequent in the partition.

Thank you for you help,

Chris

table.qzv (312.1 KB)

Nicholas_Bokulich · January 29, 2019, 4:47pm

Thanks for clarifying.

Oh I missed the part above about these being unclassified... that is really surprising unless if your query sequences contain additional sequence beyond the actual barcode sequence you are aligning. If so, you can use q2-cutadapt to trim to the exact barcode region.

The denoisers are not 16S-specific, but since you are looking at an artificial sequence I agree the marker type and sequence length could be issues. @benjjneb do you have any thoughts about this? Chris is looking at artificial 20-mer sequences.

Yes, basic quality-score filtering followed by OTU clustering may be the way to go. since you are looking at artificial barcodes and know which barcodes you used, you may even be able to devise a custom approach... e.g., if all barcodes have 2+ dissimilar bases from one another, you may be able to correct erroneous barcodes.

Another approach that is wholly within QIIME 2:

use quality score filtering
use qiime vsearch dereplicate to dereplicate your sequences and make a feature table
use qiime quality-control exclude-seqs to eliminate any sequences that do not match your reference sequences. Since errors would be expected to be random and sporadic, it may be possible to just toss erroneous sequences instead of attempting to denoise or cluster.

dada2 attempts to perform error correction so I agree — either this is correcting to an erroneous sequence or else sequencing/PCR error is for some reason causing that sequence to be overrepresented. It is worth grepping the raw fastq just to confirm.

benjjneb · January 29, 2019, 5:37pm

You can use DADA2 on this type of data (see the TubSeq paper for an example that also used 20-mer barcodes), but a bit of caution is in order and sometimes non-default parameters might work better than the defaults. I believe that one of the important things that worked better for those folks was to use a control sample with limited diversity to learn the error model from, as hyper-diverse datasets of random barcodes can lead to over-estimation of the error rates.

Nicholas_Bokulich · January 29, 2019, 5:48pm

Thanks @benjjneb! Indeed, the online methods of the tubaseq paper detail how they tweaked dada2 parameters to denoise their 20mer barcodes. @Chris_Barrington these include parameters that are not exposed in q2-dada2 so you will need to use dada2 directly in R if you want to follow that approach.

Chris_Barrington · January 31, 2019, 10:03am

@benjjneb @Nicholas_Bokulich

Thank you both for your advice; I am very appreciative for experts sparing their time to help people like me on this type of active forum. I will look into all of these suggestions and hopefully get some good results!

Chris

system · March 3, 2019, 4:03pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.