Issues with Classifiers in QIIME 2 - Unusual Assignments Over 99%

Hello everyone,
My name is JV Wyatt and I am new to using QIIME 2. I am following the forum instructions and here is the relevant information about my issue:

QIIME 2 Version: 2023.9.1 (Google Colab)
Python Version: 3.8.15
DADA2 Code:
!qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end-demux.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 150 \
--p-trunc-len-r 150 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats denoising-stats.qza \
--verbose

Issue: I am seeing anomalies in the taxonomy bar plots with four different classifiers. With the first three, roughly 99% of the assignments are just Bacteria; with the last one, roughly 99% are Unassigned.
The assignments from the QIIME 2 classifiers only reach level 1, while the assignments from the non-saline soil classifier reach up to level 7 (species level). Why?

Two of these classifiers are from QIIME 2 resources:

Classifiers Used:

  • Silva 138 99% OTUs from the 515F/806R region sequences.
  • Silva 138 99% OTUs full-length sequences.

The other two are from the following page: [Silva 138.1 taxonomy classifiers for use with QIIME 2 q2-feature-classifier]

  • Non-saline soil classifier (515f-806r)

  • Non-saline soil classifier (full-length)

I would appreciate any help or suggestions on how to resolve this issue.

Thank you in advance!




Hi @JvWyatt,

If you search the forum for "mixed orientation taxonomy" you'll see that the most common reason for poor taxonomic assignment is reads oriented in the opposite direction from the reference database. That is, when using feature-classifier classify-sklearn, the reads must be in the same orientation as the reference database.

As it appears that all of your reads are in the wrong orientation, you can follow the 2nd recommendation here and use RESCRIPt to re-orient the FASTA output from DADA2.
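If you go that route, a minimal sketch with RESCRIPt would look like the following (this assumes RESCRIPt is available in your environment; the output names are placeholders):

!qiime rescript orient-seqs \
--i-sequences rep-seqs.qza \
--i-reference-sequences silva-138-99-seqs-515-806.qza \
--o-oriented-seqs rep-seqs-oriented.qza \
--o-unmatched-seqs rep-seqs-unmatched.qza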

Alternatively, you can re-orient your fastqs by flipping how you import your R1 and R2 reads. That is, import your R2 reads as your forward reads and your R1 reads as your reverse reads, then re-run DADA2 and your taxonomy assignment. This has the effect of reverse complementing your reads, putting them in the 'correct' orientation for the classifier.
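For example, with the PairedEndFastqManifestPhred33V2 format this just means swapping the two filepath columns in your manifest (the sample IDs and paths below are hypothetical):

sample-id	forward-absolute-filepath	reverse-absolute-filepath
sample-1	/data/sample-1_R2.fastq.gz	/data/sample-1_R1.fastq.gz
sample-2	/data/sample-2_R2.fastq.gz	/data/sample-2_R1.fastq.gz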

One quick test you can do prior to running the two approaches above, just to sanity-check your data, is to run feature-classifier classify-consensus-vsearch. This method does not care about sequence orientation and should provide reasonable taxonomy assignments. But be wary of building phylogenies or comparing these sequence data to other projects, as the sequences would appear very different due to the difference in orientation.


I followed the forum's recommendation and performed a quick test using the classify-consensus-vsearch method in QIIME 2. Here is the code I used:

!qiime feature-classifier classify-consensus-vsearch \
--i-query rep-seqs.qza \
--i-reference-reads silva-138-99-seqs-515-806.qza \
--i-reference-taxonomy silva-138-99-tax-515-806.qza \
--p-perc-identity 0.97 \
--p-threads 4 \
--o-classification taxonomyvsearchv4.qza \
--o-search-results search-resultsv4.qza

Attached are images of the results obtained.



Hi @JvWyatt,

Is this 16S rRNA marker gene data? A different marker gene? Or shotgun metagenomics data?

If this is shotgun metagenomics data, then you should install the shotgun / metagenome distribution. You can start with this tutorial.

-Mike


I'm really not sure; I'm new to this. The data come from soil samples that underwent PCR with the 515F-806R 16S rRNA primers. That's why I treated them as amplicons and used the amplicon distribution.


Hmm... I am not sure what the issue is then. It is concerning that you are not obtaining any taxonomic assignment, even with vsearch. Perhaps try vsearch and sklearn using the full-length sequence classifier rather than the 515f-806r one? In case this is not a V4 amplicon and is from another amplicon region (e.g. V3V4 or V4V5), you should still be able to identify the reads.


At the end, I will show the results of running classify-consensus-vsearch with the full-length sequences. But first, I want to show you my workflow to see if I'm missing any steps or doing something wrong. For example, I've seen that something is usually done with the primers, and I'm not sure whether I need to do anything about that.

Workflow:

I imported my data using a manifest.

!qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path data/manifest.tsv \
--output-path paired-end-demux.qza \
--input-format PairedEndFastqManifestPhred33V2

Check the quality.

!qiime demux summarize \
--i-data paired-end-demux.qza \
--o-visualization qualities.qzv
qualities.qzv (309.9 KB)

Filtering with DADA2.

!qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end-demux.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 150 \
--p-trunc-len-r 150 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats denoising-stats.qza \
--verbose

Verbose output:

Running external command line application(s). This may print messages to stdout and/or stderr.

The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada.R --input_directory /tmp/tmpnp9nty94/forward --input_directory_reverse /tmp/tmpnp9nty94/reverse --output_path /tmp/tmpnp9nty94/output.tsv.biom --output_track /tmp/tmpnp9nty94/track.tsv --filtered_directory /tmp/tmpnp9nty94/filt_f --filtered_directory_reverse /tmp/tmpnp9nty94/filt_r --truncation_length 150 --truncation_length_reverse 150 --trim_left 0 --trim_left_reverse 0 --max_expected_errors 2.0 --max_expected_errors_reverse 2.0 --truncation_quality_score 2 --min_overlap 12 --pooling_method independent --chimera_method consensus --min_parental_fold 1.0 --allow_one_off False --num_threads 1 --learn_min_reads 1000000

Warning message:

package ‘optparse’ was built under R version 4.2.3

R version 4.2.2 (2022-10-31)

Loading required package: Rcpp

DADA2: 1.26.0 / Rcpp: 1.0.11 / RcppParallel: 5.1.6

1) Filtering ....................

2) Learning Error Rates

154827000 total bases in 1032180 reads from 17 samples will be used for learning the error rates.

154827000 total bases in 1032180 reads from 17 samples will be used for learning the error rates.

3) Denoise samples ....................
....................

4) Remove chimeras (method = consensus)

5) Report read numbers through the pipeline

6) Write output

Saved FeatureTable[Frequency] to: table.qza

Saved FeatureData[Sequence] to: rep-seqs.qza

Saved SampleData[DADA2Stats] to: denoising-stats.qza

DADA2 output QZVs:
denoising-stats.qzv (1.2 MB)
rep-seqs.qzv (210.9 KB)
table.qzv (402.1 KB)

BLAST link for the most frequent feature:
10803188578946e196b365832c5f8f0

classify-consensus-vsearch with the full-length reference:

!qiime feature-classifier classify-consensus-vsearch \
--i-query rep-seqs.qza \
--i-reference-reads silva-138-99-seqs.qza \
--i-reference-taxonomy silva-138-99-tax.qza \
--p-perc-identity 0.97 \
--p-threads 4 \
--o-classification taxonomyvsearchfull.qza \
--o-search-results search-resultsfull.qza

Results at --p-perc-identity values from 0.97 down to 0.10:

Perc-identity 0.97: all feature IDs are Unassigned.
search-resultsfull97.qzv (1.2 MB)
taxonomyvsearchfull97.qzv (1.2 MB)

Perc-identity 0.95: all feature IDs are Unassigned.
search-resultsfull95.qzv (1.2 MB)
taxonomyvsearchfull95.qzv (1.2 MB)

Perc-identity 0.85: only 1/160 feature IDs are assigned.
search-resultsfull85.qzv (1.2 MB)
taxonomyvsearchfull85.qzv (1.2 MB)

Perc-identity 0.50: only 4/160 feature IDs are assigned.
search-resultsfull50.qzv (1.2 MB)
taxonomyvsearchfull50.qzv (1.2 MB)

Perc-identity 0.10: all feature IDs are assigned, but 99% only reach the d__Bacteria level.
search-resultsfull10.qzv (1.4 MB)
taxonomyvsearchfull10.qzv (1.2 MB)


Hi @JvWyatt,

Thank you for sharing screenshots. Though it'd be best to share the QZVs rather than screenshots, as we can determine the commands you ran via the provenance tab of the QZV.

Anyway, I noticed in your rep-seqs.qzv screenshot that you only have 160 features! Which tells me that you are losing a lot of data through the denoising process. I am guessing the sequences that remain are poor-quality sequences that are driving the poor taxonomy assignments. Can you share the stats QZV file output from DADA2? That will tell us where you are losing sequences.

-Mike


Sure. I edited the previous post, replacing the screenshots with the QZV files.
denoising-stats.qzv (1.2 MB)
rep-seqs.qzv (210.9 KB)
table.qzv (402.1 KB)


Thanks @JvWyatt,

If you look at the denoising-stats.qzv file you'll notice that you are losing many reads due to failed paired-end merges, with only ~10% of your reads merging. This means you need to adjust your --p-trunc-len-* parameters. Based on your quality score plots I'd recommend the following:

--p-trunc-len-f 149
--p-trunc-len-r 149

The issue is likely that there is not enough overlap between the paired ends to merge (DADA2 needs 12 bases of overlap), or that there are too many mismatches in the region of overlap, resulting in unsuccessful merges, or both.

Do you know if these data were generated using the EMP approach? If so, you should be fine with a 2x150 sequencing run, and quality is the issue; you'd need to play around with the truncation parameters.

But I am assuming you will likely be unable to merge the reads, as I do see the PCR primers contained within your reads, matching the patterns GTGYCAGCMGCCGCGGTAA near the 5' end and ATTAGAWACCCBNGTAGTCC (reverse complement) near the 3' end. These data would require a 2x250 sequencing run, see here. I suspect you may be limited to processing your data using only the forward reads. This also means you'd need to run cutadapt to remove the primers prior to analysis.
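To see why: the 515F/806R amplicon is roughly 253 bp, and with both ~20 bp primers still in the reads the span to be covered is roughly 290 bp, so 2x150 reads leave only ~10 bp of overlap, below DADA2's 12-base minimum. A primer-removal sketch with q2-cutadapt might look like this (the --p-front-r value here is the reverse complement of the 3' pattern above; double-check it against your actual primers):

!qiime cutadapt trim-paired \
--i-demultiplexed-sequences paired-end-demux.qza \
--p-front-f GTGYCAGCMGCCGCGGTAA \
--p-front-r GGACTACNVGGGTWTCTAAT \
--p-discard-untrimmed \
--o-trimmed-sequences paired-end-demux-trimmed.qza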

-Mike


I'm not sure if EMP was used. I did PCR with the 515F-806R primers in duplicate, followed by the index PCR.

PCR1a

16s515F_IL:

CACTCTTTCCCTACACGACGCTCTTCCGATCT+GTGCCAGCMGCCGCGGTAA

16s806R_IL:

GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT+GGACTACHVGGGTWTCTAAT

PCR1b

16s515F_IL-N3:

CACTCTTTCCCTACACGACGCTCTTCCGATCTATA+GTGCCAGCMGCCGCGGTAA

16s806R_IL-N3:

GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTACT+GGACTACHVGGGTWTCTAAT

Read as: primer adapter + specific primer sequence.

PCR2

INDEX:

SA5F X SA7R: 96 Combinations.

SB5F X SB7R: 96 Combinations.

I'm not sure if this answers your question


I should have removed that question after I discovered the primer sequences in the read. :slight_smile:

I am confident that the EMP approach was not used, as we can see the primers in the sequence, which is something we would not expect with the EMP approach. Thus, the post I linked to applies: the paired reads are likely not long enough to merge. You may have to proceed with only the forward sequences, using cutadapt to remove the primer.

However, do not forget to try the suggested truncation settings, and perhaps lower the --p-min-overlap from 12 to 6 or so, on the off chance you can get the reads to merge. Again, a similar approach was discussed in the thread I linked.
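As a sketch, that adjusted call might look like the following (the truncation and overlap values are the suggestions above, not validated settings):

!qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end-demux.qza \
--p-trunc-len-f 149 \
--p-trunc-len-r 149 \
--p-min-overlap 6 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats denoising-stats.qza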


Does this mean that, before using the DADA2 plugin, I should use vsearch join-pairs and then process the data with DADA2 as if I were working with single-end samples?

It seems that the join-pairs command is not available in the version of vsearch I have installed. I do have the option to use merge-pairs, but it appears that the --p-min-overlap parameter is not available for the merge-pairs command.

No. Sorry for the confusion. :grimacing: I simply mean to adjust the minimum overlap, i.e. use DADA2's own --p-min-overlap parameter. Joining the reads with vsearch and treating them as single-end reads violates the assumptions of the DADA2 model and is inappropriate. I suggest reading the DADA2 documentation.

The other user switched to using the following approach to join reads and denoise with deblur instead of using DADA2 (which does it all in one go).


Hello, sorry for being away for so many days; I had a loss in the family. Today I was able to try the recommendation of setting --p-min-overlap to 6 in DADA2. The improvement was noticeable: the merge percentage increased from 10% to 65%.
denoising-stats150.qzv (1.2 MB)
rep-seqs150.qzv (1.2 MB)
table150.qzv (674.7 KB)
filtered-table150.qzv (685.2 KB)
Next, I ran a test using feature-classifier classify-consensus-vsearch and these were the results:
search-resultsv150.qzv (2.9 MB)
taxonomyvsearchv150.qzv (1.8 MB)

Much better, so I then used the Silva 138 99% OTUs full-length sequences classifier with scikit-learn 1.4.2; the percentage of Unassigned is now 20%.
taxonomy-full-silva.qzv (721.0 KB)

And with the Silva 138 99% OTUs from 515F/806R region of sequences classifier (scikit-learn 0.24.1), Unassigned is 20% as well.
silva-taxonomy-515f-806r.qzv (864.8 KB)
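For reference, the classify-sklearn call for these runs takes this general form (the classifier artifact name here is the standard QIIME 2 data-resource download, assumed rather than confirmed; swap in the 515F/806R classifier artifact for the second run):

!qiime feature-classifier classify-sklearn \
--i-classifier silva-138-99-nb-classifier.qza \
--i-reads rep-seqs150.qza \
--o-classification taxonomy-full-silva.qza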

Should I still apply the recommendation to use RESCRIPt to reorient the FASTA output from DADA2, or is it not necessary in this case?


Hi @JvWyatt,

First, I'd like to say that I'm so sorry for your loss! I certainly know what that's like. :cry:

Secondly, these are very promising results! I would suggest altering the --p-min-fold-parent-over-abundance flag of DADA2 and setting it to 8 or 16, as outlined here. This should help increase the non-chimeric values, as those are still quite low. A typical run often retains 70-80% non-chimeric reads, though this can drop to ~50-60% on poor runs.
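For example, adding that flag to your current denoise-paired call (16 shown here; 8 is the other suggested value):

!qiime dada2 denoise-paired \
--i-demultiplexed-seqs paired-end-demux.qza \
--p-trunc-len-f 150 \
--p-trunc-len-r 150 \
--p-min-overlap 6 \
--p-min-fold-parent-over-abundance 16 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats denoising-stats.qza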

Also, I think you'd be able to increase your merges if you set the following, as I mentioned earlier:

--p-trunc-len-f 149
--p-trunc-len-r 149

The error rate at that last base of each read is quite high and could affect your merges. I'd also perform a simple sanity check by running dada2 denoise-single, i.e. only using the forward reads. This will give an upper bound on the amount of data you can keep, as you will not be dependent on merging. I've had to concede defeat on a few projects myself, where I only used the forward reads. This is totally fine.
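A minimal sketch of that forward-reads-only sanity check (denoise-single uses just the forward reads when given a paired-end artifact; the output names are placeholders):

!qiime dada2 denoise-single \
--i-demultiplexed-seqs paired-end-demux.qza \
--p-trim-left 0 \
--p-trunc-len 150 \
--o-table table-single.qza \
--o-representative-sequences rep-seqs-single.qza \
--o-denoising-stats denoising-stats-single.qza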

I do not think there is a need to re-orient your sequences, given how many appear to have successfully been classified.


Thank you, I really appreciate it.

Before continuing: I have DADA2 outputs truncated at 149 and 150, with merge percentages of 62% and 65% respectively. My inexperienced interpretation is that I can proceed with 150, but I would like to get your opinion. Do you think I should proceed with 149 instead? I'd also like to know how to reach that conclusion myself, as I want to learn what to focus on.
Trunc 149
denoising-stats149.qzv (1.2 MB)
rep-seqs149.qzv (1.2 MB)
table149.qzv (665.5 KB)

Trunc 150
denoising-stats150.qzv (1.2 MB)
rep-seqs150.qzv (1.2 MB)
table150.qzv (674.7 KB)


Thank you for providing these files. Yeah, you're right, the 150 is better. I just wanted to sanity-check as I've had a few cases where this resulted in worse merges... but each data set is different. :slight_smile:


The percentage of non-chimeric reads improved significantly, growing by 30%. I think the --p-min-fold-parent-over-abundance flag of DADA2 set to 16 performed better than 8.
denoising-stats8.qzv (1.2 MB)
denoising-stats16.qzv (1.2 MB)

Yeah, the 16 is substantially better! For most datasets this has become my default setting. If any bad data get through, they are often removed when I perform other QA/QC steps, like taxonomy-based, abundance-based, and other filtering.
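For example, that kind of post-hoc filtering might look like the following (the thresholds and file names here are illustrative only):

!qiime taxa filter-table \
--i-table table150.qza \
--i-taxonomy taxonomy-full-silva.qza \
--p-exclude mitochondria,chloroplast \
--o-filtered-table table150-no-organelles.qza

!qiime feature-table filter-features \
--i-table table150-no-organelles.qza \
--p-min-frequency 10 \
--p-min-samples 2 \
--o-filtered-table table150-filtered.qza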
