Almost all Reads are unclassified - custom 18S V4 classifier

paperwolf · April 16, 2025, 7:37pm

Hello!

I am using QIIME 2 (2024.5) on the university HPC and having some issues with my 18S classifier. It seems to run normally, but a majority of my reads are unclassified, which is odd because if I BLAST them I am able to get taxonomic classifications. I used the same method for my CO1 classifier and didn't have any issues.

My code:

qiime rescript get-ncbi-data \
  --p-query '(18S[title] AND rRNA[title]) OR 18s ribosomal rna[title] OR 18S[title] NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title] NOT txid2[ORGN] NOT txid2157[ORGN] NOT txid10239[ORGN])' \
  --verbose --p-logging-level INFO \
  --p-n-jobs 5 \
  --o-sequences 18sV4_sequences.qza \
  --o-taxonomy 18sV4_taxonomy.qza

qiime rescript cull-seqs \
  --i-sequences 18sV4_sequences.qza \
  --p-num-degenerates 5 \
  --p-homopolymer-length 12 \
  --o-clean-sequences 18SV4_filtd_seqs.qza

qiime rescript filter-seqs-length \
  --i-sequences 18SV4_filtd_seqs.qza \
  --p-global-min 250 \
  --p-global-max 1600 \
  --o-filtered-seqs 18SV4_leng_filtd_seqs.qza \
  --o-discarded-seqs 18SV4_discarded_seqs.qza

qiime rescript dereplicate --verbose \
  --i-sequences 18SV4_leng_filtd_seqs.qza \
  --i-taxa 18sV4_taxonomy.qza \
  --p-mode 'super' \
  --p-derep-prefix \
  --o-dereplicated-sequences 18SV4_derep_seqs.qza \
  --o-dereplicated-taxa 18SV4_derep_taxa.qza

qiime feature-classifier extract-reads \
  --i-sequences 18SV4_derep_seqs.qza \
  --p-f-primer GCAGTTAAAAAGCTCGTAG \
  --p-r-primer TCCAAGAATTRCACCTCT \
  --o-reads 18SV4_derep_seqs_extracted.qza
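
One sanity check on this step, offered only as a sketch: it can help to see how many reference sequences contain the forward primer at all, since sequences in which the primers cannot be located should not contribute reads to the extracted reference. The helper names iupac_to_regex and count_primer_hits below are made up for illustration. The check assumes an unwrapped FASTA (one sequence per line) and looks for exact matches on the forward strand only, so it is stricter than extract-reads, which tolerates some mismatches. Treat the count as a rough lower bound.

```shell
# Export the reference first (paths are illustrative):
#   qiime tools export --input-path 18SV4_derep_seqs.qza --output-path exported_ref
#   count_primer_hits GCAGTTAAAAAGCTCGTAG exported_ref/dna-sequences.fasta

# Hypothetical helper: expand IUPAC ambiguity codes into regex character classes.
iupac_to_regex() {
  printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | sed \
    -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Hypothetical helper: count sequence lines with an exact forward-strand match.
# Usage: count_primer_hits PRIMER FASTA
count_primer_hits() (
  pattern=$(iupac_to_regex "$1")
  # grep -c exits non-zero when the count is 0, so keep the function's status clean
  grep -v '^>' "$2" | grep -i -c -E "$pattern" || true
)
```

If only a small fraction of the reference contains the primer, the extracted reference is likely much smaller than the dereplicated one, which is worth knowing before training.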

qiime rescript evaluate-fit-classifier \
  --i-sequences 18SV4_derep_seqs_extracted.qza \
  --i-taxonomy 18SV4_derep_taxa.qza \
  --o-classifier 18SV4-classifier.qza \
  --o-evaluation 18SV4-classifier-evaluation.qzv \
  --o-observed-taxonomy 18SV4-refseqs-predicted-taxonomy.qza

qiime feature-classifier classify-sklearn \
  --i-classifier ~/Diet_Classifier/Rescript/18SV4-classifier.qza \
  --i-reads 18sV4_STG_rep-seqs_Truncated4.qza \
  --verbose --p-n-jobs 24 \
  --o-classification 18SV4_STG_taxonomy.qza

18SV4_STG_taxonomy.qzv (1.4 MB)
18SV4-refseqs-predicted-taxonomy.qzv (3.8 MB) - predicted taxonomy file from making the classifier

I saw previously on the forum that mixed orientation of the reads can leave reads unclassified, so I tested it with vsearch, but still half the ASVs are unclassified.
qiime feature-classifier classify-consensus-vsearch \
  --i-reference-reads ~/Diet_Classifier/Rescript/18SV4_derep_seqs_extracted.qza \
  --i-reference-taxonomy ~/Diet_Classifier/Rescript/18SV4_derep_taxa.qza \
  --i-query 18sV4_STG_rep-seqs_Truncated4.qza \
  --verbose \
  --o-search-results 18SV4_vsearch_results.txt \
  --o-classification 18SV4_STG_vsearch_taxonomy.qza
18SV4_STG_vsearch_taxonomy.qzv (1.3 MB)
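
To put a number on the unclassified share from either classifier, one option is to export the taxonomy artifact and count the rows labelled Unassigned. This is only a sketch: the function name unassigned_fraction is made up for illustration, and it assumes the exported taxonomy.tsv is tab-separated with a header row, the taxon string in column 2, and the label Unassigned for unclassified features.

```shell
# Export first (paths are illustrative):
#   qiime tools export --input-path 18SV4_STG_taxonomy.qza --output-path exported_taxonomy
#   unassigned_fraction exported_taxonomy/taxonomy.tsv

# Hypothetical helper: prints "<unassigned> <total> <percent unassigned>".
unassigned_fraction() {
  awk -F'\t' '
    NR == 1 || /^#/ { next }            # skip the header and any comment lines
    { total++ }
    $2 == "Unassigned" { unassigned++ }
    END {
      if (total == 0) { print "0 0 0.0"; exit }
      printf "%d %d %.1f\n", unassigned, total, 100 * unassigned / total
    }' "$1"
}
```

Running it on the sklearn and vsearch exports side by side makes it easier to tell whether the two methods disagree or fail on the same ASVs.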

Any help with this is greatly appreciated!! Thank you!!

SoilRotifer · April 16, 2025, 9:23pm

Hi @paperwolf,

Is there a reason why you simply did not use the SILVA SSU classifier? Remember 16S and 18S are homologues and can even be aligned together in the same alignment. Which is what SILVA does. I'd recommend following parts of the SILVA tutorial, and simply use the SILVA SSU reference database to classify your reads.

To extend what I mentioned above, I am betting good money that all those unclassified reads are actually bacterial 16S rRNA gene sequences. The V4 primer will amplify both 16S and 18S rRNA gene sequences. Because you have no bacterial sequences in your reference database, they will be returned as unclassified.

Even if you only care about 18S rRNA gene sequences, you should always make sure you have outgroup taxa, i.e. bacteria in this case, so you can identify and remove these unwanted taxa.

You can also combine the outputs of this GenBank tutorial with your 18S rRNA gene sequences, and then train that as your classifier. But again, I'd sanity-check with a SILVA classifier first.

-Cheers!

paperwolf · April 17, 2025, 6:37pm

Hi! Thanks for your response!

I actually didn't know there was a Silva classifier I could use. My target is a lot of algae, plants and other larger eukaryotes, so I didn't even think about how 16S/18S are homologues.

When I tried the Silva SSU classifier, I still got over 50% of my ASVs as unclassified with most of the classified ASVs only classified to domain. Any suggestions are greatly appreciated!

Thank you for the help!

colinbrislawn · April 17, 2025, 8:40pm

Good afternoon, paperwolf

'Why are my reads not getting classified' is a classic question in amplicon analysis. It could be lots of things, so there are many options to try.

First, 18SV4_vsearch_results.txt looks okay to me, so this makes me think the fungal part is working. I would like to test and provide evidence for or against Mike's suggestion that lots of bacteria are sneaking in.

To do this, why not run your reads against a 16S-only database? If stuff gets hits, that's the issue!

You can do this with vsearch and a premade 16S database (like silva).
Or by opening up your 18sV4_STG_rep-seqs_Truncated4.qza file and copying some ASVs directly into NCBI blast:

BLAST: Basic Local Alignment Search Tool
select Database > rRNA/ITS
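
The vsearch route could look roughly like the sketch below. The wrapper name check_16s_hits is made up for illustration, and the reference arguments are placeholders for whatever 16S-only sequence and taxonomy artifacts you have on hand; no specific file names are implied.

```shell
# Hypothetical wrapper around classify-consensus-vsearch.
# Usage: check_16s_hits QUERY.qza REF_SEQS.qza REF_TAXONOMY.qza OUT_PREFIX
check_16s_hits() {
  query=$1; ref_seqs=$2; ref_tax=$3; out_prefix=$4
  qiime feature-classifier classify-consensus-vsearch \
    --i-query "$query" \
    --i-reference-reads "$ref_seqs" \
    --i-reference-taxonomy "$ref_tax" \
    --p-strand both \
    --o-search-results "${out_prefix}_search_results.qza" \
    --o-classification "${out_prefix}_taxonomy.qza"
}

# Example call (reference file names are placeholders):
#   check_16s_hits 18sV4_STG_rep-seqs_Truncated4.qza ref-16S-seqs.qza ref-16S-tax.qza 16S_check
```

If a large share of ASVs pick up bacterial assignments here, that supports the 16S carry-over explanation; if almost nothing hits, the problem lies elsewhere.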


paperwolf · April 18, 2025, 2:59pm

Hi colinbrislawn!

Thanks for the suggestion!

I tried this and didn't get any hits but when I just blast it against the nt database I am getting hits. In addition, a couple months ago we blasted all the results while I was still trying to figure out the classifier and we were able to get classifications with no issue.

This makes me think I messed something up when I built the classifier but I am not sure what?

Appreciate the help!!

colinbrislawn · April 21, 2025, 2:07pm

Good morning!

So orientation should not be the issue, vsearch should be searching both directions by default.

Do you need some more help from us?

If so, would you be willing to post some of the .qza files with the sequences in them so we can take a look?

There is also the option to share the data with us privately with a private message.
(Click on a person's name, then click Message)

paperwolf · April 21, 2025, 2:48pm

Additional help is greatly appreciated! Please let me know what additional .qza files you might need. Thank you so much!

18sV4_STG_Dada2_Truncated4.qza (90.5 KB)
18sV4_STG_rep-seqs_Truncated4.qza (116.2 KB)

Nicholas_Bokulich · April 22, 2025, 7:17pm

Hi @paperwolf could you please tell us more about how you generated your data? E.g., which primers, sequencing platform, sample type, etc? I have been blasting a random subset of the sequences that you shared and I am not getting any hits for most of these; for very few I do get hits, but only against bacterial genomes when searching the entire nt database! And those hits are just fragments of the full query. In short: it looks like there is something really wrong with the input sequences; if these were truly 18S sequences I should be getting plenty of high-quality, high-coverage hits against the 18S reference set. Maybe I am just taking a bad grab, but what I am seeing suggests that there is something wrong with the input sequences.

paperwolf · April 22, 2025, 7:56pm

Hi @Nicholas_Bokulich!

That's so odd! These are 18S V4 primers sequenced on a 2x300 Illumina MiSeq. They are colon swabs from dead sea turtles. I have used the same classifier on a different dataset with cloacal swabs/esophageal lavage samples and got decent results from it, which is partially why I was so surprised that this one hasn't been getting any hits. (After looking into it, that dataset gets somewhat better classification but still has 1000 ASVs unclassified.)

Do you think I messed something up in processing the dataset?

Thank you for your help!!

Nicholas_Bokulich · April 22, 2025, 8:55pm

Sounds so interesting!

Yes this aligns with what I am seeing: it is not an issue with the classifier itself, but it looks like the input data might have a problem.

It might not be something you did (though you should of course check your workflow). It could just be that the sequencing run quality was poor, or even something could have happened during library preparation.

paperwolf · April 23, 2025, 3:52pm

Hi @Nicholas_Bokulich!

Do you think there is any way to salvage any data from this? Or is it all just trash? This is all data that was sequenced like two years ago that I need to finish processing and so I just need to make the best out of it as much as possible.

I looked again at my cutadapt and dada2 stats and I realized that the amount of reads merging isn't the highest, so I might try to mess with that and see if it will improve anything?
18sV4_STG_stats-dada2_Truncated4.qzv (1.2 MB)
18sV4_STG_TrimmedReads_demux.qzv (326.4 KB)

Thank you for your help!

Nicholas_Bokulich · April 24, 2025, 6:16pm

I only looked at a snippet of the data and it did not look good. But it might have just been a bad subsample. You would need to check a larger batch to decide, but based on your description above it sounds like the data as they are might not be usable.
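
For checking a larger batch, a reproducible random subset of the representative sequences can be pulled from the exported FASTA and pasted into BLAST. The sketch below is one way to do that with awk alone; the function name sample_fasta and the default seed are arbitrary choices for illustration.

```shell
# Export first (path is illustrative):
#   qiime tools export --input-path 18sV4_STG_rep-seqs_Truncated4.qza --output-path exported_seqs
#   sample_fasta exported_seqs/dna-sequences.fasta 50 > subset_for_blast.fasta

# Hypothetical helper: print N randomly chosen records from a FASTA file.
# Usage: sample_fasta FASTA N [SEED]
sample_fasta() {
  awk -v n="$2" -v seed="${3:-42}" '
    BEGIN { srand(seed) }
    # A header line starts a new record; store the previous one first
    /^>/ { if (id != "") recs[++count] = id "\n" seq; id = $0; seq = ""; next }
    { seq = seq $0 }                      # join wrapped sequence lines
    END {
      if (id != "") recs[++count] = id "\n" seq
      if (n > count) n = count
      # Partial Fisher-Yates shuffle: each pick is drawn without replacement
      for (i = 1; i <= n; i++) {
        j = i + int(rand() * (count - i + 1))
        tmp = recs[i]; recs[i] = recs[j]; recs[j] = tmp
        print recs[i]
      }
    }' "$1"
}
```

Fixing the seed means the same subset can be regenerated later if someone else wants to repeat the BLAST check.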

Indeed, the % merged looks pretty bad. Still, it's a mystery why those that merge have no hits in the database. You might try using single-end reads only to see what happens: you will not have merge issues, and maybe you could recover some useful data from this.
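
A forward-read-only run could be sketched as below. The wrapper name denoise_forward_only is made up for illustration, and the truncation length is left as an argument because it has to be chosen from your own quality plots; no value is suggested here. As far as I know, denoise-single accepts a paired-end demux artifact and simply ignores the reverse reads.

```shell
# Hypothetical wrapper around dada2 denoise-single.
# Usage: denoise_forward_only DEMUX.qza TRUNC_LEN OUT_PREFIX
denoise_forward_only() {
  demux=$1; trunc_len=$2; out_prefix=$3
  qiime dada2 denoise-single \
    --i-demultiplexed-seqs "$demux" \
    --p-trunc-len "$trunc_len" \
    --o-table "${out_prefix}_table.qza" \
    --o-representative-sequences "${out_prefix}_rep-seqs.qza" \
    --o-denoising-stats "${out_prefix}_stats.qza"
}
```

The resulting representative sequences can then go through the same classification and BLAST checks as the merged ASVs for comparison.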

Good luck!

paperwolf · April 25, 2025, 5:42pm

Thank you so much for your help @Nicholas_Bokulich ! And thank you to @SoilRotifer and @colinbrislawn for also assisting earlier! Much appreciated!
