Training feature classifiers with Silva 138 SSURef NR99 full-length sequences

Dear All,
I have six samples with bacterial, archaeal, and fungal sequence data combined in paired-end reads (12 reads). All three were sequenced separately and then merged by the sequencing core (I don't know whether that is right). The following commands include the primers that were used for sequencing.

BACTERIA Illumina 16S rRNA primers

qiime feature-classifier extract-reads \
--i-sequences silva-138-99-seqs.qza \
--p-f-primer CCTACGGGNGGCWGCAG \
--p-r-primer GACTACHVGGGTATCTAATCC \
--p-trunc-len 270 \
--p-min-length 200 \
--p-max-length 400 \
--o-reads bacteria_ref-seqs.qza

ARCHAEA primers 515F (Caporaso)–806R (Caporaso)

qiime feature-classifier extract-reads \
--i-sequences silva-138-99-seqs.qza \
--p-f-primer GTGYCAGCMGCCGCGGTAA \
--p-r-primer GGACTACHVGGGTWTCTAAT \
--p-trunc-len 270 \
--p-min-length 200 \
--p-max-length 400 \
--o-reads Archea_ref-seqs.qza

FUNGI primers EMP: 563F/1132R

qiime feature-classifier extract-reads \
--i-sequences silva-138-99-seqs.qza \
--p-f-primer GCCAGCAVCYGCGGTAAY \
--p-r-primer CCGTCAATTHCTTYAART \
--p-trunc-len 270 \
--p-min-length 200 \
--p-max-length 400 \
--o-reads Fungi_ref-seqs.qza

I have two questions:

I could train classifiers for bacteria and fungi. For archaea, I could extract the reference sequences using the above command (the output Archea_ref-seqs.qza file was 14.2 MB), but I couldn't get a classifier.qza. The command below didn't produce any output:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads Archea_ref-seqs.qza \
--i-reference-taxonomy silva-138-99-tax.qza \
--o-classifier Archea_classifier.qza

It ran for a very long time without producing any output.
Am I using the right database for archaea (Silva 138 SSURef NR99)?

For fungi, everything completed without errors, but the output taxonomy for my six samples was still 99% bacteria. I'm attaching a screenshot below. Again, am I using the right database for fungi? My understanding is that the Silva database contains everything.

NOTE: These are air quality samples collected using air filters.

Hi @drmusk,

It'd be a good idea to get as much information as you can about how the samples were sequenced. Ideally, you should process the data yourself, starting with the raw sequencing data as it comes off the sequencer.

I assume these are from separate sequencing runs? Or were these three different amplicons (bacterial 16S, archaeal 16S, and fungal 18S) sequenced simultaneously on one run? Either approach is fine; I just want more details on what is contained within your sequencing files. I assume your data have been separated into "bacterial", "archaeal", and "fungal" targeted amplicons? Or are they all in one large file? It'd be best to import the sequencing data for each sequenced amplicon set separately.

The feature-classifier extract-reads ... commands for creating amplicon-specific classifiers for your bacterial and archaeal sequences look fine. Often there is no need to specify the --p-trunc-len ... --p-min-length ... --p-max-length ... options. I'd simply remove them, as you run the risk of discarding valid reference data that is shorter or longer than the average expected sequence length. This ties into your next question:

It looks like these are the 18S rRNA micro-eukaryote primers from Hugerth et al. 2014? There are many representative fungal 18S rRNA gene sequences within the SILVA database; you can check here. But there is a chance that the --p-trunc-len ... --p-min-length ... --p-max-length ... options are removing many valid fungal references, leaving predominantly bacterial sequence data in your resulting classifier. Again, see what you can retain before applying these truncation and length options.
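To see what the extraction actually retains, one option is to export the extracted reads and the taxonomy, then tally the retained references by domain. The following is only a rough sketch: the function name `count_domains` and the directory names `fungi_export` and `tax_export` are hypothetical, and it assumes the taxonomy strings begin with a domain rank such as `d__Bacteria` followed by `;`.

```shell
# Assumed to have been run beforehand (not executed here):
#   qiime tools export --input-path Fungi_ref-seqs.qza --output-path fungi_export
#   qiime tools export --input-path silva-138-99-tax.qza --output-path tax_export
# These exports should contain dna-sequences.fasta and taxonomy.tsv respectively.

# count_domains FASTA TAXONOMY_TSV
# Prints "<domain><TAB><count>" for every sequence ID present in the FASTA.
count_domains() {
  awk -F'\t' '
    FNR == NR {                      # first file: the FASTA
      if ($0 ~ /^>/) {
        id = substr($0, 2)           # drop the leading ">"
        sub(/[ \t].*/, "", id)       # keep only the ID, not the description
        keep[id] = 1
      }
      next
    }
    ($1 in keep) {                   # second file: the taxonomy table
      split($2, rank, ";")
      n[rank[1]]++                   # first rank is assumed to be the domain
    }
    END { for (d in n) print d "\t" n[d] }
  ' "$1" "$2" | sort
}

# Only runs if the hypothetical export directories exist.
if [ -f fungi_export/dna-sequences.fasta ] && [ -f tax_export/taxonomy.tsv ]; then
  count_domains fungi_export/dna-sequences.fasta tax_export/taxonomy.tsv
fi
```

Running this once on reads extracted with the length options and once without them would show whether those options are what removes the eukaryotic references.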

One thing to consider: primers can be "leaky", that is, they can amplify off-targets. I'd not be surprised if you amplify bacterial sequences from time to time, as 16S and 18S are homologues, even if these primers are meant to preferentially amplify microbial eukaryotes.


Dear @SoilRotifer thanks for your reply.
To say more about the sequencing protocol:

The sequencing core performed three failed runs for bacteria, fungi, and archaea (separate runs) with a 25% PhiX spike-in. The runs failed because of the very low diversity in the samples. The DNA concentration in the samples was also very low, so to obtain a decent concentration of amplicons they performed PCR twice. Later they pooled the bacteria, fungi, and archaea libraries equally, along with a 25% PhiX spike-in to improve the diversity. This run was successful, with the bacteria, fungi, and archaea libraries pooled equally.

Yes, the 18S rRNA micro-eukaryote primers were taken from Hugerth et al. 2014.

Now, for the fungi feature-classifier extract-reads ... command, I didn't specify the --p-trunc-len ... --p-min-length ... --p-max-length ... options.

Still, the output is the same: 99% bacterial sequences.

Thank you for the information @drmusk.

Upon further reading, I confirmed my earlier suspicion about the primer set being "leaky". That is, the 563F/1132R primer set is known to amplify bacteria quite well, especially if the eukaryotes are less abundant than the bacteria in the sample. See Figure 1 of Kounosu et al. 2019.
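As a rough illustration of this kind of off-target matching, one could expand the degenerate forward primer into a regular expression and count exact hits in a FASTA file of bacterial references. This is only a sketch under several assumptions: the helper names `iupac_to_regex` and `count_primer_hits` and the file name `bacteria.fasta` are hypothetical, the FASTA is assumed to hold one unwrapped DNA sequence per record, and only exact matches to the forward primer are counted, whereas real PCR tolerates some mismatches and also depends on the reverse primer.

```shell
# Expand IUPAC ambiguity codes in a primer into a grep -E character-class regex.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g'  -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g'  -e 's/M/[AC]/g'  -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# count_primer_hits PRIMER FASTA
# Counts sequence lines containing an exact match to the expanded primer.
count_primer_hits() {
  grep -v '^>' "$2" | grep -c -E "$(iupac_to_regex "$1")" || true
}

# Only runs if the hypothetical input file exists.
if [ -f bacteria.fasta ]; then
  count_primer_hits GCCAGCAVCYGCGGTAAY bacteria.fasta
fi
```

A high hit count among bacterial references would be consistent with the primer binding bacterial 16S sequences, though it does not by itself demonstrate amplification.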

I'm not sure what to recommend at this stage other than trying other DNA extraction and amplification methods that are optimized for low microbial biomass (e.g. KatharoSeq), or searching for other 18S rRNA gene primer sets that are less likely to amplify bacterial and archaeal sequences, as referenced in the Kounosu et al. paper I linked above. There may be other primer sets out there too.

-Good Luck!
-Mike
