Classify the phylum of d__Eukaryota;__

Hi

I already conducted my 18s amplicon sequencing

Here is the primer I use for PCR
FR-1 = ANCCATTCAATCGGTANT
Fun18s = CCATGCATGTCTAAGTWTAA

I use MinION MK1B for the sequencing, the library was prepared using the Oxford Nanopore Rapid Barcoding Kit (SQKRBK114.24).
MinKNOW were used for basecalling, adapter trimming and barcode trimming.
The sequencing runs were stopped after all the samples reached at least 1.5k reads.

I use this command to combine my fastq file in linux
cat *.fastq.gz > merged.fastq.gz

Then the fastq file is feed into fastQC and MultiQC in Galaxy (https://usegalaxy.org/)

The insight from the fastQC and MultiQC is use for later quality control steps in fastp and porechop also in Galaxy (https://usegalaxy.org/)

Porechop

  • Output format for the reads: fastq
  • Barcode binning setting
  • Percent identity for binning = 75
  • Minimum difference in identity = 5.0
  • Require two matches for binning = No
  • Discard unassigned reads = No

Adapter search setting

  • Minimum identity = 90
  • Number of alligned reads = 10000
  • Scoring scheme = 3,-6,-5,-2

End adapter setting

  • Number of bases =150
  • Minimum trim length = 4
  • Extra bases trimmed = 2
  • Minimum percent identity = 75

Middle adapter setting

  • Skip splitting = No
  • Discard reads with middle adapter = No
  • Minimum percent identity = 85
  • Extra trimming good side = 10
  • Extra trimming bad side = 100
  • Minimum length reads post-split = 1000

fastp
Single-end

Adapter trimming options
Filter options

  • Qualified quality phred = 9
    Length filtering option
  • Length required = 200
  • Maximum length = 1000
    Read Modification Options
  • PolyG tail trimming = Disable polyG tail trimming

Here is the result from the fastQC and MultiQC after the quality control

Next the fastq file in feed into Qiime2 Amplicon Distribution Environment

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest_18s_dr.tsv \
  --output-path single-end-demux.qza \
  --input-format SingleEndFastqManifestPhred33V2

qiime demux summarize \
  --i-data single-end-demux.qza \
  --o-visualization demux.qzv

qiime vsearch dereplicate-sequences \
  --i-sequences single-end-demux.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv \
  --m-sample-metadata-file metadata_18s.tsv
qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

Then i use the q2-feature-classifier

I downloaded and import the pre-formatted SILVA reference sequence and taxonomy files available at the Qiime2 Data resources
Silva 138 SSURef NR99 full-length sequences (MD5: de8886bb2c059b1e8752255d271f3010)
Silva 138 SSURef NR99 full-length taxonomy (MD5: f12d5b78bf4b1519721fe52803581c3d)

qiime feature-classifier extract-reads \
  	--i-sequences silva-138-99-seqs.qza \
  	--p-f-primer ANCCATTCAATCGGTANT \
  	--p-r-primer CCATGCATGTCTAAGTWTAA \
  	--o-reads silva-138-99-ref-seqs-region.qza

qiime feature-classifier fit-classifier-naive-bayes \
  	--i-reference-reads silva-138-99-ref-seqs-region.qza \
  	--i-reference-taxonomy silva-138-99-tax.qza \
  	--o-classifier silva-138-99-classifier.qza

qiime feature-classifier classify-sklearn \
  	--i-classifier silva-138-99-classifier.qza \
  	--i-reads rep-seqs.qza \
  	--o-classification taxonomy.qza

qiime metadata tabulate \
  	--m-input-file taxonomy.qza \
  	--o-visualization taxonomy.qzv

Next I proceed with closed reference clustering at 90%

qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences silva-138-99-seqs.qza  \
  --p-perc-identity 0.90 \
  --o-clustered-table table-cr-90.qza \
  --o-clustered-sequences rep-seqs-cr-90.qza \
  --o-unmatched-sequences unmatched-cr-90.qza

qiime feature-table summarize \
  --i-table table-cr-90.qza \
  --o-visualization table-cr-90.qzv \
  --m-sample-metadata-file metadata_18s.tsv
qiime feature-table tabulate-seqs \
  --i-data rep-seqs-cr-90.qza \
  --o-visualization rep-seqs-cr-90.qzv

Lastly I construct a taxa barplot

qiime taxa barplot \
  --i-table table-cr-90.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file metadata_18s.tsv \
  --o-visualization taxa-bar-plots-cr-90.qzv

Here is the taxa bar plot

Here are my questions

  1. How can I know the phylum of d__Eukaryota;__?
  2. What is the most correct way to eliminate the phyla that are not Fungi from the taxa bar plot (also the d__Bacteria;__)?
  3. Why is Fungi also classified as phylum? Can I further assign the d__Eukaryota;p__Fungi?
1 Like

Hi @zahir,
Great Questions! Here are my thoughts:

In my experience getting a taxonomic label that isnt very well annotated like d__Eukaryota;__ is a sign that the sequence isnt "known" by the database. I usually work with 16s and what I see is that if I BLAST a sequences that is labeled d_bacteria; only it usually shows up as host DNA that snuck its way in.

You can test this by running qiime feature-table tabulate-seqs ( feature-table - Microbiome marker gene analysis with QIIME 2) with your taxonomy. This will let you look at what labels don't have phylum annotations and by clicking on the sequences it will BLAST your sequence for you. If you are seeing that these are mostly non-target sequences I would filter out any reads that aren't annotated to the phylum (I generally do this filtering for my analysis). Here are the docs on filtering and there is an example of filtering out taxonomic labels that aren't to the phylum: Filtering data — QIIME 2 2024.10.1 documentation

its important to note that the level you select on the taxa barplot is the furtherest resolution you will get, it looks like you have Level 2 selected, which means you will only see down to the phylum.

Its kinda hard to tell in this taxabar plot but there is a difference between d__Eukaryota;__ and d__Eukaryota;p__;c__;o__;f__;g__;. The first one (d__Eukaryota;__ ) means the classifier had really low confidence so its assigned up at the domain level and the second one (d__Eukaryota;p__;c__;o__;f__;g__;__ ) means that a close sequence match was in the database but there is a really poor annotation of it.

Id use the filtering by taxonomy docs I linked above!

Its hard to tell if this is the trucation of the taxonomic label by the taxa barplot or if this is how the your actual taxonomic label. I would use the tips I described above to answer this question as well!

3 Likes

Thanks a lot for your answer

I also conduct the analysis for my 16s amplicon (V1-V9) data using the same procedure
With minor adjustment to suit the 16s data

  • Primer: 27F & 1492R
  • fastp = * Length required = 200 & Maximum length = 800

MultiQC in Galaxy before quality control

MultiQC in Galaxy after quality control

Here is the taxa bar plot

I use the same SILVA database and closed reference at 90%

So here are my questions:

  1. I assume that I can apply the same method of BLAST and filtering data to eliminate the "d__Bacteria;__"?
  2. All of the samples are soil rhizosphere sample, is it normal have almost 90% of d__Bacteria;p__Proteobacteria? I look at some other literatures that using the same kind of sample, it have a huge portion of proteobacteria, but as high as mine. Somehow I am not very confident with my data.
  3. As I have a huge numbers of sequence to be BLAST manually, is there any way to run BLAST for all the "d__Bacteria;__" in one time?

Thanks a lot

Hi @zahir

Yes!

You could try another classifier, or you could download some publically available data as a reference. I personally am not sure about proportions of Proteobacteria.

I dont know if there is a way to do that in QIIME 2. I personally just spot check to make sure that its not consistent contamination

2 Likes

Thanks a lot for your help
much appreciate it

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.