More than 50% taxonomy is unassigned

Reeshma_Hussain · May 31, 2024, 12:11pm

I have been trying to assign taxonomy, but whichever classifier I use, more than 50% of the sequences are unassigned. Previously I did the same for multiple files from a BioProject and it worked fine, with very few unassigned taxa, but now I have taken a subset of those files and more than half seem to be unassigned.
I have tried training my own classifier using RESCRIPt, which didn't work. I have also tried different pretrained classifiers and weighted classifiers, but nothing seems to work. It would be great to know what I'm doing wrong.

buzic · May 31, 2024, 12:58pm

Hi @Reeshma_Hussain ,

Welcome to the :qiime2: forum. If you can share a little more information, people here will be able to give a more specific answer and help troubleshoot your issue.
For example:
What amplicon/taxa target are your data?
What environment are your data from?
What process did you use to clean up and quality control your data, and how did it turn out (e.g. adapter trimming via cutadapt, denoising via DADA2, etc.)?
What assignment method are you trying to use (e.g. feature-classifier classify-sklearn)?

This way the community on the forum can help.

If you have tried lots of classifiers and none of them work adequately, that suggests the issue may lie upstream of this step rather than in the classifier itself.

best,

Vic

Reeshma_Hussain · June 1, 2024, 8:04am

Sorry, let me clarify.
It is vaginal microbiome data, 16S sequences. I used DADA2 for denoising. I used the pretrained SILVA classifier for the V4 region from the data resources page ("Silva 138 99% OTUs from 515F/806R region of sequences"). I also used the weighted one.
I hope this answers the question.

If the upstream process had issues, my previous taxonomy assignment would have been affected too, because both are from the same BioProject and I used the same parameters.

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.tsv \
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33V2

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-max-ee-f 2 \
  --p-max-ee-r 2 \
  --p-trunc-q 2 \
  --p-trunc-len-f 245 \
  --p-trunc-len-r 235 \
  --p-n-threads 8 \
  --o-table 16S-table.qza \
  --o-representative-sequences 16S-rep-seqs.qza \
  --o-denoising-stats 16S-denoising-stats.qza \
  --verbose

qiime metadata tabulate \
  --m-input-file 16S-denoising-stats.qza \
  --o-visualization 16S-denoising-stats.qzv

qiime feature-table summarize \
  --i-table 16S-table.qza \
  --o-visualization 16S-table.qzv \
  --m-sample-metadata-file metadata.tsv

qiime feature-table tabulate-seqs \
  --i-data 16S-rep-seqs.qza \
  --o-visualization 16S-rep-seqs.qzv

qiime feature-classifier classify-sklearn \
  --i-classifier 515f-806r-uniform-classifier.qza \
  --i-reads 16S-rep-seqs.qza \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

I am new to this, so please do let me know if I'm missing something.
Thank you.

buzic · June 3, 2024, 12:06pm

Hi Reeshma,

Thanks for all the extra information, that's really helpful. I think you are right about this:

Would you be willing to post the two taxonomy.qza files from the two projects so we could take a look at their provenance?

Another thing that might help is if you can share the two 16S-denoising-stats.qza files because DADA2 is another place where things can go wrong upstream.
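
If it helps, one way to spot trouble in DADA2 is to check what fraction of each sample's input reads survives to the non-chimeric step. Below is a minimal sketch, not an official QIIME 2 tool: `read_retention` is a hypothetical helper, and it assumes the stats artifact has already been exported with `qiime tools export` to a tab-separated `stats.tsv` whose header includes `input` and `non-chimeric` columns.

```shell
# Print the fraction of input reads kept as non-chimeric, one line per sample.
# $1: stats.tsv exported from a DADA2 denoising-stats artifact (assumed layout:
#     tab-separated, header row with "input" and "non-chimeric" columns).
read_retention() {
  awk -F'\t' '
    NR == 1 {
      # Locate the two columns by name rather than by position.
      for (i = 1; i <= NF; i++) {
        if ($i == "input") in_col = i
        if ($i == "non-chimeric") out_col = i
      }
      next
    }
    /^#/ { next }   # skip the "#q2:types" metadata row
    in_col && out_col && $in_col > 0 {
      printf "%s\t%.3f\n", $1, $out_col / $in_col
    }
  ' "$1"
}
```

For example, after `qiime tools export --input-path 16S-denoising-stats.qza --output-path exported-stats`, calling `read_retention exported-stats/stats.tsv` prints each sample ID with its retained fraction. Consistently low values across samples may hint that the problem sits in denoising rather than in the classifier.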

all the best,

Vic

Reeshma_Hussain · June 4, 2024, 8:43am

These are the new taxonomy.qza and denoising-stats.qza files:
16S-denoising-stats.qza (15.6 KB)
taxonomy.qza (148.5 KB)

Reeshma_Hussain · June 4, 2024, 8:44am

I think I have deleted the previous taxonomy.qza and denoising stats. Is there anything else I could provide that would help?
I have attached the level 5 bar plot, which is the only thing I have.

Thank you in advance

buzic · June 4, 2024, 12:54pm

Hi again @Reeshma_Hussain ,

Thanks for those, I'll take a look. One more question: there are lots of samples in your graph. Were they all processed together in DADA2, and are they all from one sequencing run?

cheers,

Vic

Reeshma_Hussain · June 4, 2024, 4:46pm

Hi,
Initially I had taken a subset of 800 samples out of 2300 from a BioProject with the accession ID PRJNA393472, but I later realized they were longitudinal, so I had to go back and redo it with one set of samples.
As for your question: yes, they were all processed together in DADA2.
Thank you very much for the quick reply.

buzic · June 5, 2024, 10:27am

Hi again,

I would double check that these are all from the same sequencing run. DADA2 estimates error profiles from the data itself, so you should run it separately for each sequencing run. While these are all under the same BioProject, that doesn’t mean they were all sequenced on the same run. A BioProject can contain many runs, as it covers all the sequencing for one project or initiative. 800 or 2300 samples per run seems like a lot to me, but I must hold my hands up and say I’ve never worked with human data. I would have a look at the metadata supplied to the SRA and double check.

Reeshma_Hussain · June 9, 2024, 9:44am

Sorry if this is a stupid question, but what is the relevance of the sequencing runs? This is my absolute first time doing any analysis. I have attached the metadata of the BioProject; there is a column "IL_run" with values such as IL06. Maybe that has something to do with the sequencing run?
sra_run_meta.csv (1.5 MB)

Thank you so much.

buzic · June 10, 2024, 8:06am

Hi again,

Yes, it looks to me like “IL_Run” means Illumina run. However, I don't think “IL_Run” is a standard header required by the SRA, so, whoever produced the data has added that information, which is very helpful. But if you wanted to be sure you could always email the lead author of the paper it was published in and double check.
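
A quick way to see how the samples are spread across runs is to tally that column in the metadata file. This is only a sketch under assumptions: `count_samples_per_run` is a made-up helper name, and it assumes a plain CSV with a header row and no quoted fields containing commas (a real SRA export may need a proper CSV parser).

```shell
# Count how many rows (samples) fall under each value of a run column.
# $1: CSV file with a header row (assumed: no quoted fields containing commas)
# $2: name of the column holding the run label (matched case-insensitively)
count_samples_per_run() {
  awk -F',' -v col="$2" '
    { sub(/\r$/, "") }   # tolerate Windows line endings
    NR == 1 {
      for (i = 1; i <= NF; i++) if (tolower($i) == tolower(col)) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    $idx != "" { count[$idx]++ }
    END { for (run in count) print run "\t" count[run] }
  ' "$1" | sort
}
```

Calling `count_samples_per_run sra_run_meta.csv IL_run` would then list each run label next to its sample count. The column name is matched case-insensitively because the exact header spelling may differ from what is quoted here.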

You need to denoise the runs separately and then merge the resulting feature tables and rep-seqs, which would go something like this:

qiime feature-table merge \
  --i-tables A_table.qza \
  --i-tables B_table.qza \
  --o-merged-table merged-table.qza

qiime feature-table merge-seqs \
  --i-data A_rep-seqs.qza \
  --i-data B_rep-seqs.qza \
  --o-merged-data merged-rep-seqs.qza

Reeshma_Hussain · June 10, 2024, 12:46pm

Hi, just to be sure: for each run I'll make a separate manifest file, denoise them, and then merge them?

cherman2 · June 10, 2024, 4:57pm

Yep exactly!
Make separate manifests, import them separately, denoise them separately and lastly merge the tables and rep-seqs that come out of your denoising step!
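
The steps above can be sketched as a few shell functions. Treat this as a rough outline under assumptions, not a tested pipeline: the function names, the `runs.tsv` sample-to-run map, and the per-run file naming are all hypothetical, while the QIIME 2 commands and parameters are the ones already posted in this thread.

```shell
# Split a paired-end manifest into one manifest per sequencing run.
# $1: manifest.tsv (header line, then one tab-separated line per sample)
# $2: hypothetical two-column map: sample-id <TAB> run label
# $3: output directory; writes <run>-manifest.tsv for each run
split_manifest_by_run() {
  mkdir -p "$3"
  awk -F'\t' -v outdir="$3" '
    NR == FNR { run[$1] = $2; next }   # first file: sample -> run
    FNR == 1  { header = $0; next }    # second file: remember the header
    ($1 in run) {
      out = outdir "/" run[$1] "-manifest.tsv"
      if (!(out in seen)) { print header > out; seen[out] = 1 }
      print > out
    }
  ' "$2" "$1"
}

# Import and denoise a single run with the parameters used earlier in the thread.
# $1: run label, $2: directory holding <run>-manifest.tsv
denoise_run() {
  run="$1"; dir="$2"
  qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path "$dir/$run-manifest.tsv" \
    --output-path "$dir/$run-demux.qza" \
    --input-format PairedEndFastqManifestPhred33V2 || return 1
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs "$dir/$run-demux.qza" \
    --p-max-ee-f 2 --p-max-ee-r 2 --p-trunc-q 2 \
    --p-trunc-len-f 245 --p-trunc-len-r 235 \
    --p-n-threads 8 \
    --o-table "$dir/$run-table.qza" \
    --o-representative-sequences "$dir/$run-rep-seqs.qza" \
    --o-denoising-stats "$dir/$run-denoising-stats.qza"
}

# Merge the per-run tables and rep-seqs. $1: directory, remaining args: run labels
merge_runs() {
  dir="$1"; shift
  table_args=""; seq_args=""
  for run in "$@"; do
    table_args="$table_args --i-tables $dir/$run-table.qza"
    seq_args="$seq_args --i-data $dir/$run-rep-seqs.qza"
  done
  # Left unquoted on purpose so the flags split into words; paths must not contain spaces.
  qiime feature-table merge $table_args --o-merged-table "$dir/merged-table.qza" || return 1
  qiime feature-table merge-seqs $seq_args --o-merged-data "$dir/merged-rep-seqs.qza"
}

# Example usage (run labels are placeholders):
#   split_manifest_by_run manifest.tsv runs.tsv per-run
#   for r in IL06 IL07; do denoise_run "$r" per-run || break; done
#   merge_runs per-run IL06 IL07
```

Keeping the truncation parameters identical across runs matters here, since the merge only lines up features whose sequences match exactly.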

Reeshma_Hussain · June 11, 2024, 6:31am

Thank you so much. I was stuck here for a while.

Reeshma_Hussain · June 11, 2024, 10:31am

I would like to know what difference it would make if I denoise them separately but using the same parameters?

buzic · June 11, 2024, 12:41pm

Hi again,

I tried to touch on this earlier:

But specifically, the reason you need to denoise each sequencing run separately is that DADA2 builds its error model from the data itself, and each sequencing run will have a different error profile. So combining runs will just confuse the model.

I hope that's helpful.

system · July 12, 2024, 6:41pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.