More than 50% of taxonomy is unassigned

I have been trying to assign taxonomy, but whichever classifier I use, more than 50% come out unassigned. Previously I did the same for multiple files from a BioProject and it worked fine, with very few unassigned taxa. Now I have taken a subset of those files, and more than half seem to be unassigned.
I have tried training my own classifier using RESCRIPt, which didn't work. I have also tried different pretrained classifiers and weighted classifiers, but nothing seems to work. It would be great to know what I'm doing wrong.

Hi @Reeshma_Hussain ,

Welcome to the :qiime2: forum. I think if you give everyone here a little bit more information it will allow them to give a more specific answer and help people troubleshoot your issue.
For example:
What amplicon/taxa target are your data?
What environment are your data from?
What process did you use to clean up and quality control your data, and how did it turn out (e.g. adapter trimming via cutadapt, denoising via DADA2, etc.)?
What assignment method are you trying to use (e.g. feature-classifier classify-sklearn)?

This way the community on the forum can help :+1:

If you have tried lots of classifiers and none of them works adequately, that suggests the issue may lie elsewhere rather than in the classifier itself, possibly upstream of this step.
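
To put a number on "more than 50%", here is a minimal Python sketch. It assumes the taxonomy has been exported to a tab-separated file with `Feature ID`, `Taxon` and `Confidence` columns, and that unclassified features carry the label `Unassigned`. The feature IDs and rows in the demo are made up for illustration.

```python
import csv
import io


def unassigned_fraction(tsv_file, label="Unassigned"):
    """Return the fraction of features whose Taxon equals `label`."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    total = 0
    unassigned = 0
    for row in reader:
        total += 1
        if row["Taxon"].strip() == label:
            unassigned += 1
    if total == 0:
        raise ValueError("no features found in the taxonomy file")
    return unassigned / total


# Made-up stand-in for an exported taxonomy.tsv (hypothetical feature IDs)
demo = io.StringIO(
    "Feature ID\tTaxon\tConfidence\n"
    "f1\td__Bacteria; p__Firmicutes\t0.99\n"
    "f2\tUnassigned\t0.45\n"
    "f3\tUnassigned\t0.51\n"
    "f4\td__Bacteria\t0.80\n"
)
fraction = unassigned_fraction(demo)
print(f"{fraction:.0%} of features are unassigned")
```

Note that this counts features, not reads, so it can differ from what a taxa bar plot shows if the unassigned features are unusually abundant or rare.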

best,

Vic

Sorry, let me clarify.
These are 16S sequences from the vaginal microbiome. I used DADA2 for denoising, and the pretrained SILVA classifier for the V4 region from the data resources page ("Silva 138 99% OTUs from 515F/806R region of sequences"). I also used the weighted one.
I hope this answers the question.

If the upstream process had issues, they would also have shown up in my previous taxonomy assignment, because both datasets are from the same BioProject and I used the same parameters.

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.tsv \
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33V2

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-max-ee-f 2 \
  --p-max-ee-r 2 \
  --p-trunc-q 2 \
  --p-trunc-len-f 245 \
  --p-trunc-len-r 235 \
  --p-n-threads 8 \
  --o-table 16S-table.qza \
  --o-representative-sequences 16S-rep-seqs.qza \
  --o-denoising-stats 16S-denoising-stats.qza \
  --verbose

qiime metadata tabulate \
  --m-input-file 16S-denoising-stats.qza \
  --o-visualization 16S-denoising-stats.qzv

qiime feature-table summarize \
  --i-table 16S-table.qza \
  --o-visualization 16S-table.qzv \
  --m-sample-metadata-file metadata.tsv

qiime feature-table tabulate-seqs \
  --i-data 16S-rep-seqs.qza \
  --o-visualization 16S-rep-seqs.qzv

qiime feature-classifier classify-sklearn \
  --i-classifier 515f-806r-uniform-classifier.qza \
  --i-reads 16S-rep-seqs.qza \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

I am new to this, so please let me know if I'm missing something.
Thank you.

Hi Reeshma,

Thanks for all the extra information, that's really helpful. I think you are right that, if the upstream processing were at fault, your earlier taxonomy assignment should have been affected too.

Would you be willing to post the two taxonomy.qza files from the two projects so we could take a look at their provenance?

Another thing that might help is if you can share the two 16S-denoising-stats.qza files, because DADA2 is another place where things can go wrong upstream.

all the best,

Vic

These are the new taxonomy.qza and denoising-stats.qza files:
16S-denoising-stats.qza (15.6 KB)
taxonomy.qza (148.5 KB)

I think I have deleted the previous taxonomy.qza and denoising-stats.qza. Is there anything else I could provide that would help?
I have attached the level 5 bar plot, which is the only thing I have.


Thank you in advance

Hi again @Reeshma_Hussain ,

Thanks for those, I'll take a look. One more question: there are lots of samples in your graph. Were they all processed together in DADA2, and are they all from one sequencing run?

cheers,

Vic

Hi,
Initially I had taken a subset of 800 samples out of 2300 from a BioProject with the accession ID PRJNA393472, but later realized they were longitudinal, so I had to go back and do it with one set of samples.
As for your question: yes, they were all processed together in DADA2.
Thank you very much for the quick reply.

Hi again, :wave:

I would double check that these are all from the same sequencing run. DADA2 estimates error profiles from the data itself, so it should be run separately on each sequencing run. While these samples are all under the same BioProject, that doesn’t mean they were all sequenced on the same run: a BioProject covers all the sequencing for one project or initiative and can contain many runs. 800 or 2300 samples per run would seem a lot to me, but I must hold my hands up and say I’ve never worked with human data. I would have a look at the metadata supplied to the SRA and double check.

Sorry if this is a stupid question, but what is the relevance of the sequencing runs? This is my absolute first time doing any analysis. I have attached the metadata of the BioProject; there is a column "IL_run" with values such as IL06. Maybe that has something to do with the sequencing run?
sra_run_meta.csv (1.5 MB)

Thank you so much.

Hi again,

Yes, it looks to me like “IL_Run” means Illumina run. I don't think “IL_Run” is a standard header required by the SRA, so whoever produced the data has added that information, which is very helpful. If you want to be sure, you could always email the lead author of the paper it was published in and double check.
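
A quick way to see how many runs are involved is to tally the samples per value of that column. This is a minimal Python sketch, assuming the metadata is a plain CSV with a header row and the column is spelled `IL_run` as in the post above. The `Run` column and the accession-like IDs in the demo are made up for illustration and are not taken from the attached file.

```python
import csv
import io
from collections import Counter


def samples_per_run(csv_file, run_column="IL_run"):
    """Count how many rows (samples) belong to each sequencing run."""
    reader = csv.DictReader(csv_file)
    if run_column not in (reader.fieldnames or []):
        raise KeyError(f"column {run_column!r} not found in the metadata header")
    return Counter(row[run_column] for row in reader)


# Made-up stand-in for the metadata file (hypothetical IDs and run labels)
demo = io.StringIO(
    "Run,IL_run\n"
    "SRR0000001,IL06\n"
    "SRR0000002,IL06\n"
    "SRR0000003,IL07\n"
)
counts = samples_per_run(demo)
for run, n in sorted(counts.items()):
    print(run, n)
```

For the real file, the same function could be called as `samples_per_run(open("sra_run_meta.csv", newline=""))`. More than one distinct value would mean the samples span several sequencing runs.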

You then need to denoise the runs separately and use the merge functions on the resulting feature tables and rep-seqs, which would go something like this:

qiime feature-table merge \
  --i-tables A_table.qza \
  --i-tables B_table.qza \
  --o-merged-table merged-table.qza

qiime feature-table merge-seqs \
  --i-data A_rep-seqs.qza \
  --i-data B_rep-seqs.qza \
  --o-merged-data merged-rep-seqs.qza

Hi, just to be sure: for each run I'll make a separate manifest file, denoise each one, and then merge them?

Yep exactly!
Make separate manifests, import them separately, denoise them separately and lastly merge the tables and rep-seqs that come out of your denoising step!
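
Splitting one big manifest into per-run manifests can be scripted. The sketch below is a minimal Python example, assuming a tab-separated manifest in the `PairedEndFastqManifestPhred33V2` layout (`sample-id`, `forward-absolute-filepath`, `reverse-absolute-filepath`) and a sample-to-run lookup built from the metadata. The sample IDs, file paths and run labels are hypothetical.

```python
import csv
import io
from collections import defaultdict


def split_manifest_by_run(manifest_text, run_of_sample):
    """Return {run: manifest text} with one manifest per sequencing run."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    header = reader.fieldnames
    rows_by_run = defaultdict(list)
    for row in reader:
        # A sample missing from the lookup raises KeyError, on purpose
        rows_by_run[run_of_sample[row["sample-id"]]].append(row)

    manifests = {}
    for run, rows in rows_by_run.items():
        out = io.StringIO()
        writer = csv.DictWriter(
            out, fieldnames=header, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
        manifests[run] = out.getvalue()
    return manifests


# Hypothetical manifest and sample-to-run lookup
manifest_text = (
    "sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n"
    "S1\t/data/S1_R1.fastq.gz\t/data/S1_R2.fastq.gz\n"
    "S2\t/data/S2_R1.fastq.gz\t/data/S2_R2.fastq.gz\n"
    "S3\t/data/S3_R1.fastq.gz\t/data/S3_R2.fastq.gz\n"
)
run_of_sample = {"S1": "IL06", "S2": "IL07", "S3": "IL06"}

manifests = split_manifest_by_run(manifest_text, run_of_sample)
for run in sorted(manifests):
    print(f"--- manifest-{run}.tsv ---")
    print(manifests[run], end="")
```

Each resulting manifest would then be written to its own file and taken through the same `qiime tools import` and `qiime dada2 denoise-paired` commands before merging.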

Thank you so much. I was stuck here for a while.

I would like to know what difference it makes if I denoise them separately but with the same parameters?

Hi again, :wave:

I tried to touch on this earlier. Specifically, the reason you need to denoise each sequencing run separately is that DADA2 builds its error model from the data itself, and each sequencing run has a different error profile. Combining runs therefore just confuses the model.
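
As a toy illustration only (this is plain arithmetic, not DADA2's actual error model, and the counts are made up), here is what happens to a single pooled error rate when two hypothetical runs with different error rates are combined:

```python
def error_rate(errors, bases):
    """Observed errors per sequenced base."""
    return errors / bases


# Made-up counts for two hypothetical sequencing runs
run_a = {"bases": 1_000_000, "errors": 1_000}
run_b = {"bases": 1_000_000, "errors": 10_000}

rate_a = error_rate(run_a["errors"], run_a["bases"])
rate_b = error_rate(run_b["errors"], run_b["bases"])
pooled = error_rate(
    run_a["errors"] + run_b["errors"],
    run_a["bases"] + run_b["bases"],
)

print(f"run A:  {rate_a:.4f}")
print(f"run B:  {rate_b:.4f}")
print(f"pooled: {pooled:.4f}")
```

The pooled rate sits between the two, so it describes neither run well: it would overstate the errors in run A and understate them in run B. Denoising each run on its own avoids that mismatch.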

I hope that's helpful :blush:
