problems of classification with low biomass samples

Hi forum,

I am writing because I am in the following strange situation:
I have a set of patient derived low biomass 16S data.

I have problems related to uneven coverage I tried to fix by subsampling however, even if this trick was useful in another setting this time I have still problems of classifciation in the sense that I find mostly "bacteria", OD1 ir unclassified, this is not in line with previous results and I suspected a technical issue.
I imagine that I have one or more sample disturbing in a way the classification (I use vsearch) I have tried with only one pair of samples and indeed I have foound comfortable results.

So I have two points:

  • is there a way to identify low quality samples (maybe due to low complexity of their libraries) to exclude them from the classifying?
  • proceding slowly sample by sample would it be safe to merge the resulting taxonomy tables at least (I do not think so also becasue I would also need to calculate the tree separately and so on)

I thank the ones who could help me to solve this issue

Michela

@MichelaRiba,

Could you post the commands you ran, a screenshot of your results, and what you were expecting? Depending on the region you are targeting, there may simply not enough information in your sequences to make a more precise classification, particularly with a low biomass sample, where the quality control tools have less information to work with. Can you also describe the steps you took to prepare your data for classification?

Hi,

thanks for your kind reply.

Maybe it is not thant clear my point is how to detect problematic samples in low biomass experiments since the use of representative seuqneces from a set of samples in vsearc algorith is influenced by those to a point of no classification, restored if "good" samples only are used. What could be a suggestion to separate the samples based on their "complexity" or so?
The same can happen if you try to classify in the same batch samples from low biomass (e.g. urine) together with high biomass (e.g. gut) however here I cannot do based on metadata because they come from the same kind of tissue. Sorry for not including the code, since it is a generalk problem related to the sample composition. Thanks to whom has some insight or suggestion or maybe has faced the same problem. Michela

@MichelaRiba,

Problems with low biomass samples can be a tricky; in fact there is ongoing work on a plugin specifically to address some of the associated issues. It is definitely not an area that I am an expert in, let me reach out to a few of the other mods to get their input and get back to you.

1 Like

Hello @MichelaRiba,

One assumption that seems to be underlying the thought process here is that there is interaction between features during the classification process. Is the action that you are talking about qiime feature-classifier classify-consensus-vsearch? My understanding is that this uses alignment-based classification, and so features are classified independently.

1 Like

Hi @Keegan-Evans thanks a lot for answering and please contact me if you sould share some suggestions from reseachers in the field of low biomass samples. Thanks a lot again,

Michela

Hi @colinvwood ,

thanks for your insight.
The vsearc commands I use are the following:

joined_import_filter_derep_OTU:
	$(CONDA_ACTIVATE) Miqiime2-2021.8;\ # I have a conda environment
	qiime vsearch cluster-features-open-reference \
	--i-table table.qza \
	--i-sequences rep-seqs.qza \
	--i-reference-sequences 85_otus.qza \
	--p-perc-identity 0.85 \
	--o-clustered-table table-or-85.qza \
	--o-clustered-sequences rep-seqs-or-85.qza \
	--o-new-reference-sequences new-ref-seqs-or-85.qza \
	--p-threads 8 \
	--verbose 


taxa_classi:
	$(CONDA_ACTIVATE) Miqiime2-2021.8;\
	qiime feature-classifier classify-sklearn \
	--i-classifier gg-13-8-99-nb-classifier.qza \
	--i-reads rep-seqs-or-85.qza \
	--o-classification taxonomy10C.qza

I would like to add the following:
I do not experience problems in classification when dealing with gut microbiome samples, even if I had to consider in one situation to use subsampling to "normalize" the amount of input sequences for all samples, we had very different coverage among them (e.g. 6,000 to 2,000,000) and this helped and somehow succeeded in solving the problem of classification.

However I am now facing again the problem in an experiment of low biomass and subsampling to have even coverage is not sufficient to solve, I suspect that the original libraries have very different complexities and different sequences coming up so I suppose I am in a situation of trying and compare hardly very different samples.

If you have suggestions I am very happy to do

thanks a lot,

Michela

Hi Michela,
Can you tell us why you're running cluster-features-open-reference? That's generally not something that's done, and particularly with such a low --p-perc-identity setting, this will reduce your taxonomic resolution. I would recommend performing your taxonomic classification on the rep-seqs.qza with your classify-sklearn command - you should get much better results in that case. Unless you have a very specific reason for using cluster-features-open-reference, and using it with 85% identity, I think you should skip that step all together.

As @colinvwood pointed out, including different sample types wouldn't be a problem for taxonomic classification. However, that could cause a problem with the way you're using cluster-features-open-reference.

EDIT: I see that we use 85% as the percent identity threshold in the tutorial, but be aware that the note above that command says that this threshold is only for the purpose of this example and that a higher value should be used. If you do want to perform open-reference clustering, I recommend a threshold of 99% (0.99), but I still think this likely isn't needed for your analysis.

Hi @gregcaporaso thanks for your kind reply and suggestions.
The very reason I used the command is given by the fact that I was using that with qiime1 and spent some time also for debugging one of the releases of qiime 2 since there were problems with time of completing the command, solved also thanks to my pointing that out.

The reason for 0.85 is the one you pointed out: I have followed the tutorial and not changed in time.

may I ask you why with gut samples the problem does not manifest while with low biomass yes?

To conclude:
Should I finally run only the following

taxa_classi:
	$(CONDA_ACTIVATE) Miqiime2-2021.8;\
	qiime feature-classifier classify-sklearn \
	--i-classifier gg-13-8-99-nb-classifier.qza \
	--i-reads rep-seqs-or-85.qza \
	--o-classification taxonomy10C.qza

suppressing in toto the previous step?

joined_import_filter_derep_OTU:
	$(CONDA_ACTIVATE) Miqiime2-2021.8;\ # I have a conda environment
	qiime vsearch cluster-features-open-reference \
	--i-table table.qza \
	--i-sequences rep-seqs.qza \
	--i-reference-sequences 85_otus.qza \
	--p-perc-identity 0.85 \
	--o-clustered-table table-or-85.qza \
	--o-clustered-sequences rep-seqs-or-85.qza \
	--o-new-reference-sequences new-ref-seqs-or-85.qza \
	--p-threads 8 \
	--verbose 

so the input sequence for the classification are not clustered sequences coming from the following command:

joined_import_filter_derep:
	export HDF5_USE_FILE_LOCKING='FALSE';\
	$(CONDA_ACTIVATE) Miqiime2-2021.8;\
	qiime vsearch dereplicate-sequences \
	--i-sequences fil_joined.qza \
	--o-dereplicated-table table.qza \
	--o-dereplicated-sequences rep-seqs.qza

in the coming to the command

taxa_classi:
	$(CONDA_ACTIVATE) Miqiime2-2021.8;\
	qiime feature-classifier classify-sklearn \
	--i-classifier gg-13-8-99-nb-classifier.qza \
	--i-reads rep-seqs.qza \ #  instead of clustered sequences **rep-seqs-or-85.qza** 
	--o-classification taxonomy10C.qza

Could you please confirm?

I thank you so much

Michela

Hi @gregcaporaso

using the following commands, skipping the clustering part I do see classifcation

taxa_classi_no_clustered:
	$(CONDA_ACTIVATE) Miqiime2-2021.8;\
	qiime feature-classifier classify-sklearn \
	--i-classifier gg-13-8-99-nb-classifier.qza \
	--i-reads rep-seqs.qza \
	--o-classification taxonomy10C_no_cluster.qza


taxa_classi_no_clustered_EXPORT:
	$(CONDA_ACTIVATE) Miqiime2-2021.8;\
	qiime tools export \
	--input-path taxonomy10C_no_cluster.qza \
	--output-path taxonomy10C_no_cluster-EXP

taxa_classi_no_clusteredvis:
	$(CONDA_ACTIVATE) Miqiime2-2021.8;\
	qiime taxa barplot \
	--i-table table.qza \
	--i-taxonomy taxonomy10C_no_cluster.qza \
	--m-metadata-file mapUjoin_RibaM.tsv \
	--o-visualization taxa-bar-plotst10C_no_cluster.qzv 

taxa_classi_vis_no_clustered_EXPORT:
	$(CONDA_ACTIVATE) Miqiime2-2021.8;\
	qiime tools export \
	--input-path taxa-bar-plotst10C_no_cluster.qzv \
	--output-path taxa-bar-plots10C.qzv-EXP

Could you please confirm this is safe enough: avoid clustering of sequences and proceed directly to taxa classification?

Thanks a lot

Michaela

1 Like

Hi @MichelaRiba,
Your commands look right to me and this is the most standard way to approach the analysis these days. I'm glad to hear that you're getting better results now! If you would like to share your taxa-bar-plotst10C_no_cluster.qzv file, I can take a look at the data provenance and confirm that the workflow looks correct.

Hi,
thanks a lot.
Sure I am honoured to get your help. Just a while to check if some additonal anonymization should be done for sample names.

May I have a private message to answer and include the files? Thank you so much!

Michela

Hi @MichelaRiba,
Thanks for sharing those files by DM. I took a look at the data provenance and the workflow without the open-reference clustering is a reasonable approach, but not exactly what I would recommend. The quality control approach (qiime quality-filter q-score) and feature definition approach (qiime vsearch dereplicate-sequences) you're using is a bit outdated.

I see you're importing pre-joined sequences (i.e., SampleData[JoinedSequencesWithQuality]). Is it possible for you to import the raw paired end reads instead?

If so, I would recommend doing that, and then passing the output of your import step directly to QIIME 2's DADA2 plugin (qiime dada2 denoise-paired). This will take care of quality control, paired end read joining, and creating your feature table and asv sequence data. You would then continue with your qiime feature-classifier classify-sklearn step.

If you don't have the option of starting with the paired end (unmerged) reads, you can pass the results of your qiime quality-filter q-score command through qiime deblur denoise-16S.

Both of these pathways are illustrated in the QIIME 2 Moving Pictures tutorial, so I recommend using that as a reference for your analysis.

Hi,

thanks a lot again for going into the details of my files.

Concenring the opportunity to follow the "best pathway" the limitation I have is that I start with already demultiplexed sequences, and if I am not wrong the workflow starts from this step.
For this reason I used to join and quality filter the sequences using the tool (fastq-join) used also in qiime1. This way I start directly with joined, quality filtered sequences. Does it make sense?

Michela

Hi @MichelaRiba, It is possible to import the demultiplexed sequences and start from there with the workflow that I described. If you go that route, I recommend not doing any of the joining or quality control before importing as DADA2 expects to be provided with the raw data.

This section of the docs, and the sections below it provide instructions for importing demultiplexed sequences. This should be very similar to the approach that you took for importing your joined sequences. Let us know if you run into additional questions, and good luck with your analysis!

1 Like

Hi,

back to you again,

  • I have tried the option of no-cluster before classification, I have to make this observation: due to higher number of sequences (not sequences representative of a cluster but all) the procedure (starting from joined sequences) is slower and I would like to be sure that it could scale since I am waiting for a 10X number of samples.

so, following your suggestion about best recommended practices:

  • I am starting using the option "directly input raw data".

I find the dada2 command going on, little bit slow I think for including many steps.

I used parameter fro truncating Fw read 120 and reverse read 130, hoping this would fit based on the statistics

Thanks a lot!

Michela