Losing studies during closed reference OTU picking

Hi,

I am trying to do a meta-analysis with 9 different studies. However, when I run closed-reference OTU picking with SILVA and Greengenes, I lose 4 studies by the time I get to core-metrics, using a sampling depth of 1000 sequences per sample.

I think none of the reads from the samples in those 4 studies are matched to the closed-reference sequences, which leaves no reads in any sample from the 4 missing studies.

Is this normal? Is there another way around this problem?

Thank you,
Clinton

Hi @cbippert,

Have you tried doing clustering on the individual studies so you can see what they look like? It might help to figure out where you’re losing the data.

Best,
Justine

Hi Justine,

I don’t really understand what you mean by clustering the individual studies. Do you mean passing them individually through the closed-reference OTU picking? No, I have not done that.

However, all the data came from previously published papers that were able to cluster their samples together.

I ended up just passing it through the regular:

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza

thinking that it would fail due to the different hypervariable regions, but it ended up working. I don’t know if that is correct in what I did though.

Clinton

Hi @cbippert,

Closed-reference OTU picking can be done in parallel: each sequence is matched against the reference independently, so it doesn't matter whether you pick one sample or 1M at the same time. If you're having an issue with particular studies, you should troubleshoot those individually, because doing so won't affect the others.
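
As a rough sketch of what troubleshooting one study could look like: subset the merged table and representative sequences to that study, then run the closed-reference step on the subset alone. The metadata column name `study`, the helper name, and the output file names are assumptions for illustration, not something from this thread. The commands are wrapped in a function so nothing runs until it is called.

```shell
# Hypothetical helper: subset the merged artifacts to a single study.
# Assumes metadata.txt has a column named "study" (an illustrative name).
subset_study() {
  study="$1"
  # Keep only the samples that belong to this study.
  qiime feature-table filter-samples \
    --i-table table.qza \
    --m-metadata-file metadata.txt \
    --p-where "[study]='${study}'" \
    --o-filtered-table "table-${study}.qza" &&
  # Keep only the sequences that are still present in the filtered table.
  qiime feature-table filter-seqs \
    --i-data rep-seqs.qza \
    --i-table "table-${study}.qza" \
    --o-filtered-data "rep-seqs-${study}.qza"
}
```

Calling `subset_study StudyA` (with whatever value the metadata actually uses) should give a table and sequence set that can be passed to `qiime vsearch cluster-features-closed-reference` on their own.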

I'm a little bit confused by the tree-building step. If you're doing closed-reference OTU picking, then you should just use the tree associated with the closed-reference OTUs. You can import the phylogeny and work from there. So, this seems like a weird step in your pipeline to me. Could you explain it fully?
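
For illustration, importing a reference tree and using it downstream might look roughly like this. The file name `99_otus.tree`, the assumption that the tree is rooted Newick, the clustered table name `table-cr-99.qza`, and the output names are all placeholders to adapt to your own files. The helper is only defined here, not run.

```shell
# Hypothetical helper: import the tree that ships with the reference database
# and use it for phylogenetic diversity on the closed-reference table.
# 99_otus.tree and table-cr-99.qza are assumed names, not confirmed paths.
use_reference_tree() {
  qiime tools import \
    --input-path 99_otus.tree \
    --output-path reference-tree.qza \
    --type 'Phylogeny[Rooted]' &&
  qiime diversity core-metrics-phylogenetic \
    --i-phylogeny reference-tree.qza \
    --i-table table-cr-99.qza \
    --p-sampling-depth 1000 \
    --m-metadata-file metadata.txt \
    --output-dir core-metrics-results-cr
}
```

Because closed-reference OTU IDs are the reference sequence IDs, the tips of the reference tree should already match the features in the clustered table.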

Best,
Justine

Following the “Moving Pictures” tutorial, this is what I thought I should do:

qiime dada2 denoise-single \
  --i-demultiplexed-seqs single-end-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 200 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

I merged all the rep-seqs.qza and table.qza.

I then passed it through the closed-reference OTU picking. I created the reference-seqs.qza from SILVA:

qiime tools import \
  --input-path silva_132_99_16S.fna \
  --output-path reference-seqs.qza \
  --type 'FeatureData[Sequence]'

qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences reference-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-cr-99.qza \
  --o-clustered-sequences rep-seqs-cr-99.qza \
  --o-unmatched-sequences unmatched-cr-99.qza

However, I lost all the samples from 4 of my studies here, because none of their samples retains more than 1000 sequences.

So instead, I simply used:

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza

This avoided the closed-reference OTU picking step, allowing me to complete:

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 1000 \
  --m-metadata-file metadata.txt \
  --output-dir core-metrics-results

Hi @cbippert,

If your studies target different hypervariable regions, then current best practice is to do closed-reference OTU picking. If you’re not mixing hypervariable regions, ASVs are better, but they must be the same length. Otherwise, you really aren’t working with the same data set.

You definitely shouldn’t build the tree from a MAFFT alignment, because it’s a de novo alignment and the tree is built from it. Sequences from the same organism will separate in a MAFFT alignment based on hypervariable region. To me, fragment insertion is still questionable (although better), because then at least you’re working against a reference. The fragment insertion paper describes this. But the MAFFT tree is not a good approach unless you’re using the same primers and same-length sequences.

Your other option is to skip denoising and go straight into OTUs, which might save some sequences (I would still do quality filtering first, just maybe not denoising), and see if it helps with your count problem.
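
A minimal sketch of that route, assuming single-end demultiplexed reads in `single-end-demux.qza` and the same `reference-seqs.qza` as before: quality filter, dereplicate, then cluster against the reference. The helper name and the output file names are placeholders, and the function is only defined here, not executed.

```shell
# Hypothetical helper: closed-reference OTUs without a denoising step.
# Input and output artifact names are assumptions for illustration.
otus_without_denoising() {
  # Basic quality filtering on the demultiplexed reads.
  qiime quality-filter q-score \
    --i-demux single-end-demux.qza \
    --o-filtered-sequences demux-filtered.qza \
    --o-filter-stats demux-filter-stats.qza &&
  # Collapse identical reads into a feature table plus sequences.
  qiime vsearch dereplicate-sequences \
    --i-sequences demux-filtered.qza \
    --o-dereplicated-table table-derep.qza \
    --o-dereplicated-sequences rep-seqs-derep.qza &&
  # Cluster the dereplicated sequences against the reference.
  qiime vsearch cluster-features-closed-reference \
    --i-table table-derep.qza \
    --i-sequences rep-seqs-derep.qza \
    --i-reference-sequences reference-seqs.qza \
    --p-perc-identity 0.99 \
    --o-clustered-table table-derep-cr-99.qza \
    --o-clustered-sequences rep-seqs-derep-cr-99.qza \
    --o-unmatched-sequences unmatched-derep-cr-99.qza
}
```

Comparing per-sample counts from this table with the denoised one should show whether denoising is where the reads are being lost.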

Best,
Justine

Okay, this is what I thought.

I just find it incredible that none of the sequences from the samples match the reference sequences.

I will verify again with the individual studies using the closed OTU picking. If it doesn’t work, I might try Fragment Insertion. Another person suggested it. I was just having a hard time figuring out how to download it and combine it with QIIME 2.

Thank you for your help,
Clinton

So I passed one study that had all samples removed by the closed reference OTU picking and received the following error:

qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences reference-seqs-gg99.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-cr-99.qza \
  --o-clustered-sequences rep-seqs-cr-99.qza \
  --o-unmatched-sequences unmatched-cr-99.qza

Plugin error from vsearch:

No matches were identified to reference_sequences. This can happen if sequences are not homologous to reference_sequences, or if sequences are not in the same orientation as reference_sequences (i.e., if sequences are reverse complemented with respect to reference sequences). Sequence orientation can be adjusted with the strand parameter.

Debug info has been saved to /tmp/qiime2-q2cli-err-zhkcryhv.log

On a hunch, I investigated the other studies and found them all to be single strand studies. Is it possible that the “sequence orientation” is reversed and if I flip the sequences (from forward to reverse) then more sequences may align to the closed reference sequences?

Hi @cbippert,

I would definitely try that! It's at least worth seeing what they look like.
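
As a sketch of one way to test the orientation idea without flipping the reads by hand: the error message points at the strand parameter, and setting it to `both` asks vsearch to search both orientations. The small `revcomp` helper is only there to illustrate what "reverse complemented" means. Both function names and the `-both` output names are made up for this example.

```shell
# Illustrative helper: reverse-complement a DNA sequence.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Hypothetical helper: rerun the closed-reference step searching both strands.
# Output file names are placeholders.
cluster_both_strands() {
  qiime vsearch cluster-features-closed-reference \
    --i-table table.qza \
    --i-sequences rep-seqs.qza \
    --i-reference-sequences reference-seqs-gg99.qza \
    --p-perc-identity 0.99 \
    --p-strand both \
    --o-clustered-table table-cr-99-both.qza \
    --o-clustered-sequences rep-seqs-cr-99-both.qza \
    --o-unmatched-sequences unmatched-cr-99-both.qza
}
```

For example, `revcomp AACG` prints `CGTT`. If the reads in a study are in the opposite orientation to the reference, `cluster_both_strands` should find matches where the default search finds none.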

If you're running 2019.4 or 2019.7, fragment insertion is included as part of the base install (q2-fragment-insertion). If you're on an earlier release, it should be in the plugin library. It's been pretty easy to use with Greengenes in my experience. SILVA is a bit more complex (read: manual), but I've found the gg tree has worked pretty well for me.
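
A rough sketch of the call, assuming a 2019.4 or 2019.7 install where the plugin falls back to its bundled Greengenes reference when none is supplied. Later releases may require an explicit reference database input, so it is worth checking `qiime fragment-insertion sepp --help` first. The helper name and output names are placeholders, and the function is only defined here.

```shell
# Hypothetical helper: place representative sequences into the reference tree.
# Assumes the plugin's default (bundled) Greengenes reference is available.
insert_fragments() {
  qiime fragment-insertion sepp \
    --i-representative-sequences rep-seqs.qza \
    --o-tree insertion-tree.qza \
    --o-placements insertion-placements.qza
}
```

The resulting `insertion-tree.qza` could then stand in for the MAFFT tree as the `--i-phylogeny` input to `core-metrics-phylogenetic`.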

Best,
Justine
