Taxonomic databases for sidle (Ampliseq pipeline, multiregion sidle option)

Clasicac · September 30, 2024, 1:51pm

Hi,
I am trying to use the (nextflow) ampliseq pipeline (https://github.com/nf-core/ampliseq) with the multi-region option, which uses sidle. For the taxonomy step, sidle can use 2 ready-made databases (silva or GreenGenes2). Unfortunately, those databases cover only 16S. I am working with plant ITS, plant rbcL, bird 12S and mammal 12S samples.

I would like to know if there are other databases available for sidle that would fit my samples (ITS, rbcL or 12S)?
If none exist, could you advise me on how to make a custom database that works with sidle?

Thank you in advance for your help!

colinbrislawn · September 30, 2024, 3:35pm

For ITS I've used the Unite-Database.

Unite is also supported by RESCRIPt, which can be helpful in building databases.

How many different amplicon regions did you target in total? ITS, rbcl, 12S for 3 total?

P.S. Welcome to the forums!

Clasicac · October 3, 2024, 12:53pm

Thank you very much for your kind and useful answer !

On your advice, I used Unite with my ampliseq pipeline. Unfortunately, the taxonomy is wrong (I got Cercozoa sp. instead of plants). I BLASTed the ASV sequences on NCBI and they match my plants. So I think there may be a problem with the Unite database I used. I will check this out.

Currently, I have 5 different amplicons: ITS and rbcL, which are amplified separately (simplex PCR), and 12S + 2 16S, which are amplified together (multiplex PCR). I guess that I will need to create 3 different databases (one ITS, one rbcL and one 12S + 16S + 16S). I'll take a closer look at how RESCRIPt works and keep you posted on my progress!

hindrek · November 14, 2024, 9:18am

Hi @Clasicac

I am also interested in running nf-core/ampliseq for taxonomic assignments with multi-region ITS amplicons using SIDLE. For a custom database, nf-core/ampliseq (v2.11.0) requires three input files: fasta, aligned fasta, and taxonomy. The UNITE database 10.0 (QIIME release) includes fasta and taxonomy files, but no aligned fasta. How did you manage to create the missing aligned fasta?

Best
Hindrek

jwdebelius · November 14, 2024, 3:05pm

Hi @hindrek,

The aligned file is required for tree construction. AFAIK, there isn't an insertion backbone for UNITE right now. So, you should be able to get away with just the sequence and taxonomy, with the clear caveat of no phylogeny.

IIRC phylogeny is difficult for fungi anyway?

Best,
Justine

hindrek · November 22, 2024, 10:18am

Hi @jwdebelius

Thank you for the clarification! I can't comment on the phylogeny of fungi, as I have no prior experience with fungi.

I modified the nf-core/ampliseq pipeline to bypass the requirement for an aligned file. I am targeting the ITS1 and ITS2 regions. Unfortunately, the feature classifier for the ITS2 region returned 'No matches found':

Caused by:
  Process `NFCORE_AMPLISEQ:AMPLISEQ:SIDLE_WF:SIDLE_DBEXTRACT (ITS2,100)` terminated with an error exit status (1)


Command executed:

  # https://q2-sidle.readthedocs.io/en/latest/database_preparation.html#prepare-a-regional-database-for-each-primer-set
  export XDG_CONFIG_HOME="./xdgconfig"
  export MPLCONFIGDIR="./mplconfigdir"
  export NUMBA_CACHE_DIR="./numbacache"
  
  #extract sequences
  qiime feature-classifier extract-reads \
      --p-n-jobs 6 \
      --i-sequences db_filtered_sequences.qza \
      --p-identity 2 \
      --p-f-primer AACTTTYRRCAAYGGATCWCT \
      --p-r-primer AGCCTCCGCTTATTGATATGCTTAART \
      --o-reads db_ITS2.qza
  
  #prepare to be used in alignment
  qiime sidle prepare-extracted-region \
      --p-n-workers 6 \
      --i-sequences db_ITS2.qza \
      --p-region "ITS2" \
      --p-fwd-primer AACTTTYRRCAAYGGATCWCT \
      --p-rev-primer AGCCTCCGCTTATTGATATGCTTAART \
      --p-trim-length 100 \
      --o-collapsed-kmers db_ITS2_100_kmers.qza \
      --o-kmer-map db_ITS2_100_map.qza
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_AMPLISEQ:AMPLISEQ:SIDLE_WF:SIDLE_DBEXTRACT":
      qiime2: $( qiime --version | sed '1!d;s/.* //' )
      qiime2 plugin sidle: $( qiime sidle --version | sed 's/ (.*//' | sed 's/.*version //' )
      q2-sidle: $( qiime sidle --version | sed 's/.*version //' | sed 's/)//' )
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
  Plugin error from feature-classifier:
  
    No matches found
  
  Debug info has been saved to /tmp/qiime2-q2cli-err-hpf5_hoc.log

Analyzing the ITS1 and ITS2 regions independently using the nf-core/ampliseq pipeline (configured with the default single-region setup) works fine.

Best
Hindrek

jwdebelius · November 22, 2024, 4:45pm

Hi @hindrek,

So, I will caveat this with the fact that I've not run with nextflow, and tend to run sidle locally. So, if it's a nextflow issue, we may need to see that.

Would it be possible to share that full output log file so we can check it?

I'm not sure why the command would fail serially but not in the nextflow workflow. I have some other (potentially stupid) ideas that need more testing but wouldn't be implemented in nf-core.

Best,
Justine

hindrek · November 28, 2024, 7:36am

Hi @jwdebelius

The full log file output:

Traceback (most recent call last):
  File "/opt/conda/envs/sidle-0.1.0-beta/lib/python3.8/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "<decorator-gen-119>", line 2, in extract_reads
  File "/opt/conda/envs/sidle-0.1.0-beta/lib/python3.8/site-packages/qiime2/sdk/action.py", line 244, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/opt/conda/envs/sidle-0.1.0-beta/lib/python3.8/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/opt/conda/envs/sidle-0.1.0-beta/lib/python3.8/site-packages/q2_feature_classifier/_cutter.py", line 215, in extract_reads
    raise RuntimeError("No matches found")
RuntimeError: No matches found

Best
Hindrek

Nicholas_Bokulich · November 29, 2024, 9:15am

Hi @hindrek ,

The error message seems clear enough: you do not have any sequences that contain both primers.

You should spot-check a few to make sure... issues with orientation etc. could always be involved (though I think this action checks both orientations).

More likely issue: UNITE has a "developer" version (with untrimmed seqs) and a regular version (trimmed to the ITS domain). The primers sit outside of the ITS domain proper, in the conserved SSU or 5.8S or LSU domains. You are probably using the regular version, hence trimmed and hence no hits.
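
For the spot-check suggested above, here is a minimal Python sketch, not part of QIIME 2 or sidle. It expands the degenerate primers from the failing command into regular expressions and tests whether a sequence contains the forward primer followed by the reverse complement of the reverse primer, in either orientation. The helper names (`primer_regex`, `reverse_complement`, `has_both_primers`) are made up for illustration, and it only looks for exact matches to the degenerate primers, so it is stricter than `extract-reads`, which tolerates mismatches through `--p-identity`.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTURYSWKMBDHVN", "TGCAAYRSWMKVHDBN")


def primer_regex(primer):
    """Compile a (possibly degenerate) primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def reverse_complement(seq):
    """Reverse complement of a DNA string, keeping IUPAC codes consistent."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_both_primers(seq, fwd_primer, rev_primer):
    """True if fwd_primer ... revcomp(rev_primer) occurs in seq,
    on either the given strand or its reverse complement."""
    seq = seq.upper().replace("U", "T")
    fwd_re = primer_regex(fwd_primer)
    rev_re = primer_regex(reverse_complement(rev_primer))
    for strand in (seq, reverse_complement(seq)):
        fwd_hit = fwd_re.search(strand)
        # Only accept a reverse-primer site downstream of the forward one.
        if fwd_hit and rev_re.search(strand, fwd_hit.end()):
            return True
    return False


# Primers copied from the failing SIDLE_DBEXTRACT command above.
ITS2_FWD = "AACTTTYRRCAAYGGATCWCT"
ITS2_REV = "AGCCTCCGCTTATTGATATGCTTAART"

if __name__ == "__main__":
    # Toy sequence (made up): one concrete forward-primer instance,
    # a short insert, then the reverse complement of a reverse-primer instance.
    toy = ("AACTTTCAACAACGGATCTCT" + "ACGTACGT"
           + reverse_complement("AGCCTCCGCTTATTGATATGCTTAAAT"))
    print("toy sequence has both primers:",
          has_both_primers(toy, ITS2_FWD, ITS2_REV))
```

Running `has_both_primers` over a handful of records from the UNITE FASTA in use (and, for comparison, from the "developer" release) should show quickly whether the primer sites are present at all. If almost none of the regular-release records pass while developer-release records do, that points to the trimming explanation rather than an orientation problem.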

Good luck!