NaMeco - pipeline for nanopore reads

Dear all,

I am glad to share the pipeline for long nanopre 16S rRNA reads I was working on.

The full description can be found on the GitHub page and in the paper.

What it can do:

So, NaMeco will preprocess the reads, count kmers and then perform clustering with UMAP + HDBscan sample by sample. After that, from each sample cluster, representatives will be randomly selected for additional clustering between samples to cluster clusters. New clusters, this time already "shared" between samples, will be polished with a combination of SPOA and Racon. Taxonomy will be assigned based on the GTDB database.

What you will get as an output:

  • cluster_counts.tsv - tab-separated table, shared cluster IDs and absolute counts across all samples.
  • rep_seqs.fasta - representative sequences for each cluster (SPOA + Racon)
  • Taxonomy.tsv - tab-separated table, cluster IDs and taxonomy annotations (ranks by columns), read length and percent identity from blast.
  • Taxonomy-q2.tsv - same as above, but in Qiime2 format (all ranks pooled, separated by ";" and prefixed with "r__", where r is the first character of the rank). It can be imported to qiime2.
  • {rank}.tsv - collapsed to corresponding rank taxonomies with counts

How to install:

wget ``https://raw.githubusercontent.com/timyerg/NaMeco/main/NaMeco.yaml
conda env create --file NaMeco.yaml

How to use:

usage: nameco [-h] --inp_dir INP_DIR [--out_dir OUT_DIR] [--threads THREADS]
              [--qc] [--no-qc] [--phred PHRED] [--min_length MIN_LENGTH]
              [--max_length MAX_LENGTH] [--primer_F PRIMER_F]
              [--primer_R PRIMER_R] [--primer_PI PRIMER_PI] [--kmer KMER]
              [--cluster_size CLUSTER_SIZE] [--subsample SUBSAMPLE]
              [--select_epsilon SELECT_EPSILON] [--db_type DB_TYPE]
              [--db_version DB_VERSION] [--gap GAP]
              [--min_fraction MIN_FRACTION] [--mask_taxa] [--no_masking]
              [--random_state RANDOM_STATE] [--n_polish N_POLISH]
              [--db_path DB_PATH] [--version]

required arguments:
  --inp_dir INP_DIR     Path to the folder with reads, absolute or relative.
                        Reads should be in the fastq or fq format, gziped or
                        not

optional arguments:
  --out_dir OUT_DIR     Path to the directory to store output files, absolute
                        or relative. If not provided, folder "Nameco_out" will
                        be created in working directory
  --threads THREADS     The number of threads/cpus (default 2)
  --qc                  Run chopper for quality control (default)
  --no-qc               Skip chopper for quality control
  --phred PHRED         Minimum phred score for chopper (default 10)
  --min_length MIN_LENGTH
                        Minimum read length for chopper (default 1300)
  --max_length MAX_LENGTH
                        Maximum read length for chopper (default 1700)
  --primer_F PRIMER_F   Forward primer (default AGAGTTTGATCMTGGCTCAG)
  --primer_R PRIMER_R   Reverse primer (default CGGTTACCTTGTTACGACTT)
  --primer_PI PRIMER_PI
                        Percent identity for primers (default 0.6)
  --kmer KMER           K-mer length for clustering (default 5)
  --cluster_size CLUSTER_SIZE
                        Min. unique cluster size (default 10, can't be < 10)
  --subsample SUBSAMPLE
                        Subsample clusters for consensus creation and
                        polishing (default 200)
  --select_epsilon SELECT_EPSILON
                        Selection epsilon for clusters (default 0.1)
  --db_type DB_TYPE     Use all rRNAs from GTDB ("All", higher accuracy,
                        slower) or only representative species ("SpeciesReps",
                        lower accuracy, faster) (default "All")
  --db_version DB_VERSION
                        GTDB version. Choices: "202.0", "207.0", "214.0",
                        "214.1", "220.0", "226.0" (default "226.0")
  --gap GAP             Gap between the bit score of the best hit and others,
                        that are considered with the top hit for taxonomy
                        selection (default 1)
  --min_fraction MIN_FRACTION
                        If numerous hits retained after gap filtering,
                        consensus taxon should have at least this fraction to
                        be selected. Otherwise set as lower level +
                        unclassified (default 0.6)
  --mask_taxa           Mask taxonomy ranks based on percent identity
                        thresholds (default "True"). Thresholds are: d: 65, p:
                        75, c: 78.5,o: 82, f: 86.5, g: 94.5, s: 97
  --no_masking          Skip masking taxonomy step
  --random_state RANDOM_STATE
                        Random state for subsampling (default 888)
  --n_polish N_POLISH   Number of polishing rounds (default 3)
  --db_path DB_PATH     Path to store/existing database (default
                        $out_dir/$database). Please use only databases,
                        created by previous NaMeco run to avoid errors
  --version             Check the version

Example:

conda activate NaMeco

nameco --inp_dir Samples --threads 20

How to import the output into Qiime2:

The output can be exported to qiime2 for further analyses. For example, cluster_count.tsv, rep_seqs.fasta and Taxonomy-q2.tsv can used for alpha/beta diversity analyses (including phylogenetic metrics), while taxa counts can be used for differential abundance tests.

#can be done in your qiime2 env. or in NaMeco

mkdir to_qiime2

#import feature table
biom convert \
    -i NaMeco_out/Final_output/cluster_counts.tsv \
    -o to_qiime2/table.biom \
    --table-type="OTU table" \
    --to-hdf5

qiime tools import \
    --input-path to_qiime2/table.biom \
    --type 'FeatureTable[Frequency]' \
    --input-format BIOMV210Format \
    --output-path to_qiime2/table.qza

#import taxonomy
qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-path NaMeco_out/Final_output/Taxonomy-q2.tsv \
    --output-path to_qiime2/taxonomy.qza

#import representative sequences
qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path NaMeco_out/Final_output/rep_seqs.fasta \
    --output-path to_qiime2/rep-seq.qza

If you want to import taxa counts, you can either import them directly from NaMeco output, similarly to the cluster_count.tsv, or just collapse the cluster_count table to the desired level within qiime2.

Important:

  • All samples that are compared to each other should be run together in one pool, even from different sequencing runs. Do not merge different NaMeco runs at Cluster level since Cluster IDs would not match. If needed, we recommend to merge different NaMeco runs at taxonomy level.

  • Using multiple threads can significantly speed up the NaMeco run

  • If you are facing issues with the drive space on your working drive, export tmp directory before running NaMeco: "export TMPDIR=/big_storage_path/TMP"

  • "Creating database" can take a long time, just wait until it is done. Good time to make some coffee!

Best wishes and good luck,

Timur

3 Likes