Dear all,
I am glad to share the pipeline for long nanopre 16S rRNA reads I was working on.
The full description can be found on the GitHub page and in the paper.
What it can do:
So, NaMeco will preprocess the reads, count kmers and then perform clustering with UMAP + HDBscan sample by sample. After that, from each sample cluster, representatives will be randomly selected for additional clustering between samples to cluster clusters. New clusters, this time already "shared" between samples, will be polished with a combination of SPOA and Racon. Taxonomy will be assigned based on the GTDB database.
What you will get as an output:
- cluster_counts.tsv - tab-separated table, shared cluster IDs and absolute counts across all samples.
- rep_seqs.fasta - representative sequences for each cluster (SPOA + Racon)
- Taxonomy.tsv - tab-separated table, cluster IDs and taxonomy annotations (ranks by columns), read length and percent identity from blast.
- Taxonomy-q2.tsv - same as above, but in Qiime2 format (all ranks pooled, separated by ";" and prefixed with "r__", where r is the first character of the rank). It can be imported to qiime2.
- {rank}.tsv - collapsed to corresponding rank taxonomies with counts
How to install:
wget ``https://raw.githubusercontent.com/timyerg/NaMeco/main/NaMeco.yaml
conda env create --file NaMeco.yaml
How to use:
usage: nameco [-h] --inp_dir INP_DIR [--out_dir OUT_DIR] [--threads THREADS]
[--qc] [--no-qc] [--phred PHRED] [--min_length MIN_LENGTH]
[--max_length MAX_LENGTH] [--primer_F PRIMER_F]
[--primer_R PRIMER_R] [--primer_PI PRIMER_PI] [--kmer KMER]
[--cluster_size CLUSTER_SIZE] [--subsample SUBSAMPLE]
[--select_epsilon SELECT_EPSILON] [--db_type DB_TYPE]
[--db_version DB_VERSION] [--gap GAP]
[--min_fraction MIN_FRACTION] [--mask_taxa] [--no_masking]
[--random_state RANDOM_STATE] [--n_polish N_POLISH]
[--db_path DB_PATH] [--version]
required arguments:
--inp_dir INP_DIR Path to the folder with reads, absolute or relative.
Reads should be in the fastq or fq format, gziped or
not
optional arguments:
--out_dir OUT_DIR Path to the directory to store output files, absolute
or relative. If not provided, folder "Nameco_out" will
be created in working directory
--threads THREADS The number of threads/cpus (default 2)
--qc Run chopper for quality control (default)
--no-qc Skip chopper for quality control
--phred PHRED Minimum phred score for chopper (default 10)
--min_length MIN_LENGTH
Minimum read length for chopper (default 1300)
--max_length MAX_LENGTH
Maximum read length for chopper (default 1700)
--primer_F PRIMER_F Forward primer (default AGAGTTTGATCMTGGCTCAG)
--primer_R PRIMER_R Reverse primer (default CGGTTACCTTGTTACGACTT)
--primer_PI PRIMER_PI
Percent identity for primers (default 0.6)
--kmer KMER K-mer length for clustering (default 5)
--cluster_size CLUSTER_SIZE
Min. unique cluster size (default 10, can't be < 10)
--subsample SUBSAMPLE
Subsample clusters for consensus creation and
polishing (default 200)
--select_epsilon SELECT_EPSILON
Selection epsilon for clusters (default 0.1)
--db_type DB_TYPE Use all rRNAs from GTDB ("All", higher accuracy,
slower) or only representative species ("SpeciesReps",
lower accuracy, faster) (default "All")
--db_version DB_VERSION
GTDB version. Choices: "202.0", "207.0", "214.0",
"214.1", "220.0", "226.0" (default "226.0")
--gap GAP Gap between the bit score of the best hit and others,
that are considered with the top hit for taxonomy
selection (default 1)
--min_fraction MIN_FRACTION
If numerous hits retained after gap filtering,
consensus taxon should have at least this fraction to
be selected. Otherwise set as lower level +
unclassified (default 0.6)
--mask_taxa Mask taxonomy ranks based on percent identity
thresholds (default "True"). Thresholds are: d: 65, p:
75, c: 78.5,o: 82, f: 86.5, g: 94.5, s: 97
--no_masking Skip masking taxonomy step
--random_state RANDOM_STATE
Random state for subsampling (default 888)
--n_polish N_POLISH Number of polishing rounds (default 3)
--db_path DB_PATH Path to store/existing database (default
$out_dir/$database). Please use only databases,
created by previous NaMeco run to avoid errors
--version Check the version
Example:
conda activate NaMeco
nameco --inp_dir Samples --threads 20
How to import the output into Qiime2:
The output can be exported to qiime2 for further analyses. For example, cluster_count.tsv, rep_seqs.fasta and Taxonomy-q2.tsv can used for alpha/beta diversity analyses (including phylogenetic metrics), while taxa counts can be used for differential abundance tests.
#can be done in your qiime2 env. or in NaMeco
mkdir to_qiime2
#import feature table
biom convert \
-i NaMeco_out/Final_output/cluster_counts.tsv \
-o to_qiime2/table.biom \
--table-type="OTU table" \
--to-hdf5
qiime tools import \
--input-path to_qiime2/table.biom \
--type 'FeatureTable[Frequency]' \
--input-format BIOMV210Format \
--output-path to_qiime2/table.qza
#import taxonomy
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-path NaMeco_out/Final_output/Taxonomy-q2.tsv \
--output-path to_qiime2/taxonomy.qza
#import representative sequences
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path NaMeco_out/Final_output/rep_seqs.fasta \
--output-path to_qiime2/rep-seq.qza
If you want to import taxa counts, you can either import them directly from NaMeco output, similarly to the cluster_count.tsv, or just collapse the cluster_count table to the desired level within qiime2.
Important:
-
All samples that are compared to each other should be run together in one pool, even from different sequencing runs. Do not merge different NaMeco runs at Cluster level since Cluster IDs would not match. If needed, we recommend to merge different NaMeco runs at taxonomy level.
-
Using multiple threads can significantly speed up the NaMeco run
-
If you are facing issues with the drive space on your working drive, export tmp directory before running NaMeco: "export TMPDIR=/big_storage_path/TMP"
-
"Creating database" can take a long time, just wait until it is done. Good time to make some coffee!
Best wishes and good luck,
Timur