DADA2 Duplicate Sample IDs

adinasarapu · May 5, 2017, 5:57pm

I have a similar problem at the DADA2 step ("Duplicate sample IDs!") of the QIIME 2 pipeline, even though my manifest file has no duplicate entries.

Start - Mon May  1 09:26:08 EDT 2017
  2 R version 3.3.1 (2016-06-21)
  3 Loading required package: Rcpp
  4 Warning messages:
  5 1: multiple methods tables found for ‘arbind’
  6 2: multiple methods tables found for ‘acbind’
  7 3: replacing previous import ‘IRanges::arbind’ by ‘SummarizedExperiment::arbind’ when loading ‘GenomicAlignments’
  8 4: replacing previous import ‘IRanges::acbind’ by ‘SummarizedExperiment::acbind’ when loading ‘GenomicAlignments’
  9 5: multiple methods tables found for ‘left’
 10 6: multiple methods tables found for ‘right’
 11 DADA2 R package version: 1.1.7
 12 1) Filtering ......................................................................................................................................................    ...................................................................................................................................................................    ...................................................................................................................................................................    ...................................................................................................................................................................    ........................................................
 13 2) Learning Error Rates
 14 2a) Forward Reads
 15 Initial error matrix unspecified. Error rates will be initialized to the maximum possible estimate from this data.
 16 Initializing error rates to maximum possible estimate.
 17 Sample 1 - 39375 reads in 7338 unique sequences.
 18 Sample 2 - 67127 reads in 11819 unique sequences.
 19 Sample 3 - 49275 reads in 17656 unique sequences.
 20 Sample 4 - 111841 reads in 18908 unique sequences.
 21 Sample 5 - 66061 reads in 13826 unique sequences.
 22 Sample 6 - 97741 reads in 20705 unique sequences.
 23 Sample 7 - 86889 reads in 14801 unique sequences.
 24 Sample 8 - 94258 reads in 17576 unique sequences.
 25 Sample 9 - 84723 reads in 19152 unique sequences.
 26 Sample 10 - 78928 reads in 11940 unique sequences.
 27 Sample 11 - 94925 reads in 19716 unique sequences.
 28 Sample 12 - 247547 reads in 53555 unique sequences.
 29    selfConsist step 2
 30    selfConsist step 3
 31    selfConsist step 4
 32    selfConsist step 5
 33 
 34 
 35 Convergence after  5  rounds.
 36 2b) Reverse Reads
 37 Initial error matrix unspecified. Error rates will be initialized to the maximum possible estimate from this data.
 38 Initializing error rates to maximum possible estimate.
 39 Sample 1 - 39375 reads in 11958 unique sequences.
 40 Sample 2 - 67127 reads in 22933 unique sequences.
 41 Sample 3 - 49275 reads in 17223 unique sequences.
 42 Sample 4 - 111841 reads in 33476 unique sequences.
 43 Sample 5 - 66061 reads in 19661 unique sequences.
Sample 6 - 97741 reads in 36454 unique sequences.
 45 Sample 7 - 86889 reads in 24370 unique sequences.
 46 Sample 8 - 94258 reads in 33442 unique sequences.
 47 Sample 9 - 84723 reads in 32011 unique sequences.
 48 Sample 10 - 78928 reads in 21581 unique sequences.
 49 Sample 11 - 94925 reads in 34709 unique sequences.
 50 Sample 12 - 247547 reads in 76285 unique sequences.
 51    selfConsist step 2
 52    selfConsist step 3
 53    selfConsist step 4
 54    selfConsist step 5
 55 
 56 
 57 Convergence after  5  rounds.
 58 
 59 3) Denoise remaining samples ......................................................................................................................................    ...................................................................................................................................................................    ...................................................................................................................................................................    ...................................................................................................................................................................    ............................................................
 60 The sequences being tabled vary in length.
 61 4) Remove chimeras (method = pooled)
 62 5) Write output
 63 Traceback (most recent call last):
 64   File "/home/adinasarapu/anaconda2/envs/qiime2-2017.4/lib/python3.5/site-packages/q2cli/commands.py", line 218, in __call__
 65     results = action(**arguments)
 66   File "<decorator-gen-241>", line 2, in denoise_paired
 67   File "/home/adinasarapu/anaconda2/envs/qiime2-2017.4/lib/python3.5/site-packages/qiime2/sdk/action.py", line 171, in callable_wrapper
 68     output_types, provenance)
 69   File "/home/adinasarapu/anaconda2/envs/qiime2-2017.4/lib/python3.5/site-packages/qiime2/sdk/action.py", line 248, in _callable_executor_
 70     output_views = callable(**view_args)
 71   File "/home/adinasarapu/anaconda2/envs/qiime2-2017.4/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 174, in denoise_paired
 72     return _denoise_helper(biom_fp, hashed_feature_ids)
 73   File "/home/adinasarapu/anaconda2/envs/qiime2-2017.4/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 77, in _denoise_helper
 74     table.update_ids(sid_map, axis='sample', inplace=True)
 75   File "/home/adinasarapu/anaconda2/envs/qiime2-2017.4/lib/python3.5/site-packages/biom_format-2.1.5-py3.5-linux-x86_64.egg/biom/table.py", line 1069, in update_id    s
 76     errcheck(result)
 77   File "/home/adinasarapu/anaconda2/envs/qiime2-2017.4/lib/python3.5/site-packages/biom_format-2.1.5-py3.5-linux-x86_64.egg/biom/err.py", line 472, in errcheck
 78     raise ret
 79 biom.exception.TableException: Duplicate sample IDs!
 80 
 81 Plugin error from dada2:
 82 
 83   Duplicate sample IDs!
 84 
 85 See above for debug info.
 86 Running external command line application(s). This may print messages to stdout and/or stderr.
 87 The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
 88 
 89 Command: run_dada_paired.R /tmp/tmp.YaC65WXhrx/tmpaz1m468_/forward /tmp/tmp.YaC65WXhrx/tmpaz1m468_/reverse /tmp/tmp.YaC65WXhrx/tmpaz1m468_/output.tsv.biom /tmp/tmp    .YaC65WXhrx/tmpaz1m468_/filt_f /tmp/tmp.YaC65WXhrx/tmpaz1m468_/filt_r 200 200 20 20 2.0 2 pooled 1.0 60 1000000

ebolyen · May 5, 2017, 6:34pm

Hi @adinasarapu,

Unfortunately there's not a lot to go on here. biom seems to think you have duplicate sample IDs, so I would double-check your manifest; dada2 doesn't particularly care one way or the other.

Could you provide the manifest file? Thanks!

adinasarapu · May 8, 2017, 8:23pm

Sorry for the delay.

My manifest file looks like this:

sample-id,absolute-filepath,direction
EIGC9-001-2,$HOME/microbiome/PTB_0816/EIGC9-001-2_S37_L001_R1_001.fastq.gz,forward
EIGC9-001-2,$HOME/microbiome/PTB_0816/EIGC9-001-2_S37_L001_R2_001.fastq.gz,reverse
EIGC9-002-2,$HOME/microbiome/PTB_0816/EIGC9-002-2_S38_L001_R1_001.fastq.gz,forward
EIGC9-002-2,$HOME/microbiome/PTB_0816/EIGC9-002-2_S38_L001_R2_001.fastq.gz,reverse
EIGC9-003-2,$HOME/microbiome/PTB_0816/EIGC9-003-2_S39_L001_R1_001.fastq.gz,forward
EIGC9-003-2,$HOME/microbiome/PTB_0816/EIGC9-003-2_S39_L001_R2_001.fastq.gz,reverse
EIGC9-004-2,$HOME/microbiome/PTB_0816/EIGC9-004-2_S40_L001_R1_001.fastq.gz,forward
EIGC9-004-2,$HOME/microbiome/PTB_0816/EIGC9-004-2_S40_L001_R2_001.fastq.gz,reverse
...
...
...

ebolyen · May 8, 2017, 8:50pm

That all looks pretty good; I don't see why those sample-ids would pose a problem for anything. Furthermore, the transformer should be validating these properties, so we should have seen an error on import if anything was really wrong.

Could you provide the exact import command you used? Alternatively if you've lost it or don't know what it was, you can use the provenance tab in q2view to figure out what your command was using your imported .qza, i.e. the input you gave to denoise-paired.

I'm still not entirely sure what's wrong, but it is possible there is something wrong with our validation. We can use a couple of bash commands to test some things about your manifest:

tail -n+2 YOUR_MANIFEST_HERE.txt | cut -f 1 -d ',' | uniq | wc -l

This will read your manifest (skipping the first line), take the first column, collapse every line that is identical to an adjacent line, and finally count the number of remaining lines. This should be exactly the number of samples you have.

The next thing we want to look at is the second column:

tail -n+2 YOUR_MANIFEST_HERE.txt | cut -f 2 -d ',' | sort | uniq | wc -l

This does the same thing, but it looks at the second column (cut -f 2) and will sort the lines so that identical lines will be adjacent, which uniq will again collapse before we count them. Since every sample has a forward and reverse read, we expect exactly twice the number of samples.

Finally the third column should be alternating between forward and reverse (this might be the problem in your case):

tail -n+2 YOUR_MANIFEST_HERE.txt | cut -f 3 -d ',' | uniq | wc -l

Same as the first command, but this time we expect every line to be different from its adjacent lines so we should see exactly twice the number of samples.

Hopefully one of these steps will uncover more information.
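
The three checks above can be bundled into a single function. This is only a minimal sketch, assuming a comma-separated manifest with a header row as shown earlier in the thread; the function name check_manifest is hypothetical and not part of QIIME 2. Unlike the first command above, it sorts the sample IDs before counting, so repeats that are not on adjacent lines are collapsed as well.

```shell
#!/bin/sh
# check_manifest: hypothetical helper (not part of QIIME 2) that summarises a
# comma-separated manifest: header row, then sample-id,absolute-filepath,direction.
check_manifest() {
    manifest=$1
    # Distinct sample IDs (column 1); sorted so repeats need not be adjacent.
    n_ids=$(tail -n +2 "$manifest" | cut -d ',' -f 1 | sort -u | wc -l | tr -d ' ')
    # Distinct file paths (column 2); expected to be twice the sample count.
    n_paths=$(tail -n +2 "$manifest" | cut -d ',' -f 2 | sort -u | wc -l | tr -d ' ')
    # sample-id/direction pairs that occur on more than one row.
    dup_pairs=$(tail -n +2 "$manifest" | cut -d ',' -f 1,3 | sort | uniq -d)

    echo "samples=$n_ids paths=$n_paths"
    if [ "$n_paths" -ne $((n_ids * 2)) ]; then
        echo "WARNING: expected $((n_ids * 2)) distinct file paths"
    fi
    if [ -n "$dup_pairs" ]; then
        echo "WARNING: repeated sample-id/direction pairs:"
        echo "$dup_pairs"
    fi
}
```

It would be called as check_manifest YOUR_MANIFEST_HERE.txt. A manifest that passes these checks can still fail later if the IDs are altered downstream, so a clean result here only rules out problems in the file itself.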

adinasarapu · May 9, 2017, 3:05pm

My manifest file looks good (checked with your bash commands); there are no duplicates. I still get the error described previously at STEP 3.

Here is my script (to run on an SGE cluster):

#!/bin/sh

echo "Start - `date`" 
#$ -N QIIME.2
#$ -q all.q
#$ -l h_rt=220:00:00
#$ -pe smp 60
#$ -cwd
#$ -j y
#$ -m abe
#$ -M xxxx@emory.edu

# module load Anaconda3/4.2.0

source activate qiime2-2017.4

PROJ_DIR=/home/xxxx/microbiome
QIIME2_DIR=${PROJ_DIR}/qiime2
OUT_DIR=${QIIME2_DIR}/qiime2_xxx

if [ ! -d ${OUT_DIR} ]; then
	mkdir -p ${OUT_DIR}
fi

export TMPDIR=${PROJ_DIR}/tmp

# create a unique folder on the local compute drive
if [ -e /bin/mktemp ]; then
	TMP_DIR=`/bin/mktemp -d -p ${TMPDIR}/` || exit
elif [ -e /usr/bin/mktemp ]; then
	TMP_DIR=`/usr/bin/mktemp -d -p ${TMPDIR}/` || exit
else
	echo "Error. Cannot find mktemp to create tmp directory"
	exit 1
fi

cp ${OUT_DIR}/xxx_manifest ${TMP_DIR}/

#########
# STEP 1 #
#########

# Imports FASTQ files into QIIME2 object
qiime tools import \
	--type SampleData[PairedEndSequencesWithQuality] \
	--input-path ${TMP_DIR}/xxx_manifest \
	--output-path ${TMP_DIR}/paired-end-xxx.qza \
	--source-format PairedEndFastqManifestPhred33

cp ${TMP_DIR}/paired-end-xxx.qza ${OUT_DIR}/paired-end-xxx.qza 

#########
# STEP 2 #
#########

# Plot positional qualities
qiime dada2 plot-qualities \
	--i-demultiplexed-seqs ${TMP_DIR}/paired-end-xxx.qza \
	--o-visualization ${TMP_DIR}/xxx-qualities.qzv \
	--p-n 10

cp ${TMP_DIR}/xxx-qualities.qzv ${OUT_DIR}/xxx-qualities.qzv

/bin/rm ${TMP_DIR}/xxx-qualities.qzv

#########
# STEP 3 #
#########

qiime dada2 denoise-paired \
	--i-demultiplexed-seqs ${TMP_DIR}/paired-end-xxx.qza \
	--output-dir ${TMP_DIR}/table-dada2 \
	--p-n-threads 60 \
	--p-chimera-method pooled \
	--p-trim-left-f 20 \
	--p-trim-left-r 20 \
	--p-trunc-len-f 200 \
	--p-trunc-len-r 200 \
	--verbose

rsync -av ${TMP_DIR}/ ${OUT_DIR}

/bin/rm -rf ${TMP_DIR}

source deactivate qiime2-2017.4

# module unload Anaconda3/4.2.0
echo "Finish - `date`"

ebolyen · May 9, 2017, 5:29pm

That's a lovely script @adinasarapu, thanks for sharing!

I think we've found a bug in QIIME 2, but I would like to confirm a couple things. Do all of your sample-IDs use dashes (-), or do any of them use underscores (_)? If one of the sample-IDs had an underscore in the name, I think our code will take the first chunk it sees, and based on your naming scheme, we'll end up without the latter half of the ID that makes it unique. It looks like we already have an issue on this, which we need to fix!
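
To illustrate the suspected failure mode, here is a minimal sketch. It assumes the sample ID is recovered by keeping everything before the first underscore, which is a guess at the behaviour described above and not the actual q2-dada2 code. The function name find_truncation_collisions is hypothetical, and a comma-separated manifest with a header row is assumed. It prints each truncated ID that two or more distinct manifest sample IDs would collapse to.

```shell
#!/bin/sh
# find_truncation_collisions: hypothetical helper.
# ASSUMPTION: the ID is cut at the first underscore; the real q2-dada2
# logic may differ.
find_truncation_collisions() {
    manifest=$1
    tail -n +2 "$manifest" |   # skip the header row
        cut -d ',' -f 1 |      # sample-id column
        sort -u |              # one line per distinct sample ID
        cut -d '_' -f 1 |      # keep only the chunk before the first "_"
        sort |
        uniq -d                # truncated IDs shared by two or more samples
}
```

An empty result means that, under this assumption, no two sample IDs share the same prefix. Any line that is printed names a prefix that would appear more than once after truncation, which is the kind of collision biom would report as a duplicate.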

If none of your sample-ids use an underscore then I'll probably need to create a "debug" build of q2-dada2 for you to install, but I'd like to confirm the IDs before going through that.

Thanks so much for your assistance and patience with this!

adinasarapu · May 9, 2017, 6:19pm

Some of my sample IDs have "_".

MZW20791-1171.8.1_E0586-1-RecM1 .../MZW20791-1171.8.1_E0586-1-RecM1_S76_L001_R1_001.fastq.gz forward
MZW20791-1171.8.1_E0586-1-RecM1 .../PTB_0816/MZW20791-1171.8.1_E0586-1-RecM1_S76_L001_R2_001.fastq.gz reverse

I will replace "_" with "-".
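
One way to do the renaming is sketched below, assuming a comma-separated manifest with a header row like the one posted earlier; the function name rename_sample_ids is hypothetical. It rewrites only the first column, so the file paths, which legitimately contain underscores (for example in _S76_L001_R1_001), are left untouched.

```shell
#!/bin/sh
# rename_sample_ids: hypothetical helper that replaces "_" with "-" in the
# sample-id column only and prints the new manifest to stdout.
# ASSUMPTION: comma-separated manifest with a header row.
rename_sample_ids() {
    awk 'BEGIN { FS = OFS = "," }
         NR > 1 { gsub(/_/, "-", $1) }   # leave the header row as it is
         { print }' "$1"
}
```

The output can be redirected to a new file and imported again. It is worth checking afterwards that the renamed IDs are still unique, since two IDs that differ only in "_" versus "-" would become identical.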

ebolyen · May 9, 2017, 8:17pm

Perfect!

I've just submitted a pull request which should fix this issue. So ideally this won't be a problem in the next release!

adinasarapu · May 13, 2017, 10:05pm

It worked, after sample name correction!
