Hi @LivYeh,
There doesn’t appear to be any difference between the QC'd and non-QC'd files you sent; the MD5 checksums of the corresponding files are identical:
$ find . -name "*.gz" -exec gunzip {} \;
$ find . -name "*.fastq" -exec md5 {} \;
MD5 (./18S_insilico_mock_merged/18s-mock-EukV4YRR-staggered-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_1_L001_R1_001.fastq) = bd23b0520cfaf8e6f3cd6587826a30c5
MD5 (./18S_insilico_mock_merged/18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fastq) = cdd43f26b682158ddd42e39f222649e8
MD5 (./18S_insilico_mock_merged_QCd/18s-mock-EukV4YRR-staggered-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_1_L001_R1_001.fastq) = bd23b0520cfaf8e6f3cd6587826a30c5
MD5 (./18S_insilico_mock_merged_QCd/18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fastq) = cdd43f26b682158ddd42e39f222649e8
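If it is useful to repeat this check without reading checksums by eye, here is a minimal sketch that compares same-named FASTQ files in two directories byte for byte. The function name `compare_dirs` is hypothetical, and `cmp` is used instead of a checksum tool so the sketch does not depend on `md5` (BSD/macOS) versus `md5sum` (GNU) being installed.

```shell
# Hypothetical helper: compare same-named *.fastq files in two directories.
# cmp -s is silent and exits 0 only when both files are byte-identical.
compare_dirs() {
    dir_a=$1
    dir_b=$2
    for f in "$dir_a"/*.fastq; do
        name=$(basename "$f")
        if cmp -s "$f" "$dir_b/$name"; then
            echo "identical: $name"
        else
            # Non-zero exit: contents differ, or the file is absent in dir_b.
            echo "differs or missing: $name"
        fi
    done
}
```

It could be called as `compare_dirs 18S_insilico_mock_merged 18S_insilico_mock_merged_QCd` after decompressing the files.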
The number of unique reads in the even sample (10) seems to make sense, and those reads remain unique over the first 100 nucleotides as well (the FASTA file used here comes from the conversion step shown further down):
$ grep "^[ATGC]" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta | sort - | uniq | wc -l
10
$ grep "^[ATGC]" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta | sort - | cut -c 1-100 | uniq | wc -l
10
$ grep "^[ATGC]" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta | sort - | cut -c 1-100 | uniq -c
10 AGCTCCAAGAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTATAACGCGCCTGGCCCGCTTTTGTGAGTGCCGGTGCGCGT
10 AGCTCCAATAGCGTATACTAACGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGTTAGGGTGAGGCGGCCGGCCACTCGTGGTTGTAGCTTGTTAT
10 AGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGGAGGGTGCCGCCGTCCGGCGTGTCCGTGTGCAGTGGCGCCC
10 AGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTGGGAGTGATCGGTCCTTCACTTAGTGTTGGAACCTGATTG
10 AGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAGAACGCTCGTAGTCGGATTTCGGGGCGGTCCGACCGGTCTGCCGATGGGTATGCACTGGTCGGAG
10 AGCTCCAATAGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGTAGTTGGATTTCTGCCGAGGACGACCGGTCCGCCCTCTGGGTGTGTATCTGGCTCG
10 AGCTCCAATAGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGTAGTTGGATTTCTGCTGAAGCAAACCGGTCCGCCCTCTGGGTGAGCATCTGGTTTT
10 AGCTCCAATAGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGTAGTTGGATTTCTGTCGGTGAGTGAAAGTCCGCTCTCAGTGGTTGGTACTTTTCAC
10 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGGATCTCAACAGCCTTGAAGCGGTTAACTTTGTAGTTTTACTGCTTTATAG
10 AGCTCCAATAGCGTATGTTAAAGTTGTTGCGGCTAAAAAGCTCGTAGTTGGATTTCTAGATGAGAGGAGTGGTCCACTCTATGAGTGTGTATCTACTTCC
I then converted the input file to FASTA (as the PHRED type detection requires additional FASTQ record information), and ran the even sample through Deblur. Note that I trimmed to 100 nt because the assessment above showed that the reads are unique over their first 100 nt.
$ grep -A 1 "^@" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fastq | grep -v "\-\-" | tr "@" ">" > 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta
$ deblur workflow --seqs-fp 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta --output-dir test -t 100
I observe 10 unique Deblur OTUs, which is what I’d expect given the input data.
$ grep -c "^>" test/all.seqs.fa
10
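One caveat on the FASTQ-to-FASTA conversion used above: matching `^@` with grep can also catch quality lines, since `@` is a valid quality character. It was fine for these files, but a position-based conversion is a little safer. This is a minimal sketch, assuming strict four-line FASTQ records (no wrapped sequences); the function name `fastq_to_fasta` is hypothetical.

```shell
# Hypothetical helper: convert four-line-per-record FASTQ to FASTA.
# Line 1 of each record is the header ("@id ..."), line 2 is the sequence;
# lines 3 ("+") and 4 (qualities) are dropped.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }' "$1"
}
```

Usage would be `fastq_to_fasta reads.fastq > reads.fasta`, where the file names are placeholders.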
I then reran Deblur using the trim length you had used (380), and observed 9 sequences from Deblur:
$ deblur workflow --seqs-fp 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta --output-dir test-full -t 380
$ grep -c "^>" test-full/all.seqs.fa
9
On closer inspection, not all of the in silico reads are the same length. A trim length of 380 would omit the 10 reads that are only 373 nt long:
$ for i in $(grep "^[ATGC]" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta);
do
echo ${#i};
done | sort | uniq -c
10 373
40 381
20 382
20 383
10 384
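The same length tally can be produced without the shell loop. This is a minimal sketch, assuming a FASTA file with one sequence per line; the function name `length_histogram` is hypothetical.

```shell
# Hypothetical helper: count how many sequences there are of each length.
# Header lines (starting with ">") are skipped; every other line is
# treated as one complete sequence.
length_histogram() {
    awk '!/^>/ { print length($0) }' "$1" | sort -n | uniq -c
}
```

Each output line is a count followed by a length, in the same layout as the tally above.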
The sequences of length 373 are all identical as well:
$ for i in $(grep "^[ATGC]" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta);
do
if [[ ${#i} -eq 373 ]];
then
echo $i;
fi;
done | sort | uniq -c
10 AGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGGAGGGTGCCGCCGTCCGGCGTGTCCGTGTGCAGTGGCGCCCTTCCATCCTTCTGTTAGCGTCTCTTGGCATTCATTTGCTGGTGGCGGGCTCAGATATTTTACCTTGAGAAAATTAGAGTGTTTCAGGCAGGCTAGGCCGGAATACATTAGCATGGAATAATGGAATAGGACTACGGTCTCTTTGTTGGTTTGAGGGACTGCAGTAATGATTAATAGGGATAGTTGGGGGCATTAGTATTTAATTGTCAGAGGTGGAATTCTCAGATTTGTTAAAGACTAACTTATGCGAAAGCATTTGCCAAGGATGTTTTCA
In the stats output, there are only 9 dereplicated sequences because the sequence length filter is applied before dereplication. A trim length of 380 drops the 10 reads of length 373, which are all copies of one sequence, leaving only 9 unique sequences. This can be confirmed by rerunning Deblur with a trim length of 373:
$ deblur workflow --seqs-fp 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta --output-dir test-full-373 -t 373
$ grep -c "^>" test-full-373/all.seqs.fa
10
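To check ahead of time how many unique sequences would survive a given trim length, the filter-then-dereplicate order described above can be approximated in the shell. This is only a sketch of that ordering and not Deblur's implementation: it does no error correction, so it is only meaningful for error-free in silico reads like these. The function name `unique_after_trim` is hypothetical, and a FASTA file with one sequence per line is assumed.

```shell
# Hypothetical helper: count unique sequences left after a length filter.
# $1 = FASTA file (one sequence per line), $2 = trim length.
# Reads shorter than the trim length are dropped first, the rest are cut
# to the trim length, and only then are duplicates collapsed.
unique_after_trim() {
    awk -v t="$2" '!/^>/ && length($0) >= t { print substr($0, 1, t) }' "$1" \
        | sort -u | wc -l | tr -d ' '
}
```

Running it on the even-sample FASTA with trim lengths of 373 and 380 gives numbers that can be compared against the Deblur counts reported above.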
Best,
Daniel