Missing samples (barcode found in the indexing read fastq file) after demultiplexing

Dear Qiime2 developers,

This is a 16S MiSeq run, 300 bp paired-end. I have done this analysis before, but I am only having problems with this run and cannot figure out why.

I followed the tutorial to import the I1 and R1 FASTQ files with:

qiime tools import   \
    --type EMPSingleEndSequences  \
    --input-path emp-single-end-sequences  \
    --output-path emp-single-end-sequences.qza

The metadata mapping file CIMP_RO1.keemei.tsv (21.8 KB) was validated with Keemei. I then ran the following commands (I tried both with and without reverse-complementing the barcodes):

qiime demux emp-single \
    --i-seqs emp-single-end-sequences.qza \
    --m-barcodes-file CIMP_RO1.keemei.tsv \
    --m-barcodes-column BarcodeSequence  \
    --o-per-sample-sequences demux.qza \
    --p-rev-comp-barcodes
qiime demux summarize \
    --i-data demux.qza \
    --o-visualization demux.qzv

demux.qzv (286.1 KB)

As we can tell from demux.qzv, quite a few samples in the metadata mapping file are missing. All of them are from the bottom of the mapping file (fact: the missing ones use newly ordered barcodes; all of the barcodes, old and new, are taken from the EMP protocol). BTW, I also demultiplexed in QIIME 1 and got exactly the same output as from QIIME 2.

We thought it may be caused by the new barcodes: perhaps the R1/R2 reads associated with the new barcodes have low quality and did not pass a filter during demultiplexing(?). So I checked the index read FASTQ and extracted the sequence IDs of the reads carrying a barcode from the missing samples. I found a significant number of reads (~200,000) for these barcodes, so I pulled the corresponding reads out of R1 and R2. I ran those R1/R2 reads through FastQC and they turned out to be of very good quality, like the other reads.
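For readers who want to repeat this kind of check, the extraction described above could be sketched roughly as follows. This is only an illustration, not necessarily what was run here: the function names `ids_with_barcode` and `extract_reads` are hypothetical, the files are assumed to be gzipped four-line FASTQ, and the barcode is assumed to be the entire index read.

```shell
# Print the IDs of reads whose index sequence equals the barcode exactly.
# $1 = barcode, $2 = gzipped index FASTQ
ids_with_barcode() {
    gzip -dc "$2" | awk -v bc="$1" '
        NR % 4 == 1 { id = substr($1, 2) }   # header line: drop the leading "@"
        NR % 4 == 2 && $0 == bc { print id } # sequence line: exact match only
    '
}

# Print the FASTQ records whose read ID appears in the ID list.
# $1 = file with one read ID per line, $2 = gzipped FASTQ (R1 or R2)
extract_reads() {
    gzip -dc "$2" | awk '
        NR == FNR { keep[$1] = 1; next }                 # first input: the ID list
        FNR % 4 == 1 { show = (substr($1, 2) in keep) }  # decide once per record
        show                                             # print all four lines if kept
    ' "$1" -
}
```

Usage would look like `ids_with_barcode CTACAGGGTCTC I1.fastq.gz > ids.txt` followed by `extract_reads ids.txt R1.fastq.gz > subset_R1.fastq`, where the file names are placeholders.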

I am sort of confused: I can find those missing samples in the index/R1/R2 files, but they could not be demultiplexed. I did not find any relevant post on the forum either. Could you please help us see if there is a way to still salvage the data for those missing samples?

Thank you.
Cheng

Hey there @gc26762524! Would you be able to share your multiplexed seqs artifact with us? That might make diagnosis a bit easier (feel free to send a link in a private message to me).

Otherwise, I am not quite sure - it sounds like you are observing your barcode reads visibly in your index file, but for some reason they aren’t being extracted out? Since you mentioned these are MiSeq reads, have you tried using Illumina’s tools for demultiplexing? If so, did you have similar results?

Keep us posted. :t_rex: :qiime2:

Hi Matthew,

Thanks for the suggestions. I am sharing the .qza file with you via Google Drive. You may need to request permission from me to access the file. Please see the link below:
https://drive.google.com/open?id=1idDzY0WVFGIRcZ_3WOqa5Cmyvxs1UF2O

In the meantime, we will try demultiplexing on the MiSeq itself to see what happens. Thanks again for looking into the issue.

Cheng

Thanks @gc26762524 - I don’t see the file I requested - can you point me in the right direction?

Thanks!

Hi @thermokarst, thanks for working on it. I misunderstood you. I have now put the .qza files from before and after demux in the Google Drive directory. Best regards,
Cheng.

Hmm - I am seeing a handful of samples missing from all over the metadata file - 26 samples in all.

I searched in the barcodes.fastq.gz file for a few of the barcodes for the missing samples (and their reverse complement) and didn’t get a single hit (this is with manual searching…).
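A search like this can also be scripted. The following is a minimal sketch, not the exact commands used above: `revcomp` and `count_exact` are hypothetical helper names, and the count only considers sequence lines (every fourth line, starting at line 2) that equal the barcode exactly.

```shell
# Reverse-complement a DNA string (A<->T, C<->G).
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

# Count reads whose sequence line equals the barcode exactly.
# $1 = barcode, $2 = gzipped FASTQ
count_exact() {
    gzip -dc "$2" | awk -v bc="$1" 'NR % 4 == 2 && $0 == bc { n++ } END { print n + 0 }'
}
```

For a given barcode, one could then compare `count_exact "$bc" barcodes.fastq.gz` with `count_exact "$(revcomp "$bc")" barcodes.fastq.gz` to see which orientation, if any, is present.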

Any chance this is a clerical issue?

Currently, EMP-based demuxing doesn’t support any form of error correction - if your reads still had the barcodes in the read, you could use q2-cutadapt to demux. Otherwise, you could use an external tool to demux, then import the demuxed reads into QIIME 2.

Keep us posted! :qiime2: :t_rex:

Hi @thermokarst and @gc26762524. First post, so I apologize if I'm doing anything wrong here. I believe I am having the same problem. To summarize, I have Illumina paired-end data, multiplexed in 3 files (R1, R2, R3 (index)). When demultiplexing with demux emp-paired, I lose a lot of my samples (21 or so). Interestingly, when I grep and manually search the index file, these missing samples have TONS of hits in the index fastq.

My question is: why would there be tons of barcodes in the barcodes.fastq (R3) file for a sample that does not get demultiplexed (i.e., I lose it)? I would love to get those samples back if I could haha. Here's my command:
qiime demux emp-paired \
    --m-barcodes-file qiime2_sample_mani.txt \
    --m-barcodes-column BarcodeSequence \
    --i-seqs emp-paired-end-sequences.qza \
    --o-per-sample-sequences demux

Here are all kinds of files! If you're part of the support group I'll gladly grant access to them: https://drive.google.com/drive/folders/1FqypLtTgpVpvhK-rigPK6g6StrfDZvYy?usp=sharing

PS. I got the same results using Illumina's demultiplexer, QIIME 1 and QIIME 2. I'll put the QIIME 1 split_library_log.txt up there too; it shows almost 7 million unassigned reads.

I figured out my problem: my case was actually a stupid mistake. Thanks to @thermokarst's help, I realized the files I was using were wrong. Story: I renamed the wrong files (from a previous run) into the emp-single-end-sequences directory (barcodes.fastq.gz & sequences.fastq.gz) that was used to generate emp-single-end-sequences.qza, while I was checking the correct index file for the barcodes. Hence I found the barcodes (of course), but used the wrong emp-single-end-sequences.qza file in the QIIME 2 pipeline. I don't know if that is your case, but I hope this helps.
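A mix-up like this can be caught before importing by comparing read IDs between files. The sketch below is an assumption-laden illustration rather than part of the QIIME 2 workflow: the helper names are hypothetical, and it relies on Illumina-style headers in which the read ID (the header up to the first space) is shared by the index read and the R1/R2 reads of the same cluster, with all files in the original, unfiltered order.

```shell
# Print the read ID of the first record in a gzipped FASTQ
# (header up to the first space, without the leading "@").
first_read_id() {
    gzip -dc "$1" | awk 'NR == 1 { print substr($1, 2); exit }'
}

# Succeed only if two gzipped FASTQ files start with the same read ID.
same_first_read() {
    [ "$(first_read_id "$1")" = "$(first_read_id "$2")" ]
}
```

For example, `same_first_read emp-single-end-sequences/barcodes.fastq.gz original_I1.fastq.gz || echo "files differ"` would flag a barcodes file copied from another run (`original_I1.fastq.gz` is a placeholder name).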

Cheng

Thanks for sharing your data, @aoliver2! I whipped up a quick check, to give us an idea of barcode counts (note, ag is just like grep, only a lot faster):

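# For each barcode in column 2 of the metadata file, count matching lines in barcodes.fastq
while read -r p; do
        q=$(echo "$p" | tr -d '[:space:]')              # strip stray whitespace
        count=$(ag --no-color -c "$q" barcodes.fastq)   # left empty when nothing matches
        echo "$q" "$count"
done < <(awk -F '\t' '{print $2}' barcodes.tsv)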

And the results:

barcode      count
CTACAGGGTCTC 200349
CTTGGAGGCTTA 118245
TATCATATTACG 71236
CTATATTATCCG 94224
ACCGAACAATCC 4
ACGGTACCCTAC 62077
TGAGTCATTGAG 6
ACCTACTTGTCT 3
ACTGTGACGTCC 15903
CTCTGAGGTAAC 26708
CATGTCTTCCAT
AACAGTAAACAA
AGTAAAGATCGT 64519
TTGCTGGACGCT 76382
TTGCGGACCCTA 56385
CGGTATAGCAAT 59673
TATGGTACCCAG 58594
ACGTGAGGAACG 63213
TAGTTCGGTGAC 2
TTAATGGATCGG
TCAAGTCCGCAC 24290
CACACAAAGTCA 32920
GTCAGGTGCGGC
TTGAACAAGCCA 4
GTCGTCCAAATG 83673
GACTCTGCTCAG 83018
AGCCCTGCTACA 54861
ACTCGCTCGCTG 53119
CTGTCTATACTA 87869
TAATCTCGCCGG 65106
GTTCATTAAACT 124819
GTGCCGGCCGAC 81942
CCTTGACCGATG 76283
CAAACTGCGTTG 84376
TCGAGAGTTTGC 3
CGACACGGAGAA 80149
TCCACAGGGTTC 1
GGAGAACGACAC 1
CCTACCATTGTT 22186
TCCGGCGGGCAA 34256
TAATCCATAATC 1
CCTCCGTCATGG 1248
CTGACACGAATA 68661
GCTGCCCACCTA 1
GCGTTTGCTAGC 61
TTCGATGCCGCA 40285
AGAGGGTGATCG 41265
AGCTCTAGAAAC 79025
AGATCGTGCCTA 6
AATTAATATGTA 1
CATTTCGCACTT 23119
ACATGATATTCT 40070
GCAACGAACGAG
AGATGTCCGTCA
TCGTTATTCAGT 98252
GGATACTCGCAT 81071
AATGTTCAACTT 89942
AGCAGTGCGGTG 76650
CCGGCGACAGAA 72356
CCTCACTAGCGA 1
CTAATCAGAGTG 1
CTACTCCACGAG 24066
TAAGGCATCGCT 37283
AGCGCGGCGAAT
TAGCAGTTGCGT
ACTCTGTAATTA 84625
TCATGGCCTCCG 71820
CAATCATAGGTG 48563
GTTGGACGAAGG 71218
GTCACTCCGAAC 4
CGTTCTGGTGGT 372351
ATATGTTCTCAA 98506
ATGTGCTGCTCG 92404
CCGATAAAGGTT 72620
CAGGAACCAGGA 88840
GCATAAACGACT 3
ATCGTAGTGGTC 5
ACTAAAGCAAAC 1
GTCCGTCCTGGT 26152
CGAGGCGAGTCA 22350
TTCCAATACTCA 1
AACTCAATAGCG
TCAGACCAACTG 82002
CCACGAGCAGGC 82326
GCGTGCCCGGCC 75520
CAAAGGAGCCCG 63618
TGCGGCGTCAGG 7
CGCTGTGGATTA
CTTGCTCATAAT 5
ACGACAACGGGC 1
CTAGCGTGCGTT 27575
TAGTCTAAGGGT 30302
GTTTGAAACACG
ACCTCAGTCAAG
TCATTAGCGTGG 143886
CGCCGTACTTGC 102116
TAAACCTGGACA 88547
CCAACCCAGATC 77335
TTAAGTTAAGTT 11
AGCCGCGGGTCC
GGTAGTTCATAG
CGATGAATATCG
GTTCTAAGGTGA 20167
ATGACTAAGATG 22809
ACATACTGAGCA 62824
AGCCTTCGTCGC 127012
CGTATAAATGCG 114618
TGACTAATGGCC 70435
GTGGAGTCTCAT 157371
TGATGTGCTAAG 75152
TGTGCACGCCAT 96751
GGTGAGCAAGCA 72412
CTATGTATTAGT 53403
CCTAACGGTCCA 71314
TTCCTTAGTAGT 75445
TATGCCAGAGAT 62242
ATCTAGTGGCAA 3916
TCCATACCGGAA 63748
ATGCTGCAACAC 57677
CGGGACACCCGA 130975
ACCTTACACCTT 65191
GTAGTAGACCAT 72626
CCGGACAAGAAG 62961
TAAATATACCCT 66074
ACTCCCGTGTGA 52514
CTCGCCCTCGCC 45760
TACTAACGCGGT 77141
TGACAGAATCCA
TACGGATTATGG 69381
TACAGCGCATAC
AAGAACTCATGA 73102
GACTCAACCAGT 68115
GCCTCTACGTCG 60258
GCCTATGAGATC 85650
CAAGTGAAGGGA 80668
AATGCGCGTATA 66214
CTTGATTCTTGA 76338
AATCTTGCGCCG 69347
AGGATCAGGGAA 81010
TGAGACCCTACA 54956
ACTTGGTGTAAG 73340
TTACACAAAGGC 70347
ACGACGCATTTG 72488
CTTATTAAACGT 166113
GCTCGAAGATTC 65830
GAACCAGTACTC 78085
CGCACCCATACA 46359
GCATATGCACTG 3
TAGGAACTCACC 3

Hmm, I count 17 samples that are completely unobserved in this list of barcodes, plus another sizeable chunk with single-digit counts. Where are you getting 21 from?

Strange! Maybe this is a clerical issue? At least as far as the data you sent me goes, that brute-force search yielded pretty similar results to what you experienced in q2-demux, although it doesn't sound consistent with your manual searches! I think I did the same basic thing here as you did in your manual searches, and I didn't find the same results as you did. Maybe I am looking at the wrong file? Maybe your grep command did some fuzzy matching? What do you think?

Keep us posted! :t_rex: :qiime2:

Hi @thermokarst!

You are a champion, I dig that ag tool, definitely going to get that going. Thanks a ton for the transparency and help.

You are totally right and your numbers match up really nicely with our demultiplexing data.

My best guess is a horrible file conversion problem. I converted the barcodes.fq to a fasta and then grep-ed for the barcodes in that fasta. Going back to the raw fastq... I got the same answer you got. Can't explain it other than a wonky perl script for the conversion.
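A quick sanity check on a converted file is to compare record counts before trusting any search against it. This is a rough sketch with hypothetical helper names, assuming an uncompressed four-line FASTQ and a FASTA with one `>` header per record; it does not reproduce the conversion script mentioned above.

```shell
# Number of records in a FASTQ file (four lines per record).
fastq_records() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Number of records in a FASTA file (one ">" header line per record).
fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}
```

If `fastq_records barcodes.fq` and `fasta_records barcodes.fasta` disagree (the FASTA name is a placeholder), the conversion has dropped, duplicated, or mangled records, and counts taken from the FASTA should not be trusted.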

BIG THANKS AGAIN.

Best,

Andrew

Sounds like my kind of Perl script!

Glad to be of service! Keep on QIIMEin’! :qiime2: :qiime2: :qiime2:
