Okay, so I've gotten a little further but am getting some confusing results.
I used @LuSanto's method for demultiplexing dual-indexed barcodes, but when I go all the way through importing and creating a visualization, my dataset appears to be missing an enormous number of reads.
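For reference, here's roughly the step I mean by "importing and creating a visualization" (a minimal sketch using the QIIME 2 Python API; the artifact filename is a placeholder for my actual file):

```python
# Load the demultiplexed reads and build the demux summary visualization
# whose per-sample read counts the screenshots below come from.
import qiime2
from qiime2.plugins.demux.visualizers import summarize

seqs = qiime2.Artifact.load("cutadapt-demuxed.qza")  # placeholder filename
results = summarize(seqs)
results.visualization.save("cutadapt-imported.qzv")
```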
This first screenshot is from my original dataset, which was demultiplexed using the Illumina workflow.
This second screenshot is from the dataset I just built using the cutadapt method.
When running the cutadapt plugin you get output files of "untrimmed sequences," which in my case are much, much larger than the "sample-sequence" outputs. The cutadapt documentation says these files contain the sequences that couldn't be matched to any barcode. In my case it makes sense that these files are large, given how they were generated: I ran cutadapt 24 times, with each run handling only 12 samples. Each run can only associate barcodes with its own subset of the data, so each run should leave a lot of sequences unassigned. What I don't know is whether some additional data was lost along the way; the sketch below is how I'd check.
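Here's a rough accounting sketch (plain Python; all paths are placeholders for my actual layout). The idea is that if the 24 runs collectively assigned every read, then the per-sample outputs summed across all runs should equal the read count of the original multiplexed file, and the big "untrimmed" files are just each run seeing the other 23 runs' samples:

```python
# Sum the reads that each cutadapt run assigned to a sample and compare the
# total against the read count of the original multiplexed FASTQ.
import gzip
from pathlib import Path

def count_reads(fastq_gz: Path) -> int:
    """A FASTQ record is 4 lines, so reads = line count // 4."""
    with gzip.open(fastq_gz, "rt") as handle:
        return sum(1 for _ in handle) // 4

runs_dir = Path("cutadapt-runs")  # placeholder: one subdirectory per run
assigned = sum(
    count_reads(fq)
    for fq in runs_dir.glob("run-*/per-sample/*.fastq.gz")  # per-sample outputs only
)
original = count_reads(Path("multiplexed/forward.fastq.gz"))  # placeholder path

print(f"reads assigned across all 24 runs: {assigned}")
print(f"reads in the original multiplexed file: {original}")
print(f"unaccounted for: {original - assigned}")
```

If "unaccounted for" is roughly the size of the Illumina workflow's "undefined" pool, nothing extra was lost; if it's much larger, reads are going missing somewhere in the cutadapt runs.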
That said, I'm still concerned about why there is such a large discrepancy between the two datasets. My original Illumina dataset had a ton of reads that it could not associate with any barcode and pooled as "undefined"; that undefined sample had more reads than any other sample. I had originally thought this was a big issue, but now I'm not sure how to get around it. A per-sample comparison like the sketch below is how I'd pin down where the discrepancy sits.
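A minimal comparison sketch, assuming the per-sample read counts have been downloaded from each .qzv summary as CSVs with a sample-ID column followed by a single count column (the filenames are placeholders):

```python
# Line up per-sample read counts from the two demultiplexing approaches
# and rank samples by how many reads the cutadapt version is missing.
import pandas as pd

illumina = pd.read_csv("illumina-per-sample-counts.csv", index_col=0).squeeze("columns")
cutadapt = pd.read_csv("cutadapt-per-sample-counts.csv", index_col=0).squeeze("columns")

comparison = pd.DataFrame({"illumina": illumina, "cutadapt": cutadapt})
comparison["difference"] = comparison["illumina"] - comparison["cutadapt"]
print(comparison.sort_values("difference", ascending=False))
print(f"total difference: {comparison['difference'].sum()}")
```

If the loss is spread evenly across samples, that would point at a systematic setting (for example, the error rate or barcode orientation passed to cutadapt); if it's concentrated in a few samples, that would point at those particular runs or barcodes.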
If anyone has any help at all I would be extremely appreciative!
Brendan
cutadapt-imported.qzv (312.0 KB)
Illumina-demux.qzv (313.4 KB)