Unusual loss of reads after demultiplexing: Barcodes not matching between R1 and R2

Dear Community,

I am writing this to post to evaluate if our problem can be solved with qiime2 or if its a lab- or provider problem.

We are facing the problem that our Illumina sequence data severely suffers from barcode mismatches between R1 and R2. We usually deploy a 2 x 250 HiSeq protocol targeting a 300 bp region of the 16S gene or ribosome.

In our current DADA-2 pipeline, demultiplexing occurs by checking both barcodes in R1 and R2 by looping cutadapt through the barcodes with this parameters:

cutadapt -j 30 -O 8 --no-indels -e 0 -g ^Barcode1 -G ^Barcode1 --discard-untrimmed

However, in most cases the barcodes differ between corresponding reads in R1 and R2. This step results in a read loss of somewhere between 50% and 95% of all reads. Meaning that in a 10 million read project we – at the very minimum - lose more than 5 million observations just by demultiplexing.

We checked if our barcode library has a contamination problem by testing if there are systematic errors (e.g. overrepresentation of certain barcodes, or if barcodes from neighboring box locations affect each other more frequently) but failed to see any.

The problem is so severe that the scientific interpretation drastically changes between using the merged paired end reads and using only the single reads for analysis. Interestingly there is a good agreement between using R1 and R2 alone, but both differ greatly from the merged reads. This is likely the result of the read loss and the subsequent loss of many rare reads, and thus ASVs.

Our provider insists that there is no error on their side.

My questions:

  1. Do you also experience strong mismatches between barcodes of corresponding reads in R1 and R2?

  2. Do you have any advice on resolving this problem? Can this be further evaluated or even fixed with QIIME2?

Hi @nouse1234,

Can I ask more information on the libraries? More in specific, who did the libraries, your lab or your sequencing provider?
I asking because, in my experience, dealing with read pairs in which read 1 has a different barcode than read 2 is the most common situation. This is related on the way the library prep is designed and this dual-barcode index design is the current way to squash as many samples as possible into a single lane of Illumina. If your provider did the libraries, you should have received a sheet with the demultiplexing scheme for your samples. If you did the libraries, I am sorry for starting from the basic, but is just to exclude the most easy-to-fix situation!
If read1 and read2 are definitely expected to have the same barcode (a different for each sample), than I would look at the files by opening with a text-editor to see what sequences are in read1 and read2, to investigate in what way they differ (if you have scripting language experience you could use the grep command (or PERL/Python scripts) to see how many times the Barcode1 is present in read 1 file and in read2 file), and maybe shortening their sequences to see if they are present as some kind of variations.

Could you please send few lines for your demultiplexing spreadsheet, so we can have a look? Feel free to send as private message if you prefer!

Let us know any progress, if I can thing other possible test or pitfall I will follow on!

EDIT: could you also send some info on how many samples you retrieve searching for the barcodes
in R1, R2 and on both? How many samples are suppose to be in the dataset?

Good luck
Luca

4 Likes

Dear Luca, thanks for the quick response. it is well appreciated.

We are preparing the early steps of library construction ourselves in terms of performing PCRs with the barcoded forward and reverse primers, purification and pooling the library in equimolar amounts. This pooled library is then send to the company where they perform the additional steps of adding sequencing adapters etc. We are currently waiting for the information, how exactly they do these additional steps.
We receive from the company two large fastq files for forward and reverse read, which we then demultiplex ourselves.
We have max. 200 samples in each library, 100 targeting archaea and 100 targeting bacteria. They have all 200 individual barcodes (same barcode for forward and reverse primer), but archaea and bacteria amplicons also differ in the used primers.
We looked at the raw files by hand and also systematically.
Here () you can find a sheet with information from demultiplexing on the example of one library:
This includes the counts of the individual barcodes which were found in Read1 and Read2 and how many actually passed the requirement if having this same barcode on both reads of the pair ("Pairs written") and their percentage.

We also demultiplexed only either by Read1 or by Read2 barcode and then checked which barcode sequence was on the paired read. You can find the results in two assorted sheets. We observed that the matching barcode was the most frequent one, but the sum of all other barcodes together was four times as much.

We also tested demultiplexing with only 7 bp forced overlap of the barcode, but this resulted in no difference. Some barcodes only differ in 2 bp, therefore, we did not reduce the overlap to 6 bp.
We also used the same code to demultiplex older sequencing libraries of us, which were prepared in the same way but sequenced with a different company. But for these, the paired end demultiplexing worked very well and 60-90% of reads with the specific barcode also had this barcode in both reads of the pair (link to other document). Therefore, we think it is less likely that the phenomenon is due failure to our part of the library preparation. We have multiple affected libraries, and they were prepared by multiple people and with different stock solutions, but all have the same problem.

1 Like

Dear @nouse1234,
thanks for your extended description, I see why you expecting each sample being identified by a specific barcode!
We had an internal chat among moderators, @colinbrislawn suggested another fairly easy point of failure,
are order of each read pair the same between read1 and read2 file? Cutadapt and most of the demultiplexing tool expect the same order in the two files? Could you just check for this?
Did you pre-processed the sequence before demultiplexing them?
Do you know if sequencing adapter have been removed? either by you or the facilities?
The command you used expect the barcode to be at the very beginning of the sequence, is it possible that it is not found because there are other sequences before it?
Let us know what the facility answer!

If I collect any other though I will let you know.
Luca

1 Like

Hi Luca,
thanks for your super fast response!
The order in both files is the same. We could for example use bbmerge to merge every read pair in order, regardless of barcode identity, and get about 80 - 90% merging efficiency at zero mismatches allowed (but it is also a low complexity environment).

We did not do any pre-processing before demultiplexing. It is the first step in our pipeline.
The removal of sequencing adapters is done by the facility, not us.
The way we construct our amplicons is barcode (8 bases) -> pcr primer. The reads we checked manually had the barcode at the beginning and this is also what we expect from the data we receive.

But we could check if it is further along in the sequence. Could you point me to a modification of the command which allows me to search for it in a reasonable section of the beginning of the sequence?

Thank you for your support!

PS: I just got notified that with current Illumina devices the barcodes need at least three mismatching bases. Is that true? the minimum distance in our barcode set is two bases.

Hi @nouse1234,

To not force the barcode to be at the beginning of your sequence you just remove the
'^' sign form the barcode.
I am just looking again your command, why you are using the '-O 8' option?
My understanding is that cutadapt expect to find an unknown sequences of '5 bp' before the barcode (where 3 is the default minimum overlap with your barcode sequence), could you do a test omitting this option?
Luca

1 Like

We are using the -O 8 to find the exact 8 bases of the barcode, anchored to the beginning of the sequence (as coded with ^). We also used -O 7, but to no effect.
How can i tell cutadapt to search for the barcode within the first 10 to 15 positions of the sequence?

After looking further into the literature, our problem is known as "index" or "barcode hopping", and depending on the platform, up to 2% of mismatched barcodes can be expect, according to Illumina.
However, we have up to 96%, and on average ~75% of index hopping.
After checking older HiSeq libraries done with another sequencing company, we also experienced some hopping, but much less (~15% max).
At this point, i cannot deduce if the problem is in our pipeline or within the data generation.

Hi @nouse1234,

Did you try to remove the anchoring options?
Or using '-b' and '-B' instead '-g' and '-G'? Which should look for adapter anywhere in the sequences according to the docs.

I suppose we can try to exclude any problem in the pipeline and see what is left!
I can not totally exclude particularly bad cases of barcode-swapping, but in my experience I never seen
such a extreme case, I would tend to exclude until it is the last on the list.
However, another thing which can affect library production is the quality of the barcode-synthesis,
did you order the barcodes form a trusted provider and as purified sequences?
I such a case I would expect more 'unfinished' adapter sequences that the situation what you are seeing but just worth to ask!
Luca