Large differences between Qiime2 and BaseSpace sample sequence counts

Hi,

I am new to using Qiime and am trying to work through the pipeline with some 16S V4 paired-end sequencing reads. I have got as far as demultiplexing the data.

Comparing the per-sample sequence counts produced by Qiime 2 demultiplexing with those produced by BaseSpace, I noticed that the BaseSpace counts are far higher: nearly all samples have totals well above 1000, whereas in Qiime 2 the vast majority of samples have counts below 1000.

Therefore, I was wondering what could be causing this discrepancy in the demultiplexing results. Am I doing something wrong in the Qiime 2 command line?

My sequences were produced using the EMP protocol and consist of paired-end reads with a single 12 bp barcode on the forward primer.

I am running Qiime2-2021.8, installed via Conda.

To import the files I used the following commands:
qiime tools import \
  --type EMPPairedEndSequences \
  --input-path xxxxxxxxxxx \
  --output-path xxxxxxxxx/paired-end-sequences.qza

To demultiplex I used these commands:
qiime demux emp-paired \
  --i-seqs xxxxxxxx/paired-end-sequences.qza \
  --m-barcodes-file xxxxxxxxxxxxxx/demux_metadata.txt \
  --m-barcodes-column barcode-sequence \
  --p-rev-comp-mapping-barcodes \
  --o-per-sample-sequences xxxxxxxxxxxxx/demux-full.qza \
  --o-error-correction-details xxxxxxxxxxxxx/demux-details.qza

Thanks for any light you can shed on this issue

Welcome to the QIIME 2 community, @aghudson!

I can't say specifically, since I don't know what methods are being used on Basespace.

I don't see anything obviously wrong - but one question for you (and your sequencing center) is: do you have Golay Barcodes? If so, great! If not, that might be the issue, because by default the emp-paired method in the demux plugin will perform Golay error correction. This is on by default because the original EMP primers are all Golay barcodes, but in case you didn't use EMP primers, then you might need to disable the error correction.

Keep us posted! :qiime2: :t_rex:

Hi,

Apologies for the delay in responding and thanks for the welcome!

Yes, the sequencing center used Golay barcodes, so I don't think this is the problem.

Well, as I mentioned last time, I can't really comment on what BaseSpace is doing - can you share some more details with us? Perhaps you can share a download link for your demux-details.qza output in a DM to me? Can you also share your BaseSpace report, as well as the viz from qiime demux summarize?

Hi,

I am not sure if this is a DM or not. Here is the link to the demux-details.qza output:

I need to contact the sequencing centre for the BaseSpace report, as this was not provided with the read data.

Hi @aghudson - thanks for sharing!

I think there might be an issue with your barcode sequences - there are many records in here with the barcode sequence NNNNNNNNNNNN. Have you double-checked that you have imported the correct data? As well, it looks like most of your barcodes (that aren't all Ns) are failing to be error-corrected - is the orientation correct? I see you're RCing the mapping barcodes, but is it possible that you need to RC the barcode sequences?
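If it helps, here is a rough sketch for quantifying the N problem in the raw barcode reads yourself. The function name `count_n_barcodes` is hypothetical (it is not a QIIME 2 command), and it assumes a gzipped FASTQ with exactly four lines per record, such as your barcodes.fastq.gz.

```shell
# Hypothetical helper (not part of QIIME 2): summarize N content in barcode reads.
# Assumes a gzipped FASTQ with exactly four lines per record.
count_n_barcodes() {
  gzip -dc "$1" | awk '
    NR % 4 == 2 {                    # line 2 of each record is the sequence
      total++
      if ($0 ~ /N/)    any_n++       # at least one ambiguous base
      if ($0 ~ /^N+$/) all_n++       # barcode is entirely N
    }
    END { printf "total=%d any_N=%d all_N=%d\n", total, any_n, all_n }'
}

# Usage: count_n_barcodes barcodes.fastq.gz
```

If `all_N` is a large share of `total`, that points at the delivered barcode file rather than at the demultiplexing step.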

:qiime2:

Hi @thermokarst,

I did have some problems with the Qiime data input step as I wasn't sure which data import 'type' to use (in the end I used 'EMPPairedEndSequences') and I was also unsure as to whether I needed to change my file names to 'forward.fastq.gz', 'reverse.fastq.gz', 'barcodes.fastq.gz'.

However, I checked that import was successful using 'qiime tools peek'.

I did not carry out the sequencing library preparation but was informed they use 12 bp barcodes on the forward primer (based on the EMP protocol).

What are the differences between the mapping barcodes and the barcode sequences? At the moment in the 'barcodes-file', I only have the 12 bp Golay Barcode sequence (alongside other sample experimental metadata) and this is presented 5'->3' as if you were ordering the barcodes, rather than reverse-complemented.

Apologies if this is a pretty dumb error!

Then there might be a problem with the actual sequencing product they delivered to you - those "N"s are a show-stopper for any tool, not just QIIME 2. This is because the barcode is what allows you to associate an individual sequence with a sample. N is a wildcard nucleotide, so this is saying "hey yeah, this sequence belongs to any ol' sample it wants to," which is going to be an issue.

The mapping barcodes are the ones in your sample metadata file. These say "this barcode belongs to this sample."

The barcode sequences are the ones in your barcodes.fastq.gz file. These say "this sequence has this particular barcode."
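To get a feel for which orientation the two sets of barcodes are in relative to each other, a minimal sketch like the one below counts how many barcode reads exactly match a mapping barcode as written, and how many match its reverse complement. The function name `check_orientation` is hypothetical, and it assumes you have first saved the barcode column of your metadata to a plain text file with one barcode per line. It only looks at exact matches, so it ignores anything Golay error correction would rescue.

```shell
# Hypothetical helper (not a QIIME 2 command).
# $1: text file with one mapping barcode per line (assumed to be exported from the metadata)
# $2: gzipped FASTQ of barcode reads, four lines per record
check_orientation() {
  gzip -dc "$2" | awk -v list="$1" '
    # Reverse-complement an uppercase DNA string; anything else becomes N.
    function rc(s,    i, out, c) {
      out = ""
      for (i = length(s); i >= 1; i--) {
        c = substr(s, i, 1)
        out = out (c == "A" ? "T" : c == "C" ? "G" : c == "G" ? "C" : c == "T" ? "A" : "N")
      }
      return out
    }
    BEGIN {
      # Load the mapping barcodes in both orientations.
      while ((getline bc < list) > 0) {
        if (bc != "") { asis[bc] = 1; rcset[rc(bc)] = 1 }
      }
    }
    NR % 4 == 2 {                      # sequence line of each record
      if ($0 in asis)  n_asis++
      if ($0 in rcset) n_rc++
    }
    END { printf "as_is=%d reverse_complement=%d\n", n_asis, n_rc }'
}

# Usage: check_orientation mapping_barcodes.txt barcodes.fastq.gz
```

Whichever count dominates suggests whether the reads and the mapping barcodes are already in the same orientation or whether one of them needs to be reverse-complemented.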

One thing I have thought of is that the data I am trying to demultiplex is a subset (maybe a quarter) of a much larger sequencing run. The files I imported into Qiime: 'forward.fastq.gz', 'reverse.fastq.gz', 'barcodes.fastq.gz' all contain the full complement of reads from that sequencing run, including those I am not interested in demultiplexing.

However, the metadata file I used only has barcodes for the specific subset of samples I want to use. I wonder if this is what is causing the discrepancy between the BaseSpace and Qiime demultiplexing results?

Thanks again for the help and patience

Without seeing what it is that you're comparing against on BaseSpace, we can't say. I strongly suggest you chat with your sequencing center to get help understanding the data they have provided you - we would just be making random guesses on our end. I have provided a few questions above - you can start with those to form a strong understanding of these data.
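One check that may help frame that conversation, as a rough sketch only: tally the most frequent barcode reads in the full-run file and compare them by eye against the barcodes in your metadata subset. The function name `top_barcodes` is hypothetical, and it assumes a gzipped four-line-per-record FASTQ.

```shell
# Hypothetical helper (not a QIIME 2 command): list the most frequent
# barcode reads as "count<TAB>barcode", most abundant first.
# $1: gzipped FASTQ of barcode reads; $2: how many barcodes to show (default 10)
top_barcodes() {
  gzip -dc "$1" \
    | awk 'NR % 4 == 2 { n[$0]++ } END { for (b in n) printf "%d\t%s\n", n[b], b }' \
    | sort -k1,1nr -k2,2 \
    | head -n "${2:-10}"
}

# Usage: top_barcodes barcodes.fastq.gz 20
```

Abundant barcodes that are absent from your metadata would be consistent with reads from the other samples on the run, which are expected to go unassigned when only a subset is listed.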

Keep us posted!