Problem with cutadapt demultiplexing of IonTorrent data

Stratering · July 10, 2020, 4:21pm

Hello QIIME2 team,

we are working with IonTorrent sequencing data and using the “qiime cutadapt demux-single” command to demultiplex our samples (qiime cutadapt demux-single --i-seqs multiplexed-seqs.qza --m-barcodes-file metadata-file.tsv --m-barcodes-column BarcodeSequence --p-error-rate 0 --p-minimum-length xxx --o-per-sample-sequences demux.qza --o-untrimmed-sequences untrimmed.qza). The barcodes have a variable length of 10-12 nt. QIIME 2 versions 2020.6 and 2019.10 were used on Linux (Ubuntu) and macOS (10.15.5).

On our last run we sequenced 16S rRNA gene amplicons (65 samples) and 18S rRNA gene amplicons (7 samples) on one chip. It was our first attempt at the 18S rRNA gene approach, and the quality and length of the sequencer output were not very good. To our surprise, all long, good-quality sequences belonged to bacteria, not to eukarya. At first we thought it was a problem of primer specificity and PCR conditions, but when we looked into the demultiplexed files we found that all bacterial sequences were missing the eukaryotic forward primer and instead contained our universal 16S rRNA gene forward primer. It was not located directly at the 5´ end, as the eukaryotic primer was, but a few bp (4-6 bp) further downstream.
We repeated the demultiplexing with the QIIME1 demultiplexing command “demultiplex_fasta.py” and also with the “demultiplexing” tool from Jeroen F.J. Laros. With both tools, the bacterial sequences and primer sequences disappeared.

Now we are afraid that “cutadapt” also missorted reads in the 16S rRNA gene runs we analyzed before. We demultiplexed older runs with QIIME2, QIIME1, and the tool from J. Laros. Whereas QIIME1 and the Laros tool gave comparable numbers of sequences per sample, QIIME2 always returned more sequences per sample: mostly a few hundred to a few thousand more, but sometimes a difference of > 10000 sequences. We also counted the barcode sequences in the sequencer file with “grep” before demultiplexing and found numbers close to those from QIIME1 and the Laros tool. Here are a few examples (tool: sequences per sample): QIIME1: 9026, Laros tool: 9070, QIIME2 cutadapt: 9956, grep: 8967; or QIIME1: 85044, grep: 85080, QIIME2 cutadapt: 96560; or QIIME1: 11882, Laros tool: 11857, grep: 11873, QIIME2 cutadapt: 13223.
We also checked whether the variable barcode length is the problem by adding one to two bp of the “linker” sequence to all barcodes shorter than 12 nt. Again, we found bacterial sequences among the eukaryotic sequences. Sorting out the 16S rRNA gene samples by listing only the barcodes of the 18S rRNA gene samples in the mapping file for the demultiplexing step made it much worse: sometimes we got 4 times more sequences per sample than with QIIME1, grep, etc.
What did we do wrong? Where do you think the problem is? Can we trust our “qiime cutadapt demux-single” results, at least for chips with samples from only one experiment?

Thank you

Stefan
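
The grep check described in this post can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the function name, sample IDs and barcode sequences are hypothetical, reads are assumed to be stored as four-line FASTQ records, and a read is counted only when a barcode sits exactly at its 5' start (which may differ from how grep was actually run above).

```python
from collections import Counter


def count_reads_by_prefix(fastq_lines, barcodes):
    """Count reads whose sequence starts with one of the barcodes (exact match)."""
    # Try longer barcodes first, so that a 10 nt barcode that happens to be
    # a prefix of a 12 nt barcode does not claim the longer barcode's reads.
    ordered = sorted(barcodes.items(), key=lambda item: len(item[1]), reverse=True)
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 != 1:  # the sequence is the 2nd line of each 4-line record
            continue
        sequence = line.strip().upper()
        for sample_id, barcode in ordered:
            if sequence.startswith(barcode):
                counts[sample_id] += 1
                break
        else:
            counts["unassigned"] += 1
    return counts


# Hypothetical barcodes of different lengths, followed by the "GAT" linker.
example_barcodes = {"sampleA": "AAAACCCCGG", "sampleB": "AAAACCCCGGTT"}
example_fastq = [
    "@read1", "AAAACCCCGGTTGATACGT", "+", "IIIIIIIIIIIIIIIIIII",
    "@read2", "AAAACCCCGGGATACGT", "+", "IIIIIIIIIIIIIIIII",
    "@read3", "GGGGAAAACCCCGGGAT", "+", "IIIIIIIIIIIIIIIII",
]
print(count_reads_by_prefix(example_fastq, example_barcodes))
```

Comparing such anchored counts with the per-sample counts of each tool shows whether a tool assigns only reads that begin with the barcode or also reads that match it elsewhere.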

Stratering · July 17, 2020, 1:17pm

Hello QIIME2 team,

no reply to our problem! Maybe our text was too boring or too long. Anyway, we did a few more analyses, and “qiime cutadapt demux-single” always gave more sequences per sample (up to 10000) than QIIME1 or the Laros demultiplexing tool.
We decided not to trust the cutadapt tool anymore; it seems to have a bug and not to work correctly.

Our new pipeline may be interesting for other Ion Torrent users:
1. Demultiplex the FASTQ file with the tool of J.F.J. Laros (demultiplex demux -r -m 0 -e "barcode_length" barcode.csv sequence_file.fastq)
2. Rename the files with Thunar (or a similar file manager/script able to rename in batches) to the “CasavaOneEightSingleLanePerSampleDirFmt” format (sampleID_BarcodeNr_L001_R1_001.fastq)
3. gzip the individual files (gzip -r path_to_directory)
4. Import into QIIME2 (--type 'SampleData[SequencesWithQuality]'; --input-format CasavaOneEightSingleLanePerSampleDirFmt)
5. DADA2…

Maybe it helps somebody else!

Stefan
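
Steps 2 and 3 of the pipeline above (renaming and gzipping) could also be scripted instead of done in a file manager. The following is a minimal sketch, not a tested part of that pipeline: the function name `to_casava` is hypothetical, it assumes the demultiplexed files are named `<sampleID>.fastq` (the actual output names of the demultiplexing tool may differ), and it simply uses a running number as the "BarcodeNr" field.

```python
import gzip
import shutil
from pathlib import Path


def to_casava(src_dir, dst_dir):
    """Copy every *.fastq in src_dir to dst_dir as a gzipped, Casava-style file."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, fastq in enumerate(sorted(src_dir.glob("*.fastq")), start=1):
        sample_id = fastq.stem  # assumption: the file name is the sample ID
        target = dst_dir / f"{sample_id}_{number}_L001_R1_001.fastq.gz"
        # Stream the plain-text FASTQ into a gzip-compressed copy.
        with open(fastq, "rb") as plain, gzip.open(target, "wb") as packed:
            shutil.copyfileobj(plain, packed)
        written.append(target)
    return written


# Example call (paths are placeholders):
# to_casava("demultiplexed", "casava_dir")
```

The resulting directory would then be imported as in step 4, with `qiime tools import` and `--input-path` pointing at it. Sample IDs that themselves contain underscores may be parsed differently on import, so that is worth checking on a small test set first.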

thermokarst · July 17, 2020, 2:11pm

cutadapt has extensive documentation (link); have you had time to review it? Specifically, you need to make sure you're specifying the types of adapters correctly in your sample metadata file.

I'm not too sure - while this post is quite long, it is light on specifics. Can you give us some examples of your reads (say, the first 10 lines of a file)? Examples of how you have specified your barcodes (again, first 10 lines or so)?

Sorry, we are trying our best to provide free support, and we just haven't had time to jump into this post, yet. In general though, posts that are long, not clear, or don't have specific requests for help present more of a challenge, and are more likely to go longer without a reply (I think that is just human nature, right?).

Stratering · July 17, 2020, 8:25pm

Thank you for your answer!

I will try to restate the problem very briefly. With "qiime cutadapt demux-single" we found significantly more sequences per barcode (up to 10,000 more) than with other tools (QIIME1, the demultiplexing tool (Laros), or direct counting in the sequencing file with grep). We also observed sequences from different barcodes mixed in one file, which did not happen with the other tools.

Please find attached an example of the barcodes. We have used the majority of the barcodes since 2014 (with QIIME1); the 10 bp barcodes date from our Roche 454 time, when we started high-throughput sequencing (2012). The longer ones are from IonTorrent manuals.

I also attached the first few sequences (out of 6,644,803 sequences) from a run a few months ago. The adapters were already removed by the IonTorrent sequencer software. Between the barcode and the primer sequence there is a "GAT" linker sequence.

Sequences_example.txt (3.3 KB) Barcodes.txt (1.1 KB)

Stratering · July 18, 2020, 7:32pm

Maybe the problem becomes clearer with a few screenshots of a run with 59 filter samples and 4 samples of bacterial sequences from an algae (German: Algen) culture. All tools were run with zero mismatches/errors.

cutadapt with a metadata file containing only the 4 barcodes of the algae culture samples:

cutadapt with a metadata file containing all 63 barcodes:

QIIME1 demultiplexing tool

Meta Data File with all barcodes (QIIME1)

Demultiplexing tool (Jeroen F.J. Laros)

thermokarst · July 18, 2020, 7:36pm

Thanks @Stratering, it doesn't look like you answered my question, so I will repeat here:

It looks like you are seeing a majority of your reads being assigned to just a few samples - when we see this it is usually an indication that the user hasn't specified their barcodes as per the cutadapt specification (linked/anchored/etc). Please review the docs above.

I also need to point out - q2-cutadapt is just a very thin wrapper against cutadapt - it does not introduce new behavior or functionality, generally speaking.

Stratering · July 19, 2020, 11:43am

Thank you for your answer.

Sorry, but I tried to answer your questions in my reply (07/18/2020): “The adapters were already removed by the IonTorrent sequencer software”. Only the linker sequence between barcode and primer is left: “Between barcodes and primer sequences “GAT” as a linker sequence could be found”. I also uploaded part of the barcode file and a few sequences, as you suggested.

Where should I define my barcodes? The functions/specifications you mentioned in the cutadapt documentation are not part of “https://docs.qiime2.org/2020.6/plugins/available/cutadapt/demux-single/”. In the barcode file? Should I run cutadapt independently from QIIME2?

Your assumption that the majority of our reads are being assigned to just a few samples is not true. If we run “qiime cutadapt demux-single” with all barcodes, we get a normal distribution as we know it from QIIME1, but always with up to 10000 (range 70-12000) more sequences per barcode than in the raw file or with QIIME1 or the other demultiplexing tools. The error seems to depend on the barcode sequence: it is always the same barcodes that give the highest differences (I can upload a graph showing this).

Only if we shorten the metadata file to a few barcodes, as in the algae example, do we get these huge differences.

thermokarst · July 19, 2020, 5:46pm

Thanks - but I think there is still some confusion - cutadapt refers to all "non-biological sequence" as "adapters" - if you spend some time reviewing the documentation links that I have shared you will hopefully see what I mean. The point is, your barcodes are adapters (at least in cutadapt's eyes - it is even part of the name of the tool!).

If you are referring to your sample metadata file, then yes! You might need to specify a different adapter type, though, for example "GATAYTGGGYDTAAAGNG$" (instead of "GATAYTGGGYDTAAAGNG"), but I can't tell you what cutadapt adapter type to use, since it is specific to the composition of the sequencing product.
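
For readers who want to try anchored barcodes with cutadapt outside of QIIME 2, the barcode column of a metadata file can be turned into a FASTA file of anchored adapters. In the cutadapt documentation, `^` anchors a 5' adapter and `$` anchors a 3' adapter; because the barcodes in this thread sit at the 5' end of the reads, the sketch below uses `^`. The function name, column names and example barcodes are hypothetical, and whether q2-cutadapt passes such anchoring characters through from the metadata column unchanged is not confirmed in this thread.

```python
import csv
import io


def metadata_to_anchored_fasta(tsv_text, id_column, barcode_column):
    """Build a cutadapt-style FASTA with one anchored 5' barcode per sample."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        sample_id = row[id_column]
        if sample_id.startswith("#"):  # skip comment or type-directive rows
            continue
        barcode = row[barcode_column].strip().upper()
        # "^" asks cutadapt to accept the barcode only at the very 5' start.
        records.append(f">{sample_id}\n^{barcode}\n")
    return "".join(records)


# Hypothetical two-sample metadata table.
example_tsv = (
    "sample-id\tBarcodeSequence\n"
    "algae1\tACGTACGTAC\n"
    "filter1\tTACGGATCCATG\n"
)
print(metadata_to_anchored_fasta(example_tsv, "sample-id", "BarcodeSequence"))

# The FASTA could then be used roughly as in cutadapt's demultiplexing guide
# (check the flags against the installed version):
#   cutadapt -e 0 --no-indels -g file:barcodes.fasta \
#       -o "demux-{name}.fastq" multiplexed.fastq
```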

I think your observation actually supports my hypothesis - if there are fewer adapters in the overall library, and the adapters are specified incorrectly, cutadapt has a tendency to match all reads to a few samples. By adding the other samples/barcodes back in, you make it so that there are other potential matches for cutadapt, alleviating the issue.
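
The effect described here can be illustrated with a toy model. This is a deliberately simplified sketch of how the cutadapt documentation describes regular versus anchored 5' adapters (a regular 5' adapter may match in full anywhere in the read, or partially at the read start with a default minimum overlap of 3; an anchored one must match in full at the start). It is not cutadapt's alignment code, it only handles exact matches, and all barcodes and reads are made up for the example.

```python
def match_length(read, barcode, anchored, min_overlap=3):
    """Length of the exact barcode match in the read, or 0 if there is none."""
    if read.startswith(barcode):
        return len(barcode)
    if anchored:
        return 0  # anchored: only a full match at the read start counts
    if barcode in read:
        return len(barcode)  # unanchored: full barcode anywhere in the read
    # Unanchored: a suffix of the barcode at the very start of the read.
    for k in range(len(barcode) - 1, min_overlap - 1, -1):
        if read.startswith(barcode[-k:]):
            return k
    return 0


def assign(read, barcodes, anchored):
    """Return the sample whose barcode gives the longest match, else None."""
    best_sample, best_length = None, 0
    for sample_id, barcode in barcodes.items():
        length = match_length(read, barcode, anchored)
        if length > best_length:
            best_sample, best_length = sample_id, length
    return best_sample


# Hypothetical barcodes: the last 3 nt of "algae1" equal the first 3 of "filter1".
all_barcodes = {"algae1": "ACGTACGTAC", "filter1": "TACGGATCCATG"}
subset = {"algae1": "ACGTACGTAC"}
# A read from sample "filter1": barcode + "GAT" linker + made-up insert.
filter_read = "TACGGATCCATG" + "GAT" + "TTGGCCAATTGG"

print(assign(filter_read, all_barcodes, anchored=False))  # filter1
print(assign(filter_read, subset, anchored=False))        # algae1
print(assign(filter_read, subset, anchored=True))         # None
```

With all barcodes present, the full-length match wins and the read ends up in the right sample. With only a subset of barcodes, the unanchored model falls back to a short partial match and pulls the read into the wrong sample, while the anchored model leaves it unassigned.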

Stratering · July 20, 2020, 7:32am

Thanks for your time and help.
