Combine two sets of Illumina 16S rRNA sequences made independently and build an artifact file, or any other option for analysis?

Felix_A_G · January 22, 2018, 4:18pm

Hi, I have the following situation: I have 2 sets of 16S rRNA sequences and I would like to analyze all the samples together. I only have the forward and reverse fastq files and the mapping files for every pair of raw data. Importantly, each pair of raw fastq file has sequences of more than one kind of sample; in total I have 16 pairs of raw files and 16 mapping files.

Is it possible to use QIIME 2 to create an artifact file?

I am attaching a mapping file; could you suggest some analysis options?

Mapping_file_S25.txt (530 Bytes)

thermokarst · January 22, 2018, 8:49pm

Hi @Felix_A_G!

When you say "each pair of raw fastq file has sequences of more than one kind of sample," what does that mean? Is each pair of fastq files one run? Are the same samples present in multiple fastq pairs? I have a few ideas, but I want to make sure that we are on the same page regarding the data in front of you. Can you provide a bit more detail about this set up? Thanks so much!

Felix_A_G · January 23, 2018, 4:59pm

Hi Matthew!

I have two different runs of raw paired fastq files (forward and reverse). The first run has 12 pairs of fastq files (52 samples spread across the twelve pairs), and the second run has 4 pairs of fastq files (32 samples spread across the four pairs). I want to compare the bacterial communities in the 84 samples. I have an individual mapping file for each pair of files in each run (12 + 4).

I appreciate your answers

Thank you

thermokarst · January 23, 2018, 10:56pm

Thanks @Felix_A_G! Another followup question - how did one run produce 12 pairs of reads? Sounds like it was already partially demultiplexed - how was that decision made? Are all 52 samples present in all 12 of those read pairs? This has me wondering, are they actually samples, or are there sample replicates at play here, too? Sorry for so many questions, I just want to make sure I am on the same page!

Felix_A_G · January 26, 2018, 4:21pm

Hi Matthew, they are partially demultiplexed. There are 52 samples in the 12 pairs of fastq files, and no sample is repeated across the fastq files. I am attaching two example mapping files. The other group has 32 samples in another 4 pairs of files. There is a 1-year difference between the processing of the first 52 samples and the rest.

Thank you

Mapping_file_S28.txt (541 Bytes)
Mapping_file_S32.txt (556 Bytes)

thermokarst · January 26, 2018, 4:50pm

Cool, thanks for the info. Are you planning on processing with DADA2? If so, it is important not to combine sequencing runs before denoising: DADA2 expects samples from one run at a time while it does its thing.
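If DADA2 is the plan, one way to still analyze all the samples together is to denoise each run on its own and merge the outputs afterwards. Here is a minimal sketch of that merge step, not a tested pipeline. The file names and the `QIIME` dry-run variable are hypothetical, and the flag names (`--i-tables`, `--i-data`) are the ones I would expect in recent QIIME 2 releases, so please confirm them with `qiime feature-table merge --help` for your version.

```shell
# Hypothetical dry-run hook: leave QIIME unset to call the real CLI,
# or set QIIME="echo qiime" to only print the commands.
QIIME="${QIIME:-qiime}"

# merge_tables OUT TABLE...: merge per-run feature tables into OUT.
# Assumes paths contain no spaces (the arguments are joined into one string).
merge_tables() {
  local out="$1" args="" t
  shift
  for t in "$@"; do args="$args --i-tables $t"; done
  $QIIME feature-table merge $args --o-merged-table "$out"
}

# merge_seqs OUT SEQS...: merge per-run representative sequences into OUT.
merge_seqs() {
  local out="$1" args="" s
  shift
  for s in "$@"; do args="$args --i-data $s"; done
  $QIIME feature-table merge-seqs $args --o-merged-data "$out"
}

# Example with hypothetical per-run DADA2 outputs:
# merge_tables table-merged.qza table-run1.qza table-run2.qza
# merge_seqs rep-seqs-merged.qza rep-seqs-run1.qza rep-seqs-run2.qza
```

Using the same trimming and truncation settings for every denoising run is probably the safest choice, so that the features are comparable once merged.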

The partial demultiplexing seems pretty strange to me, and looks like it might make more work for you than desirable. If I were in your shoes, I would start as far upstream as possible and demultiplex all of the samples at once, although I recognize this is probably easier said than done.

If you want to keep going with the data you have, as-is, here is a strategy that I think can get you moving (note: this only works if the barcode sequences are still in your reads). I will outline a hypothetical data directory:

raw reads

$ ls .
group_a_r1.fastq.gz
group_a_r2.fastq.gz
group_b_r1.fastq.gz
group_b_r2.fastq.gz

Groups A and B are just generic placeholders here. From what I understand, these would actually be your partially demultiplexed reads, like S28 and S32.
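Because this strategy depends on the barcodes still being in the reads, it may be worth checking that before importing anything. A rough sketch, assuming the barcode sits at the very start of the read and the file is a standard four-lines-per-record FASTQ (the function name `count_barcode_reads` is made up for this example):

```shell
# Print "hits/total": how many reads in a gzipped FASTQ start with the barcode.
# Assumption: the barcode is at the 5' end of the read, with no spacer before it.
count_barcode_reads() {
  local fastq_gz="$1" barcode="$2"
  gzip -dc "$fastq_gz" | awk -v bc="$barcode" '
    NR % 4 == 2 { total++; if (index($0, bc) == 1) hits++ }  # sequence lines only
    END { printf "%d/%d\n", hits, total }'
}

# Example: count_barcode_reads group_a_r1.fastq.gz AAAA
```

If nearly every barcode from the mapping file gives a count close to zero, the barcodes were probably stripped already (or are in the reverse read), and the demux step below would not find anything.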

group a demux-only metadata, group_a.tsv

#SampleID	Barcode
a1	AAAA
a2	GGGG
a3	CCCC
a4	TTTT

group b demux-only metadata, group_b.tsv

#SampleID	Barcode
b1	AAAA
b2	GGGG
b3	CCCC
b4	TTTT
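In this hypothetical layout both groups reuse the same barcodes, so the sample IDs are what keep the samples apart. If the per-group results will eventually be combined, it is probably worth confirming that no sample ID appears in more than one metadata file. A small sketch (`duplicate_sample_ids` is a hypothetical helper, not a QIIME 2 command):

```shell
# Print every sample ID that occurs more than once across the given TSV files.
# Assumes the sample ID is the first tab-separated column and that header
# lines start with "#", as in the metadata files above.
duplicate_sample_ids() {
  awk -F '\t' '
    /^#/ || NF == 0 { next }   # skip header/comment lines and blank lines
    { seen[$1]++ }
    END { for (id in seen) if (seen[id] > 1) print id }' "$@" | sort
}

# Example: duplicate_sample_ids group_a.tsv group_b.tsv
```

No output means the IDs are unique across the files that were passed in.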

import and demux group a

$ mkdir group_a
$ ln group_a_r1.fastq.gz group_a/forward.fastq.gz
$ ln group_a_r2.fastq.gz group_a/reverse.fastq.gz
$ qiime tools import \
  --type MultiplexedPairedEndBarcodeInSequence \
  --input-path group_a \
  --output-path group_a.qza
$ qiime cutadapt demux-paired \
  --i-seqs group_a.qza \
  --m-barcodes-file group_a.tsv \
  --m-barcodes-category Barcode \
  --p-error-rate 0 \
  --o-per-sample-sequences demux-a.qza \
  --o-untrimmed-sequences untrimmed-a.qza \
  --verbose
# Hypothetical downstream steps
$ qiime demux summarize ... 
$ qiime dada2 denoise-paired ...

import and demux group b

$ mkdir group_b
$ ln group_b_r1.fastq.gz group_b/forward.fastq.gz
$ ln group_b_r2.fastq.gz group_b/reverse.fastq.gz
$ qiime tools import \
  --type MultiplexedPairedEndBarcodeInSequence \
  --input-path group_b \
  --output-path group_b.qza
$ qiime cutadapt demux-paired \
  --i-seqs group_b.qza \
  --m-barcodes-file group_b.tsv \
  --m-barcodes-category Barcode \
  --p-error-rate 0 \
  --o-per-sample-sequences demux-b.qza \
  --o-untrimmed-sequences untrimmed-b.qza \
  --verbose
# Hypothetical downstream steps
$ qiime demux summarize ... 
$ qiime dada2 denoise-paired ...
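Since the two blocks above differ only in the group name, the import and demux steps could be wrapped in a small function and looped over. This is only a sketch: it reuses the exact flags from the commands above (which may be named differently in other QIIME 2 releases), assumes the files are named `<group>_r1.fastq.gz`, `<group>_r2.fastq.gz` and `<group>.tsv`, and the `QIIME` variable is a hypothetical hook for doing a dry run.

```shell
# Hypothetical dry-run hook: leave QIIME unset to call the real CLI,
# or set QIIME="echo qiime" to only print the commands.
QIIME="${QIIME:-qiime}"

# demux_group NAME: import and demultiplex one partially demultiplexed pair.
# Outputs NAME.qza, demux-NAME.qza and untrimmed-NAME.qza in the current directory.
demux_group() {
  local name="$1"
  mkdir -p "$name" || return 1
  ln -f "${name}_r1.fastq.gz" "$name/forward.fastq.gz" || return 1
  ln -f "${name}_r2.fastq.gz" "$name/reverse.fastq.gz" || return 1
  $QIIME tools import \
    --type MultiplexedPairedEndBarcodeInSequence \
    --input-path "$name" \
    --output-path "$name.qza" || return 1
  $QIIME cutadapt demux-paired \
    --i-seqs "$name.qza" \
    --m-barcodes-file "$name.tsv" \
    --m-barcodes-category Barcode \
    --p-error-rate 0 \
    --o-per-sample-sequences "demux-$name.qza" \
    --o-untrimmed-sequences "untrimmed-$name.qza" \
    --verbose
}

# Example: for g in group_a group_b; do demux_group "$g" || break; done
```

Running it once with `QIIME="echo qiime"` prints the commands without executing them, which is a cheap way to check the file names before committing to a long run.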

You would need to do this once per partially demultiplexed read pair, so 12 times for the first run and 4 times for the second. This is why my initial recommendation was to see if you can get your hands on the reads from before they were partially demultiplexed, because it might be easier to handle them there (you would only need to demultiplex twice, once per run). There is a lot to digest here, so please let us know if you have any questions!
