Filtering and then merging 2 different runs (same barcodes used in both runs)

pjtorres · July 26, 2017, 4:28pm

Hey Folks,

I have 2 experiments that were sequenced across two different runs: half of the samples from experiment 1 and half of the samples from experiment 2 were sequenced on one run (Run 1), and the other halves of both experiments were sequenced on another run (Run 2). What is the best way to filter out the samples from Run 1 and Run 2 that belong to experiment 2 and then merge the experiment 1 samples for analysis? I already demultiplexed each run independently. The same barcodes were used in both runs, so I could not merge my fastq files right away and then demultiplex. Is there a way to merge both of my demux.qza files and then filter out all samples belonging to experiment 2 based on sample ID?

I have seen this tutorial. There is a way to merge and then filter your table.qza after sequence quality control, but I don’t think there is a way to filter the rep-seqs.qza before generating a tree, right?

Any tips? Much appreciated!

jairideout · July 27, 2017, 10:42pm

Hi @pjtorres! These are great questions. We're having an internal discussion about the best way forward and will follow up here shortly.

Were both experiments/runs generated using the same primer pair?

pjtorres · July 27, 2017, 11:36pm

Hey @jairideout,

Thanks for getting back. Yes, everything was the same except for the samples: same primer pair from EMP, same barcodes, etc. This is my current workaround after struggling for a while:

(I forgot to mention that my barcodes and sequences are in a single fastq file.)

1. Substituted the first nucleotide with a "P" on every sequence line of my Run 2 fastq file (the barcode is at the 5' end of every read):
   sed '2~4s/^\(.\{1\}\)/P/' Run2.fastq > Run2_nucreplacedwP.fastq
2. Did the same for the BarcodeSequence column in my Run 2 mapping file.
3. Ran extract_barcodes.py.
4. Merged my two runs and then made QIIME 2 artifacts:
   cat Run1.fastq Run2_nucreplacedwP.fastq >> merged_runs.fastq

To check this, I demultiplexed Run 2 alone and compared its per-sample sequence counts with those from the merged Run 1 and Run 2 data (with the first-nucleotide substitution). The counts were the same.
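
The `2~4` address in the sed command above is a GNU sed extension, so it may not work with other sed implementations. Below is a portable sketch of the same substitution with awk, run on toy data. The `*_toy.fastq` file names and the reads are made up for illustration, and every FASTQ record is assumed to span exactly four lines:

```shell
# Toy stand-ins for Run1.fastq and Run2.fastq (hypothetical reads, 4 lines per record).
workdir=$(mktemp -d)
printf '%s\n' '@r1' 'ACGTACGTACGTTTGG' '+' 'IIIIIIIIIIIIIIII' > "$workdir/Run1_toy.fastq"
printf '%s\n' '@r2' 'ACGTACGTACGTCCAA' '+' 'IIIIIIIIIIIIIIII' \
              '@r3' 'TTGCATGCATGCAACC' '+' 'IIIIIIIIIIIIIIII' > "$workdir/Run2_toy.fastq"

# Replace the first base of each sequence line (line 2 of every record) with "P".
tag_first_base() {
  awk 'NR % 4 == 2 { $0 = "P" substr($0, 2) } { print }' "$1"
}

tag_first_base "$workdir/Run2_toy.fastq" > "$workdir/Run2_toy_tagged.fastq"

# ">" overwrites the output, so a rerun never appends to a stale merged file.
cat "$workdir/Run1_toy.fastq" "$workdir/Run2_toy_tagged.fastq" > "$workdir/merged_toy.fastq"

# Show only the sequence lines of the merged file.
awk 'NR % 4 == 2' "$workdir/merged_toy.fastq"
```

As in the workaround, the same first-character change has to be made in the BarcodeSequence column of the Run 2 mapping file so that the modified barcodes still match.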

thermokarst · July 31, 2017, 4:11pm

Hi @pjtorres, thanks for the update! We haven't forgotten about you; we are still having that internal discussion @jairideout mentioned above. Stay tuned!

jairideout · July 31, 2017, 6:50pm

Thanks for those details about your workflow!

You're correct that QIIME 2 doesn't currently support filtering representative sequences. We have an open issue to add this support and will follow up here when it's in a release.

After some internal discussion, we recommend:

1. Process each sequencing run independently with qiime dada2 (you'll get best results that way).
2. Merge the two feature tables and two sets of representative sequences created by dada2.
3. Split the feature tables using feature-table filter-samples to create a table for each experiment.
4. Align, filter, and build a phylogenetic tree using the merged representative sequences from step 2.
5. Use this tree in downstream analyses for either of your experiments.
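
A minimal sketch of steps 2 to 4 above. The artifact names, the `sample-metadata.tsv` file, and its `Experiment` column with values `exp1` and `exp2` are all hypothetical, and the option names are those of recent QIIME 2 releases as far as I know; older releases may name some of them differently, so check each command's `--help`. The commands are only printed unless `RUN=1` is set:

```shell
# Sketch only: file names and the "Experiment" metadata column are hypothetical.
# Commands are printed by default; set RUN=1 to actually execute them.
run() {
  printf '%s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

merge_split_and_tree() {
  # Step 2: merge the per-run DADA2 tables and representative sequences.
  run qiime feature-table merge \
    --i-tables run1-table.qza --i-tables run2-table.qza \
    --o-merged-table merged-table.qza
  run qiime feature-table merge-seqs \
    --i-data run1-rep-seqs.qza --i-data run2-rep-seqs.qza \
    --o-merged-data merged-rep-seqs.qza

  # Step 3: one feature table per experiment, selected through sample metadata.
  for exp in exp1 exp2; do
    run qiime feature-table filter-samples \
      --i-table merged-table.qza \
      --m-metadata-file sample-metadata.tsv \
      --p-where "[Experiment]='${exp}'" \
      --o-filtered-table "${exp}-table.qza"
  done

  # Step 4: align, mask, and build a tree from the merged representative sequences.
  run qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences merged-rep-seqs.qza \
    --o-alignment aligned-rep-seqs.qza \
    --o-masked-alignment masked-aligned-rep-seqs.qza \
    --o-tree unrooted-tree.qza \
    --o-rooted-tree rooted-tree.qza
}

merge_split_and_tree
```

Merging the tables this way assumes the sample IDs are unique across the two runs. To my knowledge, the merge fails by default when the same sample ID appears in both tables.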

The tree that you build from the merged representative sequences will contain sequences found in both runs/experiments, but using this "merged" tree shouldn't meaningfully change results compared with using representative sequences that have been filtered by experiment (this holds only when both runs used the same primer pair).

You might want to try both approaches (using the "merged" tree vs using experiment-specific trees) and compare the results. To filter your representative sequences to make them experiment-specific, you can:

1. Export the merged representative sequences and the per-experiment feature tables using qiime tools export. This will result in a FASTA file and two .biom files.
2. Use a tool such as QIIME 1's filter_fasta.py to split the rep seqs based on the feature IDs in the per-experiment .biom files created in the previous step.
3. Import the per-experiment rep seqs using qiime tools import and build a tree from each.
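
The filtering in step 2 comes down to keeping the FASTA records whose IDs appear in one experiment's table. Here is a toy sketch of that idea with awk, standing in for filter_fasta.py rather than reproducing it. The feature IDs, sequences, and file names are invented, and it assumes the experiment's feature IDs have already been written to a text file, one per line (how to pull them out of the .biom file is not shown):

```shell
workdir=$(mktemp -d)

# Toy merged representative sequences (hypothetical feature IDs and sequences).
printf '%s\n' '>featA' 'ACGTACGT' 'TTGG' '>featB' 'GGCCGGCC' '>featC' 'TTAATTAA' \
  > "$workdir/merged-rep-seqs.fasta"

# Feature IDs present in one experiment's table, one per line (hypothetical).
printf '%s\n' 'featA' 'featC' > "$workdir/exp1-feature-ids.txt"

# Keep only FASTA records whose ID (first word after ">") is in the ID list.
# The ID list must not be empty, or awk would treat the FASTA file as the list.
filter_fasta_by_ids() {  # usage: filter_fasta_by_ids ids.txt seqs.fasta
  awk 'NR == FNR { keep[$1] = 1; next }
       /^>/      { printing = (substr($1, 2) in keep) }
       printing  { print }' "$1" "$2"
}

filter_fasta_by_ids "$workdir/exp1-feature-ids.txt" "$workdir/merged-rep-seqs.fasta" \
  > "$workdir/exp1-rep-seqs.fasta"
cat "$workdir/exp1-rep-seqs.fasta"
```

Records that wrap over several lines, like `featA` here, are kept whole because the `printing` flag only changes on header lines.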

DADA2 works best when each sequencing run is processed separately and the resulting tables and representative sequences are merged afterwards. I recommend:

1. Create a QIIME 1 mapping file for each sequencing run.
2. Use QIIME 1's extract_barcodes.py on each of your fastq/mapping files.
3. The results from the previous step are now in the "EMP protocol multiplexed format" (as defined by QIIME 2). Import each set of "EMP multiplexed" sequences following this section of the importing tutorial. You can choose to import as single- or paired-end data depending on what type of data you have.
4. Use qiime demux emp to demultiplex each of the .qza files created in the previous step.
5. Use qiime dada2 to denoise each demultiplexed .qza and merge results (this is detailed at the beginning of my post).
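
A rough sketch of steps 2 to 5 for single-end data, to be repeated for each run. All file names are placeholders, the 12-nt barcode length assumes standard EMP barcodes, and the trim/truncation values are arbitrary and need to be chosen from your own quality profiles. The option names follow recent QIIME 2 releases and QIIME 1's extract_barcodes.py as I understand them, so verify them with `--help` before running. The commands are only printed unless `RUN=1` is set:

```shell
# Sketch only: file names, the 12-nt barcode length, and the truncation length
# are placeholders. Commands are printed by default; set RUN=1 to execute them.
run() {
  printf '%s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

process_run() {  # usage: process_run Run1
  name=$1
  # Step 2: split each read into its barcode and the remaining sequence (QIIME 1).
  run extract_barcodes.py --fastq1 "${name}.fastq" --input_type barcode_single_end \
    --bc1_len 12 --output_dir "${name}_extracted"

  # Step 3: the EMP single-end import expects sequences.fastq.gz and barcodes.fastq.gz.
  run mkdir -p "${name}_emp"
  run gzip "${name}_extracted/reads.fastq" "${name}_extracted/barcodes.fastq"
  run mv "${name}_extracted/reads.fastq.gz" "${name}_emp/sequences.fastq.gz"
  run mv "${name}_extracted/barcodes.fastq.gz" "${name}_emp/barcodes.fastq.gz"
  run qiime tools import --type EMPSingleEndSequences \
    --input-path "${name}_emp" --output-path "${name}_emp.qza"

  # Step 4: demultiplex with the run's own mapping file.
  run qiime demux emp-single --i-seqs "${name}_emp.qza" \
    --m-barcodes-file "${name}_map.txt" --m-barcodes-column BarcodeSequence \
    --o-per-sample-sequences "${name}_demux.qza" \
    --o-error-correction-details "${name}_demux-details.qza"

  # Step 5: denoise this run on its own; merging happens afterwards.
  run qiime dada2 denoise-single --i-demultiplexed-seqs "${name}_demux.qza" \
    --p-trim-left 0 --p-trunc-len 150 \
    --o-table "${name}_table.qza" \
    --o-representative-sequences "${name}_rep-seqs.qza" \
    --o-denoising-stats "${name}_denoising-stats.qza"
}

for run_id in Run1 Run2; do process_run "$run_id"; done
```

Because each run keeps its own mapping file, the shared barcodes no longer collide, and no edits to the reads are needed.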

pjtorres · July 31, 2017, 8:58pm

Awesome! Thank you all for your help and input. Much appreciated!

ebolyen · September 29, 2017, 7:33pm

QIIME 2 2017.9 now supports filtering your representative sequences with qiime feature-table filter-seqs!
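
A minimal sketch of how this could replace the export/filter/import route, assuming per-experiment tables named `exp1-table.qza` and `exp2-table.qza` and merged representative sequences in `merged-rep-seqs.qza` (all hypothetical names). The commands are only printed unless `RUN=1` is set:

```shell
# Sketch only: artifact names are hypothetical.
# Commands are printed by default; set RUN=1 to actually execute them.
run() {
  printf '%s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

filter_rep_seqs() {
  # Keep only the representative sequences whose feature IDs occur in each table.
  for exp in exp1 exp2; do
    run qiime feature-table filter-seqs \
      --i-data merged-rep-seqs.qza \
      --i-table "${exp}-table.qza" \
      --o-filtered-data "${exp}-rep-seqs.qza"
  done
}

filter_rep_seqs
```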