Cutadapt to split multiple regions per sample

jwdebelius · May 19, 2020, 10:11pm

I'm trying to figure out an elegent way to apply double demultiplexing on samples which have already been demultiplexed once. (Similar to the Ion Torrent/Multiple regions) problem. (Maybe @cjone228 and @Lauren have already solved this, and I haven't found the thread.)

The inelegent solution Ive come up with is to import each sample as a multiplexed sample, demultiplex by region, filter each sample into regions, and then try to find a way to merge the sequences before running dada2.

qiime tools import \
 --input-path example-seqs/ \
 --output-path example-all.qza \
 --type 'MultiplexedPairedEndBarcodeInSequence'

 qiime cutadapt demux-paired \
 --i-seqs ./example-per-sample-seqs.qza \
 --m-forward-barcodes-file barcodes.tsv 
 --m-reverse-barcodes-file barcodes.tsv \
 --m-forward-barcodes-column fwd-barcode \
 --m-reverse-barcodes-column rev-barcode \
 --o-per-sample-sequences example-by-region.qza \
 --o-untrimmed-sequences example-unmapped.qza

At which point, i can seperate things back out into per-region files and then loop back into QIIME 2 by exporting the artifact, building a per-region manifest, and then re-importing. (Given my occasionally questionable typing skills, there is a non-zero probability this will lead to an error between and ).

So, one thought for larger studies would be to have a cutadapt method that worked on a per-sample basis and then seperated the samples back out to a set of SampleData[SequencesWithQuality] for denoising, etc.

Again, Im not sure if the problem has been solved, but Im struggling with a solution that doesn't require several data-related liberties.

Best,
Justine

Lauren · May 20, 2020, 5:34pm

Hi Justine,

Thanks for thinking of us! I'm sorry to say that I don't have a solution for you, but I can discuss our situation a little in case it helps. We have two scenarios: Scenario 1 where we keep our entire file intact (with all V regions in the same file) or Scenario 2 where we split by V region and analyze separately in order to compare between V regions. I believe our scenario 2 is very similar to what you are describing here. Because ThermoFisher won't give us the primer sequences, we finally convinced them to separate out V regions into separate fastq files for us. The downside is, they wrote the script so I am unsure exactly what they did aside from the fact that I know they used cutadapt. Your script above seems reasonable. We are now left with separate fastq's for each V region just as you would be with your script above. Our plan was also to import into QIIME2 via per-region manifest as you describe.

I believe, @Jen_S and @KMaki are also splitting fastq's at the beginning of their pipeline. They may have some additional insight.

Best,
Lauren

KMaki · May 27, 2020, 7:06pm

Thanks for thinking of me @Lauren happy to try and help!

Hey Justine @jwdebelius - Super sorry for the delay here! Somehow missed this alert.

Just so I am clear, you want to take demultiplexed fastq files and then separate out sequences again to additional fastq files based on a parameter (for example, by V region?)

The files that we have from our Ion Torrent sequencer are demultiplexed so there is just one fastq per id, but each fastq file has several variable regions in the file that we split out with cutadapt using primers as our filtering criteria. We input as single end, but they are mixed orientation (we combine forward and reverse reads after dada2 is run). I will summarize our current workflow with the commands we use and happy to expand on anything that’s unclear or needs more detail. Also- if you see something that looks wrong please let me know

Import demultiplexed by id:

qiime tools import --type "SampleData[SequencesWithQuality]" 
--input-format SingleEndFastqManifestPhred33V2 
--input-path ./manifest_run01.tsv 
--output-path ./demux_seqs_run01.qza

Then we perform cutadapt, for each region and direction example V2f/V2r. The workflow I am performing is on mock sequences and there are a bunch of different runs so I have to perform cutadapt separately by run and by region but I run them in parallel using our HPC.

qiime cutadapt trim-single 
--i-demultiplexed-sequences ./demux_seqs_run01.qza 
--p-front XXXX(primer is put here) 
--p-error-rate .1 
--p-indels 
--p-match-read-wildcards 
--p-discard-untrimmed 
--o-trimmed-sequences run01_v2ftrimmed.qza

From here I will have separate folders for all of my regions/runs, and run dada2 for denoising for each separate .qza files (each region/direction/run). We use denoise pyro since its ion torrent data

qiime dada2 denoise-pyro \
--p-trim-left 15 \
--p-trunc-len 250 \
--p-max-ee 1.0 \
--i-demultiplexed-seqs ./run01_v2f_trimmed.qza \
--o-table ./dada2_pyro_run01_v2f_table.qza \
--o-representative-sequences ./dada2_pyro_run01_v2f_rep_seq.qza \
--o-denoising-stats ./dada2_pyro_run01_v2f_stats.qza

I think the key here is having the primers if that is how you want to “re-demultiplex” since cutadapt can then pull out the regions and discard everything else based which primers match correctly. Hopefully this is helpful, let me know if I can expand on anything!

Katherine

jwdebelius · May 27, 2020, 10:48pm

Hi @KMaki,

Thank you so much! This is exactly what I was thinking about. I was going to suggest developing it, but Im glad you've already solved the problem. Whoo!

And thank you @Lauren for the recommendation and help.

Best,
Justine

KMaki · May 27, 2020, 10:50pm

Happy to help! Def need to give @Jen_S the credit for this because she introduced me to this script in the first place Please let me know if you run into any issues
Sincerely,
Katherine

thermokarst · December 29, 2020, 4:14am

A post was split to a new topic: How can I use this data to divide it into 7 areas?