Split one merged fastq into per-sample fastq files

This is the same question as these two posts, which were never formally answered: how to generate separate per-sample fastq files from one merged fastq file.

I used to do this with http://qiime.org/scripts/split_sequence_file_on_sample_ids.html , but that is no longer working. Is there an equivalent command in QIIME2?

Specifically, I used QIIME1 to get these files into the correct per-sample fastq format to then bring into dada2 for processing. The original files were EMP fastq files that were then demultiplexed and made into per-run demultiplexed fastq files. I want to combine these files with new data that has per-sample fastq files and will run through dada2. Then, in dada2, I would merge the data.

@CarlyRae

Thanks for reaching out! I want to make sure that I understand what you are trying to do. From your post, it looks like your main workflow is in DADA2 (not the Qiime 2 plugin) and that you have been using the Qiime 1 script to get separate files for each sample so that you can feed them into DADA2. Is this correct?

Also, do you know the format that your data is in (Casava, EMP-protocol, etc.)? Do you have your barcodes in a separate file or inline?

You also may want to take a look at Overview of QIIME 2 Plugin Workflows — QIIME 2 2021.2.0 documentation to see if that answers any of your questions. It may be much more relevant though if your workflow was mostly in Qiime 2. Additionally, we have a variety of filtering and merging tools available within Qiime 2 if you would be interested in moving to Qiime 2. This would also give you better provenance tracking and could simplify the combining of your different data sources.

Hi @Keegan-Evans. Thank you for the reply.

Ok, answer is no there is no equivalent to the QIIME1 split_sequence_file_on_sample_ids command in QIIME2 I take it.

What I have is (1) ‘old data’ generated using the EMP protocol, with everything run through QIIME2, and (2) ‘new data’ generated using my in-house protocol with the same forward primer but a different reverse primer than EMP, and demultiplexed by the MiSeq, so in Casava format. I want to combine those ASVs and make a beautiful pilot data figure for an upcoming grant.

I am guessing, though, that I will need to go back to the ‘old data’, use only the forward reads, re-run them through dada2 (in QIIME2), and trim at the same place as the ‘new data’, which will also use only the forward reads. Then merge the feature tables. Does this sound like the correct workflow to you?

But, yes you are right - why not just use QIIME2 to do the work for me. Sure, I’m in. Decade long user of QIIME and always in to use these amazing tools you all have made.

@CarlyRae

Ok, answer is no there is no equivalent to the QIIME1 split_sequence_file_on_sample_ids command in QIIME2 I take it.

While it is not a direct equivalent, ‘demux’ generates a series of per-sample files that can be accessed by unzipping the .qza, if that ends up being something that you really need in the future.
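For readers who only need the old QIIME 1 behaviour outside of QIIME 2, here is a minimal shell sketch. It assumes the merged file uses QIIME 1-style headers of the form `@SampleID_N ...` (sample ID, underscore, running read number) and strictly four lines per record. The function name `split_fastq_by_sample` and the output file naming are my own choices for illustration, not a QIIME command.

```shell
# Hypothetical helper (not part of QIIME): split a merged FASTQ into one file
# per sample. Assumes headers look like "@SampleID_123 ..." and that every
# record is exactly four lines (no wrapped sequences).
split_fastq_by_sample() {
    merged="$1"
    outdir="$2"
    mkdir -p "$outdir"
    awk -v outdir="$outdir" '
        NR % 4 == 1 {
            id = substr($1, 2)          # drop the leading "@"
            sub(/_[0-9]+$/, "", id)     # drop the trailing "_<read number>"
            out = outdir "/" id ".fastq"
        }
        { print >> out }                # append the line to the sample file
    ' "$merged"
}
```

Usage would look like `split_fastq_by_sample seqs.fastq per_sample_fastq`. Because the output is appended, point it at an empty directory; with a very large number of samples, some awk implementations may hit an open-file limit.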

However, in your case, a workflow similar to what you described would be cleaner and easier. There are a few really important points here, though, to ensure you end up with high-quality results.

  1. De-noising needs to be done on a per-run basis. The merge tools provided by QIIME2 can take an arbitrary number of datasets as arguments, so if you have 7 sequencing runs, de-noise them separately and there should be no problem combining all 7 at the same time. This has to do with how the de-noising algorithms work.

  2. Your data must be stripped of all non-biological information, that is, primers and barcodes. The upshot of this is that once you get the data into a ‘FeatureTable[Frequency]’ and ‘FeatureData[Sequence]’, it doesn’t matter what primers you used to sequence your data; ‘feature-table merge’ will put them together.

  3. All of the ASVs absolutely must be the same length to get valid analysis results. From your question, you already know this, but I wanted to leave a note of it here for anyone reading this down the road. In your case, you may or may not need to trim your old data again; most likely you will.
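The three points above could be sketched roughly as follows. This is an outline rather than a tested pipeline: the directory layout (`run1/table.qza`, `run1/rep-seqs.qza`, and so on) and the function name `merge_runs` are hypothetical, and it assumes each run has already been denoised on its own and trimmed to the same length. `qiime feature-table merge` and `qiime feature-table merge-seqs` accept repeated `--i-tables` and `--i-data` inputs respectively.

```shell
# Hypothetical helper: merge already-denoised runs. Each argument is a
# directory assumed to contain table.qza and rep-seqs.qza from one run.
merge_runs() {
    table_args=""
    seq_args=""
    for run in "$@"; do
        table_args="$table_args --i-tables $run/table.qza"
        seq_args="$seq_args --i-data $run/rep-seqs.qza"
    done
    # Word splitting of the unquoted variables is intentional here, so run
    # directory names must not contain spaces.
    qiime feature-table merge $table_args --o-merged-table table-merged.qza &&
    qiime feature-table merge-seqs $seq_args --o-merged-data rep-seqs-merged.qza
}
```

With seven runs this would be called as `merge_runs run1 run2 run3 run4 run5 run6 run7`, which does each merge in a single operation.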

I think you should be able to get all of your data into ‘FeatureTable[Frequency]’ and ‘FeatureData[Sequence]’ artifacts, then merge all of each type together. You can then use the merged data for your downstream processing.

For an example of what this would look like, you could take a look at the FMT study exercise and the ‘feature-table merge’ documentation. While I believe each of these only shows two datasets in any one operation, any number of ‘FeatureTable[Frequency]’ or ‘FeatureData[Sequence]’ artifacts can be provided, allowing you to do all of the merge operations on each type at the same time.

As a point of clarification, when I say merge here, I mean it in the SQL-esque ‘join’ sense, where the goal is simply to combine multiple datasets. I do not mean the operation by which forward and reverse sequencing reads are matched together, which confusingly can go by the same term. That is how I am reading what you are trying to do, but again I wanted to make it clear for future readers.

Hopefully that gives you what you need to be able to move forward. However, getting data into Qiime 2 is often one of the more difficult and complicated steps so we are more than happy to help if you have any more questions.

Here is my code in case it is of use to anyone in the future. @Keegan-Evans let me know if you see any red flags. It was quick, so I skipped several steps, then moved over into R to make a quick figure. I appreciate the support. Did the trick.

‘Old data’ from EMP set-up

qiime tools import \
  --type EMPPairedEndSequences \
  --input-path folder path \
  --output-path emp-paired-end-sequences.qza

qiime demux emp-paired \
  --m-barcodes-file mapping file path \
  --m-barcodes-column BarcodeSequence \
  --i-seqs qza sequence file \
  --o-per-sample-sequences demux.qza

Use only forward reads since we have different reverse primers.

Judging by rep-seqs.qza after the code finishes, it looks like the forward primer was already removed at the demux step.

qiime dada2 denoise-single --i-demultiplexed-seqs demux.qza --p-trim-left 0 --p-trunc-len 250 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza

qiime tools export --input-path rep-seqs.qza --output-path rep-seqs-PGF4

New data from in-house protocol, demultiplexed on MiSeq

In order to import them into qiime without error this is the easiest way: make them Casava formatted.

Scroll down in link below to find info about this style of sequences

Casava 1.8 demultiplexed fastq
https://docs.qiime2.org/2018.6/tutorials/importing/#sequence-data-with-sequence-quality-information

However, the files come off the MiSeq in per-sample folders. Move into the main folder above the per-sample folders and copy all files from each subfolder into one main folder. This is Casava format (all forward and reverse reads for all samples in one folder).

cp -v *_L001*/* ~/Documents/Reads

then remove those folders

rm -r *_L001*/

We want to import only the forward reads, and they must be gzipped. I moved only the forward subset manually into this folder: ~/Documents/Reads/ForwardOnlySubset

gzip ForwardOnlySubset/*
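The manual move could also be scripted. A small sketch, assuming the MiSeq files follow the usual Casava naming with `_R1_` marking forward reads (for example `Sample1_S1_L001_R1_001.fastq`); the function name `collect_forward_reads` is made up for illustration.

```shell
# Hypothetical helper: copy only forward (R1) reads into a subfolder and
# gzip any that are not compressed yet. Assumes Casava-style file names.
collect_forward_reads() {
    src="$1"
    dest="$2"
    mkdir -p "$dest"
    for f in "$src"/*_R1_*.fastq "$src"/*_R1_*.fastq.gz; do
        [ -e "$f" ] || continue     # skip patterns that matched nothing
        cp "$f" "$dest"/
    done
    for f in "$dest"/*.fastq; do
        [ -e "$f" ] || continue
        gzip "$f"                   # replaces <name>.fastq with <name>.fastq.gz
    done
}
```

For the folders used here, the call would be `collect_forward_reads ~/Documents/Reads ~/Documents/Reads/ForwardOnlySubset`.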

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path ForwardOnlySubset \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-single-end.qza

Since this protocol doesn’t remove the forward primer in demux (unlike EMP), we set the parameters so that the total length for both EMP and this protocol is 250 bp. Here, 269 - 19 (forward primer length in bp) = 250.

qiime dada2 denoise-single --i-demultiplexed-seqs demux-single-end.qza --p-trim-left 19 --p-trunc-len 269 --o-table table2.qza --o-representative-sequences rep-seqs2.qza --o-denoising-stats denoising-stats2.qza
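The length arithmetic for the two denoising runs can be checked with a few lines of shell. This is only an illustration of the reasoning above: DADA2 truncates each read at `--p-trunc-len` and removes `--p-trim-left` bases from the start, so the retained length is the difference. The function name `retained_length` is hypothetical.

```shell
# Retained read length after DADA2 trimming: the read is cut at trunc-len and
# the first trim-left bases are dropped, so length = trunc_len - trim_left.
retained_length() {
    trunc_len="$1"
    trim_left="$2"
    echo $(( trunc_len - trim_left ))
}

emp_len=$(retained_length 250 0)       # old EMP data, primer not in reads
inhouse_len=$(retained_length 269 19)  # new in-house data, 19 bp primer
echo "EMP: $emp_len bp, in-house: $inhouse_len bp"
```

Both values come out to 250, which is what makes the two feature tables comparable after merging.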

qiime tools export --input-path rep-seqs2.qza --output-path rep-seqs-AmbXeno

qiime feature-table merge --i-tables table.qza --i-tables table2.qza --o-merged-table table-merged.qza

Just taking table out, don’t need to know taxonomy for this prelim figure

qiime tools export --input-path table-merged.qza --output-path exported-feature-table

biom convert -i exported-feature-table/feature-table.biom -o table.from_biom.txt --to-tsv
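As a quick sanity check on the equal-length requirement, the exported representative sequences can be tallied by length. A sketch, assuming each export directory contains `dna-sequences.fasta` (the file QIIME 2 writes when exporting representative sequences); `fasta_length_counts` is a made-up name.

```shell
# Hypothetical helper: print "<length> <count>" for every sequence length
# found in one or more FASTA files. Handles sequences wrapped over lines.
fasta_length_counts() {
    awk '
        /^>/ { if (len > 0) counts[len]++; len = 0; next }
             { len += length($0) }
        END  { if (len > 0) counts[len]++
               for (l in counts) print l, counts[l] }
    ' "$@" | sort -n
}
```

Calling `fasta_length_counts rep-seqs-PGF4/dna-sequences.fasta rep-seqs-AmbXeno/dna-sequences.fasta` should print a single line if every ASV from both runs has the same length.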