Thanks for taking this up, and good to see there's already progress!
What type of data do you have?
What are you doing upstream prior to importing to QIIME 2?
I prepared environmental samples with Nextera index adapters, and our core facility processed a pool of it through Illumina sequencing, producing paired reads of 250bp each. Processing there includes demultiplexing and conversion into .bam format; their steps are documented in the headers of the BAM files:.
Steps done at the core facility according to `samtools view -h some.bam`
@HD VN:1.5 SO:unsorted
@RG ID:000000000-DD2KY.1#TCGACGTCTATCCTCT PU:210728_M04059_0033_000000000-DD2KY#TCGACGTCTATCCTCT PL:ILLUMINA LB:168569 SM:168569
@PG ID:SCS PN:MiSeq Control Software VN:4.0.0.1769 DS:Controlling software on instrument
@PG ID:basecalling PN:RTA VN:1.18.54.4 PP:SCS DS:Basecalling Package
@PG ID:bcl2fastq PN:bcl2fastq CL:bcl2fastq --runfolder-dir /groups/vbcf-ngs/seq/miseq/210728_M04059_0033_000000000-DD2KY --no-bgzf-compression --output-dir /scratch/csfs/pipelines/illumina2bam-dir/210728_M04059_0033_000000000-DD2KY/work/27/7cb4df26853164a803a738e0fef957 --tiles s_1 --loading-threads 4 --processing-threads 4 --writing-threads 1 --ignore-missing-bcls --create-fastq-for-index-readsVN:2.20.0 PP:basecalling
@PG ID:fastq2bam PN:fastq2bam CL:java -jar fastq2bam.jar --runfolder /groups/vbcf-ngs/seq/miseq/210728_M04059_0033_000000000-DD2KY --outputDir . --lane 1 --date 29-07-2021 --libraries 0 --samples 0 VN:0.7 PP:bcl2fastq
@PG ID:VBCFDemultiplexer PN:VBCFDemultiplexer CL:demultiplex --min-dist 1 --barcodepath 000000000-DD2KY_1_barcodes.tab --filetype text --prefix 000000000-DD2KY_1 --mut1 1 --mut2 1 --len1 8 --len2 8 --inpath /groups/vbcf-ngs/seq/miseq/210728_M04059_0033_000000000-DD2KY/Data/Intensities/BaseCalls/illumina2bam0.03_29-07-2021/000000000-DD2KY_1_29-07-2021.bam --correctdual none --keepconflicting --dealrandom convertBlank --transform none --asis VN:0.8 PP:fastq2bam
@PG ID:samtools PN:samtools PP:VBCFDemultiplexer VN:1.13 CL:samtools view -h 000000000-DD2KY_20210728B_demux_1_R12037_20210730/000000000-DD2KY_1#168569_TCGACGTCTATCCTCT.bam
M04059:33:000000000-DD2KY:1:1101:16641:1390#TCGACGTCTATCCTCT 77 * 0 0 * * 0 0 AACAGGATTAGATACCCGGAAGGCGGGGACGACGTCTGTCTCTTATACACATCTCCGAGCCCACGAGACTCGACGTCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAATACACATTATTGCCCTGCGTACGTTTCAGGCCGTAAATAATCCCCCATTCCTTGCCGGTATCCTAACCCCCTTCTTGGCTACCATCCACCGGATCCACATCCTGTGCCTTCGGTGGTGTTGTCGGTCCGTACT BBABBBFFFFFFGGGGGGGGGGGHGGGGGGGGGGGGHGHHHHHHHHHHHHHHHHHHGGGGGGHGGGGGGHHGHGGFGGHBHHHGGHHHHGDCHGHHHHHHHHHHHHGGGGGA;;9--:C00;9000BB0000B.ABA.;-:///////;-:.9:BB/9///.9:.://9;///--;-A//:/9;9..;-.;/:///9/::BFB.B/..--;-:9/:///;////////.....;....;/.--.9-..:. B2:Z:TATCCTCTT Q2:Z:BBCCCFFFF BC:Z:TCGACGTCA RG:Z:000000000-DD2KY.1#TCGACGTCTATCCTCT QT:Z:CCCCCCBCC
They provide the BAM files in a ZIP archive, together with what appears to be automated analyses done over the data as they see fit (for example bloomcat and fastqc evaluations, which I'm discarding).
My current processing involves converting the .bam
files to .fastq
files with samtools, and then using the PairedEndFastqManifestPhred33V2 input format to obtain a SampleData[PairedEndSequencesWithQuality] artifact.
samtools conversion details that are probably not relevant but listed for completeness
fastq/%-forward.fastq fastq/%-reverse.fastq:
samtools fastq "$<" -1 "fastq/$*-forward.fastq" -2 "fastq/$*-reverse.fastq" -t -n
The input files for the respective samples are generated from the original sample list (which includes the adapter sequences and builds the file names from that). With the -1
and -2
output file names, the SAM file's semantics of of READ1 or READ2 are turned into the separate files the PairedEndFastqManifestPhred33V2 format expects. Flag -t
keeps some metadata (which are probably discarded later anyway), and -n
strips the /1
and /2
off the tail of the read names, which might otherwise impede the matching between the forward and reverse reads in the qiime importer.
And what do you plan to do downstream?
My current pipeline goes into dada2 denoise-paired, producing the expected ~400bp features that are expected.
From there, the qiime phylogeny takes over for tree building, and feature-classifier is used to match against databases, hoping to show indicator taxa for different sample groups.
This is already inherent to the semantic type system. I.e., different semantic types specify whether the data are single- or paired-end, and these types are not interchangeable. Those semantic types are recorded in provenance, and can also be quickly checked (e.g., see qiime tools peek
).
Yes -- and it seems to me that similar data is also available in BAM files (by having READ1 and READ2 entries, and by having the headers in from the samtools conversion item above). With my current process of converting SAM files to FASTQ and loading them, I lose information that an import step could have carried around.
Even if the SAM file's provenance is structured differently, it could still probably be stored as a text block -- and the presence of a READ1 and READ2 into a SampleData[SequencesWithQuality] rather than a SampleData[PairedEndSequencesWithQuality] might set off alarm bells or fail. (Error message playing in my head: "The imported file contains READ1 and READ2 data, indicating paired reads. This is not supported for the requested output format of SampleData[SequencesWithQuality]. Use output format SampleData[PairedEndSequencesWithQuality], or set --force-format
to discard the direction information").