Import from .bam files: tool or tutorial

Having spent way more time than I'd like to admit finding that the data I obtained from sequencing does indeed contain paired reads, I'd like to smooth out road bumps for future users that try to get started from .bam files. The major tripping point that sent me to detours was that conveters (samtools which I went for, but also bedtools' bamtofastq and picard's SamToFastq as linked in a related post) happily take a bam file that has good (well at least present) metadata on the data being double-ended and throws it out in the default configuration of producing a single fastq file.

The smoothest experience would have been if qiime tools import would accept SAM data; this would not only be saving one more conversion step (which is annoying but doable, especially when one creates a manifest anyway). It would moreover have the advantage of carrying around information on the single- or double-endedness of reads around without loss at a manual import step, and would furthermore ease traceability by taking up the actual source data file in the provenance history.

Where could an issue be filed for addition of such an import? (With the plugin architecture of qiime, it is not immediately clear to me which repository that would be related to). Is there anything to be cleared before it would be filed? Technical details like which dependencies this would pull in (possibly samtools, which is q2-quality-control already depends on) could still be discussed there on the path going from a wishlist issue to a pull request.

If this is unfeasible for some reason, the next-best enhancement would be a pointer from the import tutorial to suitable tools and how to use them; is there a policy on quality of implementation a tool needs to have to be recommended there?

Hi @chrysn ,

Welcome to the forum!

To start the discussion, maybe it would be good to have a better picture of what your data and planned pipeline looks like.

  • What type of data do you have?
  • What are you doing upstream prior to importing to QIIME 2?
  • And what do you plan to do downstream?

We are already in progress to add support for BAM format in the very near future (and SAM could be added too). But the formats are linked to specific semantic types that define how they are used downstream. Right now this is specific to (meta)genome sequence alignments, but in theory the format could also be linked to a different semantic type if the appropriate transformers are added.

This is already inherent to the semantic type system. I.e., different semantic types specify whether the data are single- or paired-end, and these types are not interchangeable. Those semantic types are recorded in provenance, and can also be quickly checked (e.g., see qiime tools peek).

Let's keep the discussion going here for now and when we have a better idea of your use case we could find the right place to put an issue (if one is still needed).


Thanks for taking this up, and good to see there's already progress!

What type of data do you have?
What are you doing upstream prior to importing to QIIME 2?

I prepared environmental samples with Nextera index adapters, and our core facility processed a pool of it through Illumina sequencing, producing paired reads of 250bp each. Processing there includes demultiplexing and conversion into .bam format; their steps are documented in the headers of the BAM files:.

Steps done at the core facility according to `samtools view -h some.bam`
@HD     VN:1.5  SO:unsorted
@RG     ID:000000000-DD2KY.1#TCGACGTCTATCCTCT   PU:210728_M04059_0033_000000000-DD2KY#TCGACGTCTATCCTCT  PL:ILLUMINA     LB:168569       SM:168569
@PG     ID:SCS  PN:MiSeq Control Software       VN:   DS:Controlling software on instrument
@PG     ID:basecalling  PN:RTA  VN:    PP:SCS  DS:Basecalling Package
@PG     ID:bcl2fastq    PN:bcl2fastq    CL:bcl2fastq --runfolder-dir /groups/vbcf-ngs/seq/miseq/210728_M04059_0033_000000000-DD2KY  --no-bgzf-compression --output-dir /scratch/csfs/pipelines/illumina2bam-dir/210728_M04059_0033_000000000-DD2KY/work/27/7cb4df26853164a803a738e0fef957 --tiles s_1 --loading-threads 4 --processing-threads 4  --writing-threads 1  --ignore-missing-bcls  --create-fastq-for-index-readsVN:2.20.0       PP:basecalling
@PG     ID:fastq2bam    PN:fastq2bam    CL:java -jar fastq2bam.jar --runfolder /groups/vbcf-ngs/seq/miseq/210728_M04059_0033_000000000-DD2KY --outputDir . --lane 1 --date 29-07-2021 --libraries 0 --samples 0       VN:0.7  PP:bcl2fastq
@PG     ID:VBCFDemultiplexer    PN:VBCFDemultiplexer    CL:demultiplex --min-dist 1 --barcodepath --filetype text --prefix 000000000-DD2KY_1 --mut1 1 --mut2 1 --len1 8 --len2 8 --inpath /groups/vbcf-ngs/seq/miseq/210728_M04059_0033_000000000-DD2KY/Data/Intensities/BaseCalls/illumina2bam0.03_29-07-2021/000000000-DD2KY_1_29-07-2021.bam --correctdual none --keepconflicting --dealrandom convertBlank --transform none --asis       VN:0.8  PP:fastq2bam
@PG     ID:samtools     PN:samtools     PP:VBCFDemultiplexer    VN:1.13 CL:samtools view -h 000000000-DD2KY_20210728B_demux_1_R12037_20210730/000000000-DD2KY_1#168569_TCGACGTCTATCCTCT.bam

They provide the BAM files in a ZIP archive, together with what appears to be automated analyses done over the data as they see fit (for example bloomcat and fastqc evaluations, which I'm discarding).

My current processing involves converting the .bam files to .fastq files with samtools, and then using the PairedEndFastqManifestPhred33V2 input format to obtain a SampleData[PairedEndSequencesWithQuality] artifact.

samtools conversion details that are probably not relevant but listed for completeness
fastq/%-forward.fastq fastq/%-reverse.fastq:
        samtools fastq "$<" -1 "fastq/$*-forward.fastq" -2 "fastq/$*-reverse.fastq" -t -n

The input files for the respective samples are generated from the original sample list (which includes the adapter sequences and builds the file names from that). With the -1 and -2 output file names, the SAM file's semantics of of READ1 or READ2 are turned into the separate files the PairedEndFastqManifestPhred33V2 format expects. Flag -t keeps some metadata (which are probably discarded later anyway), and -n strips the /1 and /2 off the tail of the read names, which might otherwise impede the matching between the forward and reverse reads in the qiime importer.

And what do you plan to do downstream?

My current pipeline goes into dada2 denoise-paired, producing the expected ~400bp features that are expected.

From there, the qiime phylogeny takes over for tree building, and feature-classifier is used to match against databases, hoping to show indicator taxa for different sample groups.

This is already inherent to the semantic type system. I.e., different semantic types specify whether the data are single- or paired-end, and these types are not interchangeable. Those semantic types are recorded in provenance, and can also be quickly checked (e.g., see qiime tools peek).

Yes -- and it seems to me that similar data is also available in BAM files (by having READ1 and READ2 entries, and by having the headers in from the samtools conversion item above). With my current process of converting SAM files to FASTQ and loading them, I lose information that an import step could have carried around.

Even if the SAM file's provenance is structured differently, it could still probably be stored as a text block -- and the presence of a READ1 and READ2 into a SampleData[SequencesWithQuality] rather than a SampleData[PairedEndSequencesWithQuality] might set off alarm bells or fail. (Error message playing in my head: "The imported file contains READ1 and READ2 data, indicating paired reads. This is not supported for the requested output format of SampleData[SequencesWithQuality]. Use output format SampleData[PairedEndSequencesWithQuality], or set --force-format to discard the direction information").


Thanks for clarifying! This makes much more sense now.

Most sequencing service providers provide raw seq data as FASTQ (either demultiplexed or not), so I thought that you were perhaps trying to enter the pipeline midway through the analysis, or perhaps doing metagenome analysis. I see now that this is just how your sequencing center performs demultiplexing prior to delivering data.

My initial impression is that getting the multiplexed fastq data from the sequencing center might be a better way to enter the pipeline (at demultiplexing) and store that information in provenance. But I also realize that's not what you are asking :slight_smile:

Ah understood now. This is a very neat idea — to not only support SAM file specifications but also to parse the header line to add that information to QIIME 2 provenance. My worry is that it would be too complex, with so many different upstream processing possibilities, but technically possible.

Yep! This would be much easier to do by defining separate format specifications. New transformers can then be written to convert from the import format to the format stored on disk (which would still be fastq, as that is how the semantic types are currently defined, and also the data format used downstream e.g., by dada2).

You can see examples of the underlying formats and transformers here for how different BIOM format specifications can be imported, but the v1.0 format gets transformed to v2.1 when importing as a FeatureTable[Frequency] semantic type (a very different data type from what you are dealing with, but just a similar scenario of 2 formats supported at input but only 1 format is used on disk).

q2-types would technically be the right place to define new BAM/SAM format specifications, transformers, and link these to the relevant semantic types (SampleData[PairedEndSequencesWithQuality] or SampleData[SequencesWithQuality]). Check out the related formats and types here to see what is involved:

This would be the pathway forward, but maybe let's hear from @thermokarst and @ebolyen first to see if q2-types would be an appropriate place (or maybe you want to make your own plugin anyway — less oversight from us to cramp your style).

  1. add new SE/PE SAM or BAM manifest file formats: see here for examples of the current fastq manifest formats. This would define the manifest structure (which could probably be the same but with different file extentions of course), as well as validation of file contents (based on the SAM/BAM file specifications). So different formats would be used for PE or SE data.
  2. add new transformers to convert from those manifest formats to a CASAVA fastq format (i.e., change both the filenames to confirm to CASAVA naming conventions, as well as the file format). You can some other transformers for comparison here.
  3. add some unit tests as needed (this is maybe where writing your own plugin would be beneficial at least as an initial test playground, as we are very picky about making water-tight CI testing in q2-types).

The semantic types in question (SampleData[PairedEndSequencesWithQuality] or SampleData[SequencesWithQuality]) are already linked to those CASAVA fastq formats (as defined here), so once you complete steps 1 + 2 you can import BAM or SAM files (listed in a manifest file) as these types, and the data will automatically be transformed to the relevant format on disk.

If this sounds like an appealing way forward, you can read our developer documentation. Let's wait before opening an issue in q2-types. Either way, we are on hand if you run into any questions or problems.