ORA Format Specification

Overall Design

Subdivision in Blocks

An ora file consists in a list of ora blocks. Each ora block contains up to 50000 reads and is self sufficient: all the information necessary for its decompression is contained within the block. There is therefore no overall file header containing additional information.

Reads in the ora file are ordered in the same way as the input FASTQ file format. If two paired FASTQ files are used as input, the reads are interleaved in the ora format (fragment 1 read 1, fragment 1 read 2, fragment 2 read1, fragment 2 read 2, ...).

This scheme has many advantages. First it makes possible the concatenation of compressed ora files: the concatenation of two lists of blocks is still a list of blocks and thus a valid ora file. Secondly it makes streaming possible, as soon as a single block is transferred its decompression can start. And lastly it makes parallel decompression easier, since each separate block can be handled independently from the rest.

Block Content

Figure 1 shows the content of an ora block. It starts with a header and is followed by a list of data sections. There are separate data sections for DNA sequences, read names, quality scores, sequence size, presence of ‘N’ characters in DNA sequences, and a crypto section possibly containing cryptographic keys. The quality scores are split into two different sections, one for sequences that contains at least one ‘N’ and the second one for reads that do not contain an ‘N’. This allows for different compression schemes for the two cases. The miscellaneous fields are for implementation dependent usage, as explained in a later section. Each data section can be of size 0 if unused. The block header contains in particular, among other metadata, the length in bytes of each of the data sections.

Figure 1: Content of an ora block.

It starts with a header, followed by a list of data sections. The header of the block, contains, among other metadata, the length of each of the data sections. The ora block contains all the information necessary for its decompression.

Data Types

In this document, the following data types are used:

Table 1: Data Types
Type	Description
uint8_t	unsigned 8 bit integer
uint16_t	unsigned 16 bit integer
uint32_t	unsigned 32 bit integer
uint64_t	unsigned 64 bit integer
int8_t	signed 8 bit integer
int16_t	signed 16 bit integer
int32_t	signed 32 bit integer
int64_t	signed 64 bit integer

Block Header

The ora block header is formally described in table 2. All multi-bytes values are stored in little-endian fashion.

Each ora block must start with this header, with the first two bytes of the header being the magic value 0x7C49. Note that the header size is not fixed in the format, but must include the list of following required fields. As of version 2.5.5, the header is 121 bytes, but may be larger in future versions or include additional implementation-dependent custom data. A compliant decoder should get the header size from the second field of the header. The header size includes the size of the magic number and 'header size' field.

Table 2: ORA Block Header
Field	Description	Type
magic	ora magic string 0x7C49	uint16_t
l_header	size of this header in bytes	int32_t
l_dna	size of the DNA buffer in bytes	uint32_t
l_names	size of the read names buffer in bytes	uint32_t
l_qualN	size of the quality buffer 1 in bytes	uint32_t
l_qual	size of the quality buffer 2 in bytes	uint32_t
l_size	size of the sequence sizes buffer in bytes	uint32_t
l_N	size of the N position info buffer in bytes	uint32_t
l_m1	size of the crypto buffer in bytes	uint32_t
l_m2	size of the miscellaneous buffer 1 in bytes	uint32_t
l_m3	size of the miscellaneous buffer 2 in bytes	uint32_t
flags	bitwise flags	uint32_t
l_read	read size	int32_t
n_reads	number of reads stored in this block	int32_t
version	ora version used to compress this block	uint16_t
b_id	block ID	uint64_t
q_type	type of quality scores	uint8_t
q4_1	Q4 quality symbol 1	uint8_t
q4_2	Q4 quality symbol 2	uint8_t
q4_3	Q4 quality symbol 3	uint8_t
q4_4	Q4 quality symbol 4	uint8_t
l_names_raw	size of uncompressed read names	uint32_t
l_DNA_raw	size of uncompressed DNA sequence	uint32_t
l_qual_raw	size of uncompressed DNA quality scores 2	uint32_t
l_qualN_raw	size of uncompressed DNA quality scores 1	uint32_t
l_qualTotal_raw	size of uncompressed DNA quality scores	uint32_t
c_time	epoch date of file creation	uint64_t
checksum_raw	checksum of the uncompressed fastq data	uint64_t
checksum_ref	checksum of the reference used for compression	uint64_t
checksum_comp	checksum of this ora block	uint64_t

Additional field information :

•

flags bitwise flags. List of flags is :

Bit	Description
0x1	all reads in block have same size
0x2	reads are interleaved, generated from pair of files
0x4	read names are not stored
0x8	read names are stored in tokenized mode (as described later)
0x10	read names stored in fallback mode
0x20	quality scores stored in fallback mode
0x40	DNA sequence stored in fallback mode
0x80	hybrid encryption mode activated
0x100	unused
0x200	unused
0x400	unused
0x800	unused
0x1000	original input file name stored in misc1 buffer
0x2000	input fastq file was gzipped

•

l_read

This field contains the read size in base pairs when all reads in block have the same size (as indicated by flag 0x1). In that case the “sequence size” section of the block is empty. If reads in block have different sizes, their sizes are stored in the “sequence size” data section.

•

n_reads

Number of reads stored in this ora block.

•

version

Version is stored as 10000 major + 100 minor + release, e.g. 20505 for version 2.5.5.

•

b_id

Identifier of the block, mainly used for debugging purposes. It is not guaranteed to be in consecutive or increasing order.

•

q_type

Quality type of the fastq file. May either be 4 for files containing at most 4 different quality scores, or 40 for all other FASTQ files.

•

Q4_1-4

When quality type is 4, the 4 fields Q4_1-4 contains the ascii characters of the 4 quality scores actually used in the FASTQ file.

•

c_time

Used to store the creation date of this ora block, as the value returned by the standard C library 'time' function. Note that the blocks of an ora file may have different creation dates (e.g. if a file was created as a concatenation of multiple files).

•

checksum_raw

Contains the checksum of the uncompressed data. It is computed with the xxhash function³.

•

checksum_ref

Contains the checksum of the reference genome file used for compression, computed with the xxhash function³.

•

checksum_comp

Contains the checksum of the block, computed with the xxhash function³. This checksum is computed from the block header and all the block data sections concatenated together, with the 'checksum_comp' field of the header set to 0.

Sequence compression format

Overall Description

ora is a reference based compressor. Reads are first mapped to a reference genome, and then encoded as a position in the genome plus a list of differences. The mapping algorithm is not a part of the specification, it is implementation dependent. Any mapper/aligner that provides basic information, e.g. such as read position on the reference genome, number and position of mismatches could be used. To be noted, the choice of read aligner plays an important role in the compression speed and compression ratio of an ora compliant compressor.

Sequences are encoded as a custom binary format defined below, which is then further compressed with an entropy coder.

DNA Sequence Binary Format

The sequence binary format is a list of read groups (i.e. an ora block contains a 'DNA sequence section' that itself consists in a list of read groups, with 'read group' defined below ). Each read group contains at most 8 reads. It starts with 4 bytes containing flag values for the following 8 reads, followed by 8 read records. The rationale for this grouping is to have easy to use byte aligned data: bit information per read is grouped with 8 reads to form a byte. Each read record contains the mapping data necessary to reconstruct the read from the reference at decompression.

The ora format currently does not use any kind of gapped information. Read alignments are therefore only limited to the following types:

•

Perfect: alignment of the whole read, 0 mismatch.

•

Global: alignment of the whole read, at least 1 mismatch.

•

Local: alignment of a part of the read only. ( also known as clipped alignment )

•

Unmapped : mapper was not able to map the read on the reference genome.

Read Group

One read group is defined as the following fields:

Table 3: ORA Read Group
Field	Description	Type
Perfect	perfect mapping	uint8_t
Forward	forward / reverse mapping	uint8_t
with N	read contains N	uint8_t
pos_t		uint8_t
read records 0 to 7		list of read records

Additional details:

• Perfect

One byte value. Bit i is set to 1 if i-th read of the group is perfectly aligned to the reference

(full-length mapping, 0 mismatch)¹.

• Forward

One byte value. Bit i is set to 1 if i-th read of the group is aligned to the forward strand, 0

if aligned in reverse.

• with N

One byte value. Bit i is set to 1 if i-th read of the group contains at least one ‘N’.

• pos_t

One byte value. Bit i is set to 1 if the field position of the i-th read of the group is encoded

with 16 bit mode as descirbed below.

• Read records 0 to 7

All read groups contains exactly 8 reads, except the last one of the ora block which contains 1 to 8 reads. The read group does not specifically store the number of reads encoded in the group, but this can be inferred from the total number of reads in this ora block stored in the block header.

¹with bit 0 denoting the most significant bit of the byte, and bit 7 the least significant bit.

Read Record

The read record specification depends on the alignment type.

Perfect alignment:

If the alignment is perfect (as defined in the read group), then the read record is defined as:

Field	Description	Type
position	position of the alignment on the reference genome	uint16_t / uint32_t

The position can either be an absolute position in 32 bit on the reference genome (with all chromosomes concatenated to form one virtual single sequence), or a 16 bit incremental position from the previous read. Alignments on a larger than 232 bp reference are not supported in this format. The coding mode for the read is specified in the read group, as specified via the corresponding bit in the pos_t field.

Non-Perfect alignment:

If the alignment is not perfect, the read record will start with an 8-bit flag field containing, among other things, the type of the alignment. The remaining alignment types are global, local, raw, raw4. The 8-bit flag is of type flag8_t described below.

flag8_t:

The 8-bit flag8_t field is defined as (with bit 0 denoting the most significant bit of the byte, and bit 7 the least significant bit): 1with bit 0 denoting the most significant bit of the byte, and bit 7 the least significant bit.

Bit	Description
bit 0-4	number of mismatches
bit 5	global alignment
bit 6	raw alignment
bit 7	raw4 alignment

Bits 0 to 4 contain the number of mismatches in the alignment, it is therefore a value in 0-31. Bit 5 is set to 1 if the alignment is global, 0 otherwise. The distinction between the three other types isspecified by bits 6 and 7. A local alignment would have bits 5, 6 and 7 set to 0.

Global alignment:

Global alignment denotes that the whole read is aligned to the reference genome. Its read record is defined as:

Field	Description	Type
flag8_t	bitwise flags	uint8_t
position	position of the alignment on the reference genome	uint16_t / uint32_t
mm_list	list of mismatch entries	variable size, list of mismatch_entry_t

The number of entries in the mm_list (defined later) is equal to the number of mismatches in the read indicated in bits 0-4 of the flag8_t.

Local alignment:

Local alignment denotes that only a part of the read has been aligned to the reference genome. Its read record is defined as:

Field	Description	Type
flag8_t	bitwise flags	uint8_t
position	position of the alignment on the reference genome²	uint16_t / uint32_t
mm_list	list of mismatch entries	variable size, list of mismatch_entry_t
l_left	length of unmapped left part	uint8_t
l_right	length of unmapped right part	uint8_t
raw_nt	raw unmapped nucleotides, encoded as 2 bit per nt	variable size

The raw_nt field contains the concatenation of unmapped left and right parts of the read. It is encoded as 2 bit per nucleotide, with values 0,1,2,3 coding respectively for A,C,G,T. The field size is rounded to the upper byte boundary, with the least significant bits of the last byte padded with 0. Figure 2 shows the encoding example of a local alignment.

²position in that case is the position on the reference genome of the first aligned bp of the read

Raw alignment:

In this mode the whole DNA sequence is encoded with 2 bits per nucleotide. It is mainly used for unmapped reads, but may be used in some other cases. 2 bit values 0,1,2,3 encode respectively for A,C,G,T.

Field	Description	Type
flag8_t	bitwise flags	uint8_t
raw_nt	raw unmapped nucleotides, encoded as 2 bit per nt	variable size

The 'raw_nt' field size is rounded to the upper byte boundary, with the least significant bits of the last byte padded with 0.

Raw4 alignment:

This is the same as before, but using a 4 bits per nucleotide encoding mode. This encoding mode is used for unmapped reads that contains Ns. Values 0,1,2,3,4 encode respectively for A,C,G,T,N.

Field	Description	Type
flag8_t	bitwise flags	uint8_t
raw_nt	raw unmapped nucleotides, encoded as 2 bit per nt	variable size

The 'raw_nt' field size is rounded to the upper byte boundary, with the least significant bits of the last byte padded with 0.

N list:

For reads that contain at least one ‘N’, as defined by the flag in the read group, then the read record contains at its end an additional field defining the list of position of ‘N’ in the sequence. It is defined as:

Field	Description	Type
nb_N	the number of N stored in the list	uint8_t
list_p	list of N positions	variable size, multiple of uint16_t

The format imposes a maximum value of nb_N = 31. Reads that would require a larger number are instead encoded in the raw4 mode.

Each position in the list is a 0-based position in the read defined as a an incremental offset from the previous N position. For values below 215 the offset is coded as a 16 bit unsigned value. For offsets larger than or equal to 215 then multiple uint16_t are emitted, which are summed to produce the desired total offset.

mismatch_entry_t:

The mismatch entry type is a single byte defined as3 ³:

Bit	Description
bit 0-5	incremental position of the mismatch
bit 6-7	alternate nucleotide

The mismatch position is a 0-based position in the read defined as an incremental offset from the previous mismatch position, or from position 0 for the first mismatch of the list. The alternate nucleotide is a 2-bit value defining the nucleotide mismatch, with values 0,1,2,3 coding respectively for A,C,G,T.

³with bit 0 denoting the most significant bit of the byte, and bit 7 the least significant bit.

Entropy coder for the DNA sequence data section

The DNA sequence data section consisting of the list of read groups as defined previously is then compressed with the zstd library⁴. Compression level of the zstd library is implementation dependent.

Fallback mode

DNA sections may alternatively be compressed in a Fallback mode as specified in the bit 0x40 of the flags of the block header. In this mode, all DNA sequences are concatenated together and compressed with zstd library, with an implementation dependent level.

Quality Scores Compression Format

Overall description

Quality scores are compressed in a completely lossless way. The overall scheme is to apply some transformations, then use a context model followed by a range encoder. The transformations and context models used are adapted to different kind of quality encodings. ora distinguishes between quality scores generated by sequencers that generate quality on 4 different levels, and other type of sequencers that generate more levels. These two types are respectively called Q4 and Q40 in the ora format.

Among a type, we further distinguish between reads that contain at least one ‘N’ and other reads. The idea is to further exploit specificities of the encoding: for Q4 qualities, the 0 quality value is usually reserved for ‘N’ in the sequence. Therefore if the read does not contain an ‘N’ then its quality score string only contains 3 different levels.

Q40 Quality Scores

This quality mode applies to quality strings that contains a range of at most 64 different quality values.

Binary format

For Q40 quality scores, the first transformation applied during compression is to replace some frequent trimers by a single 1 byte code.

At compression, the 190 most abundant trimers (words of 3 quality values, i.e a word Q₁Q₂Q₃) are computed from all the overlapping trimers of quality strings in the ora block. The data section starts with a table of size 190 x 4 Bytes containing the trimer value of each of the 190 most abundant trimer used. The trimer value is defined as (Q1–Qlowest)+(Q2–Qlowest)x64+(Q3–Qlowest)x64². Each trimer value is also mapped to trimer code in the range [64-254], as ordered in the table. Qlowest is the lowest quality value observed in the block, and is stored right after the 190-table as a single byte. Each quality value is transformed as follow :

•

If quality value is among an abundant 190 trimer:

Q₁Q₂Q₃←trimercode[Q₁Q₂Q₃]

•

else

Q←Q–Qlowest

After transformation, the string of quality values is a list of bytes representing either a single quality value ( if byte value in the range 0-63), or a word of 3 consecutive qualities (byte value in the range 64-254).

The beginning of the quality data section is in that case:

Field	Description	Type
190-table	trimer codes of the 190 most frequent trimers	190 x uint32_t
lowest_val	lowest quality value in the block	uint8_t

Q4 quality scores

The Q4 type applies to quality strings that contain at most 4 different quality values. The quality strings of each read is further divided into two groups, one group for strings that do not contain the 0 value (and hence only have 3 different quality values), and the rest.

The four different quality values originally in ASCII range are converted to four values in the 0-3 range. The block header contains in the q4* fields th association between the 0-3 range and the four corresponding ASCII values.

Fallback mode

Quality scores may alternatively be compressed in a Fallback mode as specified in the 0x20 bit flag of the block header. In this mode, all quality scores of the data section are concatenated together and compressed with zstd library, with an implementation dependent level.

Read names compression format

Overall description

Read names are tokenized based on standard C lib isdigit function. The read name is split into substrings containing only digits or only non digit characters. Read names of an ora block are encoded in a tokenized fashion if all read names in the block have the same number of tokens. They are otherwise encoded in the fallback mode described below.

Then i-th token of each read name are concatenated together into a token set. Digit tokens are binarized and then encoded as incremental value from the token of previous read name. Each token set is then compressed using zstd library⁴.

Binary format

The read names section is defined with the following fields :

Table 4: ORA Read Group
Field	Description	Type
nb_tokens	number of token sets	uint32_t
token_type_list	list of types of each token set	uint8_t
token_set_size_list	list of size of each token set	uint64_t
token_sets_list	list of all token sets	variable size

•

nb_tokens The number of token in read names. This means there will be nb_tokens token sets stored in the token_sets_list. token_type_list Type of each token set. Types are encoded as a value in the 0-5 range corresponding to types : string,numeric,numeric_incr, numeric_incr16, numeric_incr8, numeric_ incr32

•

token_set_size_list The size of each token set in bytes, encoded on 64 bit unsigned integer.

•

token_sets_list The list of all token sets. Each token set is compressed according to its type. String tokens consists in the concatenation of this token for all read names of the group then compressed with zstd library. numeric_incr tokens are encoded as differential value from token in the preceding read, and encoded with 8, 16 or 32 bit unsigned integer. The value for the first read is encoded as a 64 bit signed integer.

Fallback mode

If read names in the block do not all have the same number of tokens, then read names are compressed in this fallback mode. In that case read names are concatenated together, with each read name delimited by a null byte. All printable ASCII characters are allowed in read names. It is then compressed with zstd library, with level implementation dependent. The mode used to compress read names is recorded in the ora block header, with bit flag 0x10.

Sequence size data section

This data section encodes the list of DNA sequence sizes stored in the DNA sequence data section in the case they are not all the same in this block. If they are all the same then this section is empty and read size is stored in the block header once.

The sizes are encoded by group of 8 size values called a sequence size group. Each group starts with the flags:

Table 5: ora sequence size group
Field	Description	Type
same_flag	same size as previous read	uint8_t
ize_flags	defines the encoding type of the size	uint8_t

•

same_flag

One byte value. Bit i (with bit 0 the most significant bit) is set to 1 if i-th size in the group is equal to the previous value.

•

size_flag

One byte value. Bit i is set to 1 if i-th size of the group is encoded on a single byte. After the two flags values, each size is encoded according to its type:

•

same size In the case the same flag is set, then no size is stored, it equals the previous one.

•

size < 256 If size_flag is set to 1, then the read size is encoded as a single byte value.

•

size > 256 if both the same_flag and size_flag are set to 0 then the read size is encoded as a list of 16 bit unsigned integers. The size S is decomposed with the euclidian division S = QK + R, with K = 32768, and then stored as a list of Q+ 1 16 bit unsigned integers: Q times value 32768 and one time value R. This is a kind of prefix code tuned for small sequences.

The data section is then further compressed with zstd library, with implementation-dependent level.

Encryption

Overview

The ora format optionally supports encryption of the file with an hybrid scheme. Data sections are encrypted using a symmetric cipher, whose encryption key is stored in the file after encryption by an asymmetric cipher.

From the user perspective, the file is encrypted with an asymmetric scheme: a public key is used for encoding the file, and the associated private key is required to decipher and read the file. Internally, a new key K is randomly generated for the symmetric cipher for each new ora block, and this key is encrypted with the asymmetric cipher with the user provided public key and stored in the block. Data sections of the block are encrypted with the randomly generated key K. When reading the file, the key for the symmetric cipher is deciphered with the user provided private key, and use to decipher the data section with the symmetric cipher.

This scheme allows for the convenience of an asymmetric scheme for the user, while retaining the high speed of a symmetric cipher for the encryption of the file. The symmetric cipher used is AES with a 128 bit key, and the asymmetric cipher is RSA with 4096 bit key. Implementations of RSA and AES are those provided by the CryptoPP library¹.

Cipher keys

When the file is encrypted, the “crypto data” section hold the following data :

Field	Description	Type
RSA-digest	fingerprint of public RSA key	32 bytes
magic	ora magic string 0x7C49	uint16_t
aes	AES key	uint128_

Fields “magic” and “aes” are stored encrypted with RSA with the user-provided public key. A 32 bytes RSA-digest of the RSA public key used is stored in the file for user convenience. The ora magic string encrypted with RSA may be used as a quick test when trying to decrypt an encrypted ora file to check if the user provided private key is the correct one.

Encryption

When encryption is activated all data sections are individually encrypted with AES with a 128 bit key. The header of each ora block is not encrypted and contains in particular a flag specifying if the block is encrypted or not. Each block is encrypted with AES in CTR mode with a 128 bit initialization vector. Each data section has its own randomly generated initialization vector that is stored at the end of the data section.