Wednesday, March 10, 2010

Reading Binary Files

To handle DNA data coming out of old and new generation sequencing machines, I needed to learn how to read the binary SCF files that are typically output by the software these machines use. To do so, I needed to know the byte format of these files, and I had to translate from C definitions of the data types to IDL data types.

The byte format page describes in detail how the bytes are ordered. There's a header, some comments, base calls, and traces. The base calls include probabilities and a cross-reference to the trace position where the base was called. The traces themselves are stored as differences between consecutive samples rather than as absolute values, so they have to be summed back up before they look like a chromatogram you're used to.
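As a rough illustration of undoing that differencing, here is a minimal sketch. It assumes 16-bit samples and, as I read the SCF 3.x format description, that the differencing was applied twice, so the running sum is applied twice as well. The function name scf_undelta is made up for this example, and delta is assumed to already hold one channel's raw values as read from the file.

```idl
; Sketch only: recover absolute trace values from SCF 3.x delta samples.
; Assumption: 16-bit samples, differenced twice when the file was written.
function scf_undelta, delta
  compile_opt idl2

  ; Unsigned 16-bit arithmetic wraps around on overflow, which is what
  ; lets "negative" differences stored as large unsigned values cancel out.
  samples = uint(delta)

  ; Two passes of a running sum undo two rounds of differencing.
  for pass = 0, 1 do begin
    prev = uint(0)
    for i = 0L, n_elements(samples) - 1 do begin
      samples[i] = samples[i] + prev
      prev = samples[i]
    endfor
  endfor

  return, samples
end
```

For example, the trace values 10, 20, 35, 30 differenced twice become 10, 0, 5, -20, and -20 is stored as 65516 in an unsigned 16-bit field. Passing [10US, 0US, 5US, 65516US] to the function above should give back 10, 20, 35, 30.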

The biggest trick in this was figuring out how to use IDL's READU procedure together with OPENU (or OPENR), the latter with the SWAP_ENDIAN keyword set. The byte ordering (at least in the version 3.00 SCF files I used) is big-endian, while IDL reads in the native byte order of the machine by default, which on my machine was the opposite. Also note that READU advances the file pointer by the size in bytes of the variable you read into, so the variable's type and dimensions have to be right before the call. I found this out the hard way and only later read it in the documentation.
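To make this concrete, here is a minimal sketch that opens a file with byte swapping and reads the first four header fields. The field order (magic number, number of samples, samples offset, number of bases) is how I read the SCF format description; the procedure name and variable names are made up for illustration, and a little-endian machine is assumed.

```idl
; Sketch only: read the first few header fields of an SCF file.
pro scf_header_sketch, filename
  compile_opt idl2

  ; /SWAP_ENDIAN byte-swaps every multi-byte value as it is read.
  ; This assumes a little-endian host; /SWAP_IF_LITTLE_ENDIAN is the
  ; portable alternative.
  openr, unit, filename, /get_lun, /swap_endian

  magic  = bytarr(4)     ; 4 single bytes, unaffected by swapping
  fields = ulonarr(3)    ; 3 unsigned 32-bit integers = 12 bytes

  ; READU reads exactly as many bytes as each variable occupies,
  ; so after this call the file pointer sits at byte 16.
  readu, unit, magic, fields

  print, 'magic:          ', string(magic)   ; '.scf' for a valid file
  print, 'samples:        ', fields[0]
  print, 'samples offset: ', fields[1]
  print, 'bases:          ', fields[2]

  free_lun, unit
end
```

If the sample count comes out as an absurdly large number, the byte order is most likely still wrong for your machine.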

I found the POINT_LUN procedure (POINT_LUN, unit, pos) most useful for moving and tracking the file pointer. Called with the logical unit number, it sets the pointer to byte offset pos. Called with the negative of the logical unit number (POINT_LUN, -unit, pos), it instead returns the current position in pos.
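A short sketch of both uses follows. It assumes the samples offset is the 32-bit field at byte 8 of the header, as I read the SCF format description, and that the machine is little-endian; the procedure name is made up for illustration.

```idl
; Sketch only: set and query the file pointer with POINT_LUN.
pro scf_pointer_sketch, filename
  compile_opt idl2

  openr, unit, filename, /get_lun, /swap_endian

  ; Positive unit: move the pointer to an absolute byte offset.
  ; Offset 8 skips the 4-byte magic number and the 4-byte sample count.
  point_lun, unit, 8

  samples_offset = 0UL          ; unsigned 32-bit, so READU takes 4 bytes
  readu, unit, samples_offset

  ; Negative unit: query the current position instead of setting it.
  point_lun, -unit, pos
  print, 'pointer after read: ', pos        ; 8 + 4 = 12

  ; Jump straight to where the trace samples start.
  point_lun, unit, samples_offset
  point_lun, -unit, pos
  print, 'pointer at samples: ', pos

  free_lun, unit
end
```

Querying the position after each READU is a cheap way to check that the variable types you declared add up to the byte counts in the format description.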
