Jesse Dyer is Digital Archivist at University of Melbourne Archives.
Monitoring the wide variety of formats in the University of Melbourne’s digital preservation repository, Preservica, is undertaken by the Digital Stewardship (Research) team. I recently had the opportunity to work with colleagues in this team on identifying file format signatures.
Identifying file formats by matching their signatures to those in the PRONOM registry is far more reliable than using the extension alone. Accurate identification not only facilitates preservation, it also informs our description of born-digital material at the University of Melbourne Archives.
By submitting any signatures we identify to the PRONOM registry, we enable other collections and preservationists to benefit from our efforts and accurately identify the same types of files.
One file format in our collection that I found particularly interesting to investigate is ‘.oma’, which was used by Sony MiniDisc players.
Two examples of ‘.oma’ in our collection are from the born-digital audio recordings made by Germaine Greer. She made this field recording of bird song, and an interview recording about the Book of Psalms on MiniDisc.
MiniDisc
Developed by Sony as an alternative to the compact disc (CD) for portable music players, MiniDisc had a relatively short period of popularity before it was edged out of the market by hard-disk and flash-memory-based music players such as the iPod.
Because of this short lifespan, and especially because of its proprietary design, MiniDisc presents a number of challenges to digital preservationists. Sony designed the file format for MiniDiscs to include digital rights management (DRM) in order to control the transfer or duplication of published recordings. Thankfully, the MiniDiscs we hold in our collection at the University of Melbourne Archives do not contain DRM.
With MiniDisc, Sony was able to achieve an audio quality that is perceptually similar to CD. Despite being smaller, a MiniDisc also has the same recording duration as a CD. This was made possible by the development of the Adaptive Transform Acoustic Coding (ATRAC) compression algorithm, which, you guessed it, is also proprietary.
A further complicating factor in the preservation of ‘.oma’ files is that the files can contain audio data encoded as either ATRAC or pulse code modulation (PCM).
Signatures
In our initial research we gathered information about the structure of the format. In order to identify the format’s signature, we used a hex editor with a compare function (currently my favourite is dhex). This allows two files to be viewed side by side and highlights any identical byte sequences. We found that, although the number of bytes in the header varied, the header and audio data sections of the file each began with an identifiable and unique signature.
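The side-by-side comparison described above can also be approximated in a few lines of Python. This is only a minimal sketch of the idea, not the tool we used: the function name `shared_runs`, the `min_len` threshold and the sample bytes are all made up for illustration and are not the real ‘.oma’ signature.

```python
def shared_runs(a: bytes, b: bytes, min_len: int = 4):
    """Return (offset, run) pairs where a and b hold identical bytes
    at the same offset for at least min_len consecutive bytes."""
    runs = []
    start = None  # offset where the current matching run began
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] == b[i]:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, a[start:i]))
            start = None
    # Close a run that reaches the end of the shorter input.
    if start is not None and limit - start >= min_len:
        runs.append((start, a[start:limit]))
    return runs


# Hypothetical sample data: two "files" sharing a 5-byte prefix.
file_a = b"HDR1\x00\x01xxxx"
file_b = b"HDR1\x00\x02yyyy"
for offset, run in shared_runs(file_a, file_b):
    print(offset, run.hex(" "))
```

Because this compares bytes at the same offset only, it would pick up a signature at the start of a file but not one that follows a variable-length header; finding the latter would require searching for the candidate sequence in each file instead.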
Using Ross Spencer’s signature development tool we created a custom signature file for DROID. We gathered a selection of ‘.oma’ files from different sources and tested our signature against them to ensure they all matched. We also tested a range of other files to ensure there were no false positives.
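The logic of that testing step can be sketched as follows. This is an illustration only and assumes a simple fixed-offset byte signature; it does not use DROID or its signature file format, and the names `matches_signature` and `check_corpus`, the magic bytes `b"DEMO"` and the sample file names are all hypothetical.

```python
def matches_signature(data: bytes, magic: bytes, offset: int = 0) -> bool:
    """True if the bytes at the given offset equal the candidate signature."""
    return data[offset:offset + len(magic)] == magic


def check_corpus(samples: dict[str, bytes], magic: bytes, expected_ext: str):
    """Return (missed, false_positives) for a candidate signature.

    missed: files with the expected extension that did not match.
    false_positives: files with another extension that did match.
    """
    missed, false_positives = [], []
    for name, data in samples.items():
        is_target = name.lower().endswith(expected_ext)
        hit = matches_signature(data, magic)
        if is_target and not hit:
            missed.append(name)
        elif hit and not is_target:
            false_positives.append(name)
    return missed, false_positives


# Hypothetical test corpus held in memory rather than read from disk.
corpus = {
    "birdsong.oma": b"DEMO\x00\x01audio",
    "interview.oma": b"DEMO\x00\x02audio",
    "notes.txt": b"plain text",
}
missed, false_positives = check_corpus(corpus, b"DEMO", ".oma")
print("missed:", missed)
print("false positives:", false_positives)
```

A signature is only worth submitting when both lists come back empty across a varied set of test files.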
Unfortunately, not all file formats have an identifiable signature like the ‘.oma’ format does. Some of the other formats in the repository that we investigated could only be identified by the file extension.
In the process of collecting and preserving born-digital material at the University of Melbourne Archives we are likely to encounter more of these uncommon formats which are not yet registered in PRONOM. Proactively researching these formats is important not only for preserving our own collection material, but also as a way of contributing to the wider digital preservation community.