Matthew Addis is Co-Founder and CTO of Arkivum based in the UK.
Email Monkey Magic
Email preservation is one of those areas that covers almost every digital preservation issue in the book. This blog post describes my journey into the world of email preservation - what I learnt, what I did, and what we've now built into Arkivum's Perpetua solution. To be honest, at times it did feel more like the trials of Monkey in Journey to the West, but I got to do some cool email magic on the way!
When one of our customers asked me to define and implement an email preservation strategy for them in Perpetua, it immediately threw up a whole slew of thorny questions such as 'what sort of emails do you need to keep and how are you going to select them?', 'how do you want to deal with privacy and sensitive content?', 'what significant properties of your emails need to be maintained?', 'should I go for migration or emulation based preservation - or do I need both?', and 'do you want to discover, search and access email alongside all your other archive content, or do you want a dedicated environment just for email?'. The list got long very quickly.
There are lots of guidelines and suggestions out there for email preservation.
These include the DPC tech watch report on email preservation (I was lucky enough to read the preview), the great work of the Email Archives Task Force, and some really good DPC workshops such as 'Email Preservation: How Hard Can it Be?'. These are great resources, but they do reveal that there is no single solution to email preservation and there are no universally accepted standards or approaches either. Nothing new there then, this is the world of digital preservation!
Being a hands-on techie geek kind of guy, I immediately looked at possible technical solutions too. Could we use a dedicated email archiving system such as ePADD from Stanford? Could we try virtualising conventional email software such as Outlook and embedding that into our solution to render/display emails? Could we add support for email into a general purpose Archive Information Management system such as ArchivesSpace or AtoM? Could we send email off to a commercial service for email archiving? Could we look to the world of digital forensics and e-Discovery? Could we leverage the digital preservation technologies we already embed into the Arkivum Perpetua solution, e.g. Archivematica?
The best approach depends on what you want to achieve with email preservation.
Back to the customer on that one. Especially considering the question of how they wanted email preservation to fit alongside what they were already doing with other content types (images, videos, web archives, documents etc.). I then did what any self-proclaimed digital preservation professional would do and I followed the usual dogma of looking at significant properties (helped by the round-up of email in significant significant properties), analysing who would be using the email archive and how (designated community), and what file formats would be suitable for both preservation and access (OAIS AIPs and DIPs).
The conclusion we came to for the preservation format of email was conceptually simple – convert everything into a widely used open email format (EML), extract any attachments (e.g. images, docs, ppts etc.), and preserve/store the emails and attachments as files and folders on a file system. In the case of multiple emails in a PST file, we create an mbox version as well as unpacking the PST into individual EML files in a folder tree that represents the PST structure. Implementation was done by extending Archivematica’s extract packages function (normally used for unpacking zips, tars etc.) so it would also extract emails. For example, we recursively unpack PSTs into EMLs, unpack each of those to extract any attachments, and if those attachments happen to be more emails or other types of package such as zips then those get unpacked too. The benefit of this approach is that we can implement simple rules (extract PST to EML, convert MSG to EML, extract EML attachments etc.) and then Archivematica will recursively apply these rules and chain them together. Everything then goes through further file format identification, characterisation and optional file format normalization. This means that all the email file attachments can also get fully processed and preserved. For example, we also extended Archivematica so that documents and presentations attached to emails will get converted to PDF/A. Handling email attachments is an important part of email preservation. Have a look at the Enron email dataset as an example – this contains over 1 million emails with attachments in over 200 different file formats!
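To give a flavour of how simple rules can chain together recursively, here is a minimal Python sketch. It is not the Archivematica implementation: the rule table, the function names (`extract_eml`, `extract_zip`, `unpack`) and the `.extracted` folder naming are all assumptions made for illustration, and it only covers EML and zip using the standard library (PST and MSG would need extra tools).

```python
from __future__ import annotations

import zipfile
from email import policy
from email.parser import BytesParser
from pathlib import Path


def extract_eml(path: Path, out_dir: Path) -> list[Path]:
    """Write each attachment of an EML file into out_dir."""
    with path.open("rb") as fh:
        msg = BytesParser(policy=policy.default).parse(fh)
    written = []
    for index, part in enumerate(msg.iter_attachments()):
        # Keep only the base name so a hostile filename cannot escape out_dir.
        name = Path(part.get_filename() or f"attachment-{index}").name
        if part.get_content_type() == "message/rfc822":
            # An attached email: serialise it so the EML rule can fire again
            # (this only happens if the attachment name ends in .eml).
            data = part.get_content().as_bytes()
        else:
            data = part.get_payload(decode=True) or b""
        target = out_dir / name
        target.write_bytes(data)
        written.append(target)
    return written


def extract_zip(path: Path, out_dir: Path) -> list[Path]:
    """Unpack a zip file into out_dir and list the files it contained."""
    with zipfile.ZipFile(path) as zf:
        zf.extractall(out_dir)
        return [out_dir / name for name in zf.namelist() if not name.endswith("/")]


# Hypothetical rule table: file extension -> extraction rule.
RULES = {".eml": extract_eml, ".zip": extract_zip}


def unpack(path: Path, depth: int = 0, max_depth: int = 10) -> list[Path]:
    """Apply the matching rule, then recurse into whatever it produced."""
    rule = RULES.get(path.suffix.lower())
    if rule is None or depth >= max_depth:
        return [path]  # a leaf: nothing more to unpack
    out_dir = path.with_name(path.name + ".extracted")
    out_dir.mkdir(exist_ok=True)
    leaves = []
    for child in rule(path, out_dir):
        leaves.extend(unpack(child, depth + 1, max_depth))
    return leaves
```

Calling `unpack(Path("mail.eml"))` returns the leaf files left once no rule applies any more, which are the ones that would then go on to identification, characterisation and normalization. The depth limit is a simple guard against pathological nesting.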
It would be wrong to imply that the journey was plain sailing.
Some aspects were far easier than I expected, e.g. extracting PSTs into individual EML files, which can be done using off-the-shelf tools such as readpst. Other aspects were harder, especially MSG emails, which are in an Outlook binary format. Our customer has over 2 million of these in their EDRMS that they want to export and preserve. Oh joy. Whilst EML is simple and contains text or HTML versions of an email body plus MIME typed attachments, MSG is much more complicated. Microsoft, in their infinite wisdom (read twisted and evil), do things like encapsulate the HTML body of an email in RTF, compress the RTF, and embed the compressed RTF in the MSG, which is itself a Compound File Binary Format file. Therefore, converting MSG to EML took some work (and a fair bit of Python code if you don’t want to use proprietary tools)!
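Before any conversion can happen, a pipeline has to work out which kind of email file it is looking at. The Python sketch below guesses this from the first few bytes. The function name and return labels are assumptions for illustration; the two signatures are the well-known magic numbers for Compound File Binary Format and for PST files. Note that the Compound File signature is shared with legacy Office documents, so a "cfb" result only says the file might be an MSG.

```python
import re

CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # Compound File Binary signature
PST_MAGIC = b"!BDN"                            # Outlook PST/OST signature
HEADER_LINE = re.compile(rb"[A-Za-z][A-Za-z0-9-]*:")


def sniff_email_container(header: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    if header.startswith(CFB_MAGIC):
        # Could be MSG, but also an old .doc or .xls: needs a closer look.
        return "cfb"
    if header.startswith(PST_MAGIC):
        return "pst"
    if header.startswith(b"From "):
        # An mbox file starts with a "From " separator line (no colon).
        return "mbox"
    if HEADER_LINE.match(header):
        # Plain text starting with "Name:" looks like an RFC 5322 header.
        return "eml"
    return "unknown"
```

In practice you would read something like the first 64 bytes of each file and route it to the matching extraction or conversion rule. A proper format identification tool will be more reliable than this heuristic, especially for EML, which has no magic number at all.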
It can be a bit of an effort getting everything converted to EML - in particular making sure the HTML email body maintains the right fonts, character sets, layout, inline images etc. But when done, EML is very simple to deal with. Validation checks are also easy because EML is ubiquitous and can be readily opened and eyeballed in Outlook, Thunderbird, Mac Mail and other clients. We always keep the original emails in their original format so the EML version isn’t a full replacement. What it does do however is provide a single canonical email format to convert to, and then, as I’ll talk about below, also to convert from in order to create access versions. It’s kind of like a pivot format for email that everything goes to and from.
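Part of what makes EML easy to deal with is that standard libraries can read it directly. As a rough illustration, the sketch below uses Python's built-in `email` package to run a few basic sanity checks on a converted EML file. The function name and the particular checks are assumptions chosen for the example, not a description of the checks in Perpetua, and they complement rather than replace opening the file in a real mail client.

```python
from email import policy
from email.parser import BytesParser


def check_eml(raw: bytes) -> list:
    """Return a list of human-readable problems found in an EML file."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    problems = []
    for header in ("From", "Date", "Subject"):
        if msg[header] is None:
            problems.append(f"missing {header} header")
    # Defects the parser recorded on the top-level message (e.g. malformed MIME).
    problems.extend(f"parse defect: {type(d).__name__}" for d in msg.defects)
    body = msg.get_body(preferencelist=("html", "plain"))
    if body is None:
        problems.append("no text or HTML body")
    else:
        try:
            body.get_content()  # forces transfer and charset decoding
        except (LookupError, UnicodeError) as exc:
            problems.append(f"body cannot be decoded: {exc}")
    return problems
```

An empty list means the file passed these particular checks. It says nothing about whether fonts, layout or inline images survived the conversion, which still needs a visual comparison.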
The decision on the access version of email was interesting.
The customer wanted to be able to search/view emails in the same environment as all their other archive content (images, videos, documents etc.) and use the same approach to organising their content into archival descriptions using ISAD(G). They wanted to be able to view emails through a web browser with no need for client-side software, but also for archive users to be able to download emails as documents that were clearly identifiable as an archive version. This meant using a separate email environment such as ePADD wouldn’t work and neither would using EML as the access version of emails. The decision we made was to convert emails into PDF/A for access and make them accessible through AtoM, which is also integrated into our Perpetua solution. This means emails can be arranged/described/searched/navigated alongside everything else that the customer has in their archive. Having created EML versions of emails with HTML bodies, the process of creating PDF/A for access was one of extracting the HTML and any inline images and then doing an HTML to PDF conversion. We top and tail the HTML body with the email header info, e.g. the usual suspects of ‘to’, ‘from’, ‘date’, ‘subject’ at the top and then all the other email fields, which are often hidden by email clients, at the bottom. This means the ‘access’ version of the email is very transparent and everything also gets full text indexed by AtoM.
An example is shown below. This is the access version of a PST file from the dataset used in the ‘Great Digital Preservation Bake Off’ at iPRES this year. The PST structure and one of the emails with its attachments are on the right, and the PDF version of the email is on the left. You can navigate the hierarchy and see thumbnail carousels of the emails and attachments.
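The "top and tail" step can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions rather than the code in Perpetua: the function name, the choice of primary headers and the table layout are made up for the example, inline images referenced by `cid:` are not resolved, and the final HTML to PDF/A conversion (which needs an external tool) is left out.

```python
import html
from email import policy
from email.parser import BytesParser

PRIMARY = ("From", "To", "Date", "Subject")  # shown at the top of the page


def access_html(raw: bytes) -> str:
    """Wrap an email body with its main headers above and the rest below."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    body = msg.get_body(preferencelist=("html", "plain"))
    if body is None:
        content = ""
    elif body.get_content_subtype() == "html":
        content = body.get_content()
    else:
        # Plain-text bodies are escaped and kept preformatted.
        content = "<pre>" + html.escape(body.get_content()) + "</pre>"

    def rows(items):
        return "".join(
            f"<tr><th>{html.escape(k)}</th><td>{html.escape(str(v))}</td></tr>"
            for k, v in items
        )

    primary = {name.lower() for name in PRIMARY}
    top = [(name, msg[name]) for name in PRIMARY if msg[name] is not None]
    rest = [(k, v) for k, v in msg.items() if k.lower() not in primary]
    return (
        f"<html><body><table>{rows(top)}</table><hr>"
        f"<div>{content}</div><hr>"
        f"<table>{rows(rest)}</table></body></html>"
    )
```

Header values are escaped because addresses such as `A <a@example.org>` would otherwise be swallowed as HTML tags. A real implementation would also need to merge the wrapper with bodies that are already complete HTML documents, which this sketch simply nests inside a `div`.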
Conclusion
I wouldn’t claim that our approach to email preservation and access is perfect or that it’s suitable for all occasions. There are other interesting alternatives to consider. For example, I’m particularly interested to see how EaaSI could be used to virtualise and run email software, so that users could view old email formats in an authentic way and experience them in contemporaneous email clients. Likewise, dedicated email environments such as ePADD have the potential to enable more sophisticated search/navigation of email collections as a curated corpus rather than individual files or mailboxes. The RATOM project looks like it could develop some very useful tools for the review, assessment and triage of emails, especially when dealing with sensitive content and the need for redaction. Email preservation with all its options and choices is also a perfect fit for the Preservation Action Registries project (PAR) of which Arkivum is a member. The PAR specification and PAR API would allow what I’ve talked about in this post to be formalised in a way that could be easily shared with and built upon by the community.
Email preservation is certainly an interesting adventure and I’m looking forward to plenty of new lands yet to be explored!