One of my favourite parts of the Digital Preservation Workbench we launched at iPRES 2024 is the 'Format Diversity Estimation'. It's based on the realisation that we could apply approaches from the study of ecological species diversity to the digital format ecosystem, and use them to estimate how many formats are out there awaiting analysis. This matters because identifying the format of digital resources is a crucial step towards understanding the information and software dependencies we need in order to make future access possible.

[Figure: Digital Formats Species Accumulation Curve]

The analysis showed there are likely to be at least 12,000 distinct formats in the digital world. The PRONOM digital preservation format registry that many of us depend on, even after twenty years of support and investment by the UK National Archives and contributions from a wide range of international collaborators, covers just 1,700 of them.
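
For anyone curious how a lower-bound estimate like this can be produced, here is a minimal sketch of one standard estimator from ecology, the incidence-based Chao2 estimator. To be clear, this is an illustration under my own assumptions and not necessarily the method the Workbench uses: the function name `chao2_estimate` and the toy institutional profiles are hypothetical. The idea is that formats seen at only one or two institutions tell us something about how many have not been seen at all.

```python
from collections import Counter


def chao2_estimate(profiles):
    """Estimate a lower bound on total format richness (Chao2).

    `profiles` is a list of collections of format identifiers,
    one per institution (hypothetical input shape, for illustration).
    """
    m = len(profiles)
    if m == 0:
        return 0.0

    # Count how many institutions each format appears in.
    incidence = Counter(fmt for profile in profiles for fmt in set(profile))

    s_obs = len(incidence)                              # formats observed at all
    q1 = sum(1 for n in incidence.values() if n == 1)   # seen at exactly one institution
    q2 = sum(1 for n in incidence.values() if n == 2)   # seen at exactly two institutions

    correction = (m - 1) / m
    if q2 > 0:
        return s_obs + correction * q1 * q1 / (2 * q2)
    # Bias-corrected form, used when no format appears at exactly two institutions.
    return s_obs + correction * q1 * (q1 - 1) / 2


# Toy data: four made-up institutional format profiles.
profiles = [
    {"pdf", "tiff", "wpd"},
    {"pdf", "tiff", "docx"},
    {"pdf", "jpg", "docx", "dwg"},
    {"pdf", "jpg"},
]
estimate = chao2_estimate(profiles)
print(estimate)  # 6.5
```

In this toy example six formats are observed, two of them at a single institution and three at exactly two, so the estimate nudges the total up to 6.5. The more one-off formats turn up relative to two-off formats, the larger the estimated number of formats still unseen.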

We've come a long way, but there's clearly also a long way to go.

This becomes even more challenging when combined with the evidence from collecting format profiles from different institutions. Not only does almost every institution hold a handful of formats that appear to be unique to them, they also have their own priorities for which ones need to be identified most urgently. These priorities are often driven by contextual factors (like donor requirements) and not by which formats are the most numerous or widespread.

The scale and breadth of this challenge mean no individual service or product can solve it for us. Instead, we need to support, join and grow the community of people who are able to identify and document digital formats. The more of us have the capability to respond to this challenge, the more chance we have of meeting the dual challenges of the size of our global digital ecosystem and the urgency of our local priorities and needs.

So, this World Digital Preservation Day, I'd like to encourage you to explore this issue and join us as we try to map out our digital ecosystem.

In particular, I would like to encourage anyone with any interest in this issue to take part in the 2024 PRONOM Research Week Hackathon, which begins today! There are lots of different things you could do to help out, and there's a fortnightly Teams call to support you if you want to join in.

The kick-off call is at 1600 GMT today!

When the time comes, use this link to join the Teams call. The folks who run the calls are welcoming and kind, and you don't have to be able to code or willing to stare at raw bytes in order to take part!

Although they might try to convince you that staring at bytes is actually rather good fun and you should try it...

So, for a little while, let's let go of the safety and comfort of those familiar formats: the PDFs, the TIFFs and DOCXs, the JPGs and, yes, even those pesky HTMLs. Join us as we explore the strange gems and rare jewels in the long tails of our file format distributions.

After all, these weird wonders also deserve our care.

Comments

Andrew Jackson
1 month ago
Quoting Andy:
It looks likely to me that there will be many more than 12000, especially if you ignore the 0,0. It could be helpful to see multiple points for each dataset too.

How easy it is to parse and extract information contained in a file is a slightly different issue. For instance, we can get structured data from a file with a “pdf” extension, but might also get a picture of data in a structure that requires more processing to get into a structured form.

I wonder how different the format of files with the same file extension might be.

Thanks :)


Yes, I agree, there are likely to be many more. The goal here is just to establish some kind of credible lower bound on the total.

We're hoping the data we're making available via the Workbench will make further analysis possible in the future!

Thanks!

Andy

Andy
2 months ago
It looks likely to me that there will be many more than 12000, especially if you ignore the 0,0. It could be helpful to see multiple points for each dataset too.

How easy it is to parse and extract information contained in a file is a slightly different issue. For instance, we can get structured data from a file with a “pdf” extension, but might also get a picture of data in a structure that requires more processing to get into a structured form.

I wonder how different the format of files with the same file extension might be.

Thanks :)