Matthew Addis is Co-Founder and CTO of Arkivum, based in the UK.
What do we mean when we talk about data integrity? What is data integrity and why is it an important part of digital preservation? Data integrity means different things to different people. If you put that question to an IT professional, a compliance officer, a corporate archivist, a research librarian, or the curator of a special collection, then you will typically get very different answers. What we mean by data integrity depends on what we mean by data, how and why that data exists, and, most importantly, who is using it and for what purpose. There is no single answer. But in all cases, digital preservation has a role to play in achieving data integrity, and that makes life in the digital preservation world both challenging and interesting!
If you come at the question of ‘what is data integrity?’ from an IT angle, then data integrity is sometimes seen as synonymous with preventing data corruption and loss – either accidental or deliberate. Data integrity, from this limited perspective, is about keeping your bits and bytes in good condition, which could be files on a local server, objects in the cloud, or records in a database. Approaches such as information security are used to prevent data being changed, destroyed or lost in an unauthorized or accidental manner. Approaches such as checksums, fixity checks, replication and backups are used to measure, detect and recover from data corruption or loss should it occur. Together, these help to achieve data integrity at the bitstream level. This is a fundamental building block of digital preservation and is widely seen in good practice, for example within the NDSA levels of digital preservation, especially in the Storage and Integrity areas, in the DPC RAM, in particular as part of Bitstream preservation, and in CoreTrustSeal, for example within the requirement for Data integrity and authenticity. Now, to be fair, these examples of good practice support a wider definition of data integrity, and they of course cover many other aspects of digital preservation too. The scope of good practice also evolves over time: it is interesting to note that the NDSA levels v1 explicitly talked about ‘File Fixity and Data Integrity’ but this has since been generalised to ‘Integrity of Content’ in v2. But nonetheless, the NDSA levels, DPC RAM and CTS all include, and build upon, data integrity as something that is about data corruption or loss. They consider integrity as something to be achieved, at least in part, by using technological approaches such as digests, signatures and fixity checking tools. For example, the CTS glossary defines Integrity as “Internal consistency or lack of corruption of digital objects. Integrity can be compromised by hardware errors even when digital objects are not touched, or by software or human errors when they are transferred or processed.” The entry for Integrity in the NDSA glossary says “see Fixity Check”, which then goes on to say that Fixity Checks are “A mechanism to verify that a digital object has not been altered in an undocumented manner. Checksums, message digests and digital signatures are examples of tools to run fixity checks. Fixity information, the information created by these fixity checks, provides evidence for the integrity and authenticity of the digital objects…”. The DPC glossary has no entry at all for Integrity, but it does mention the word once under Fixity Check: “a method for ensuring the integrity of a file and verifying it has not been altered or corrupted…”. It is interesting that these are all relatively focussed on the technical aspects of data integrity. Not that there is anything wrong with that of course!
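To make that bitstream-level view concrete, here is a minimal sketch in Python of how a checksum might be recorded at ingest and re-checked later to detect corruption or change. The file name is hypothetical and this is illustrative only; in practice the kinds of fixity tools referenced in the NDSA and DPC guidance would do this at scale and store the fixity information alongside other preservation metadata.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute a SHA-256 digest by streaming the file in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(path: Path) -> dict:
    """Record fixity information at ingest: file name, size and checksum."""
    return {
        "file": str(path),
        "size": path.stat().st_size,
        "sha256": sha256_of(path),
    }


def check_fixity(path: Path, recorded: dict) -> bool:
    """Later fixity check: recompute the checksum and compare it with what
    was recorded. A mismatch signals corruption or undocumented change."""
    return sha256_of(path) == recorded["sha256"]


if __name__ == "__main__":
    # 'report.pdf' is a hypothetical file used purely for illustration.
    record = record_fixity(Path("report.pdf"))
    print("Fixity intact?", check_fixity(Path("report.pdf"), record))
```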
If you come at the question of ‘what is data integrity?’ from the perspective of an archivist managing the records of drug development and manufacturing who needs to support regulatory inspections, then the answer is quite different. Well, superficially at least it looks to be different. For example, the MHRA guidelines on GXP data integrity say that Data Integrity is “the degree to which data are complete, consistent, accurate, trustworthy, reliable and that these characteristics of the data are maintained throughout the data life cycle. The data should be collected and maintained in a secure manner, so that they are attributable, legible, contemporaneously recorded, original (or a true copy) and accurate. Assuring data integrity requires appropriate quality and risk management systems, including adherence to sound scientific principles and good documentation practices.” Data in this context is defined to be “Facts, figures and statistics collected together for reference or analysis. All original records and true copies of original records, including source data and metadata and all subsequent transformations and reports of these data, that are generated or recorded at the time of the GXP activity and allow full and complete reconstruction and evaluation of the GXP activity”. In this case, data integrity starts from day zero, when data is first born. It includes raw data, derived data, metadata, logs and anything else that forms a record of what was done at the time the data was created and from that point onwards. The purpose of data integrity is also stated: to enable a full and complete reconstruction of what was done. This is something that can be done with the data, e.g. proving that a drug trial was done correctly, and is not just a property of the data itself, e.g. proving that a file hasn’t been corrupted. In the MHRA definition, data integrity also involves a set of processes as well as needing to support an outcome. And data integrity needs systems, but systems in the sense of governance and risk management. This goes a whole lot further than, but crucially also builds upon, the more restricted and technology-oriented definition of data integrity that I started with in this post. Only by building on the technical foundations of data integrity in the IT sense of the term can you go on to achieve data integrity from a regulatory perspective.
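As an illustration of how the technical and regulatory views connect, the sketch below shows one simple way an audit-trail entry might capture who did what, when, and to which data, with a checksum tying the record back to the underlying file. This is not drawn from the MHRA guidance itself; the file name, actor and log format are made up, and the point is only that ‘attributable, contemporaneously recorded’ information sits on top of bitstream-level fixity.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of the underlying file, linking the record to the bitstream."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def audit_entry(path: Path, actor: str, action: str) -> dict:
    """A minimal, illustrative audit-trail entry: attributable (actor),
    contemporaneously recorded (UTC timestamp), and tied to the original
    data via its checksum."""
    return {
        "file": str(path),
        "sha256": sha256_of(path),
        "actor": actor,
        "action": action,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # 'assay_results.csv' and the actor name are hypothetical examples.
    entry = audit_entry(Path("assay_results.csv"), "j.smith", "raw data captured")
    # Append-only log: each line is one JSON record forming part of the audit trail.
    with open("audit_log.jsonl", "a") as log:
        log.write(json.dumps(entry) + "\n")
```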
If you come at the question of ‘what is data integrity?’ from the perspective of a researcher who uses or produces data as part of their research, or perhaps from the perspective of a research librarian who is tasked with providing research data to a community, then the answer is different again. There isn’t a clear definition of data integrity in this context, but undoubtedly data integrity is an important part of, and result of, the wider area of Research Integrity. Data integrity is also part of ensuring that research data is Findable, Accessible, Interoperable and Reusable (FAIR). FAIR promotes high-quality research that follows good research practice where the results are repeatable, verifiable and re-usable – which clearly depends on data integrity. Only then can research data be used with confidence and be exploited to its full potential. Maintaining data integrity is also an implicit part of the operations of a trusted research data repository that follows the principles of Transparency, Responsibility, User focus, Sustainability and Technology (TRUST). FAIR applies to data and TRUST applies to repositories that hold data. Both involve data integrity. But you can’t reuse data from a research data repository (or any other source for that matter) if you can’t trust it. And you can’t trust data if you don’t know its authenticity, completeness, correctness, who created it and how they did it. And if you are reusing research data created in the past, for example from a TRUST repository, then this is all pointless if data integrity isn’t maintained over time in that repository and isn’t provable. And that requires digital preservation. The answer to the question of ‘what is data integrity?’ is that data integrity is an important aspect of ensuring research data is trustworthy and reusable. Admittedly this is more of an outcome than a definition, but sometimes the objective is more important to define than the means.
Where does all this discussion on data integrity leave us? Three ‘answers’ to the question ‘what is data integrity?’ from three different perspectives. There is a lot of commonality, even if the terminology is different in each of these domains. In all cases, data integrity is both a set of processes and a desired outcome. The purpose of data integrity might be different (a compliance inspection vs. reuse of data in open science), but the methods and approaches are similar. And above all, the ‘data’ in ‘data integrity’ means more than just files or bitstreams: it means content in general, including metadata, provenance, audit trails and other contextual information such as how data is organised and arranged, how it was created and processed, and how it can be checked and verified. Data is complex and so too is data integrity. And this will only get more complicated and challenging as the types of data that need to be preserved evolve and grow – especially as data becomes ever more entwined with the software and services used to create and use it. The boundaries between data and applications are increasingly blurred and the need for integrity applies to both. New tools and techniques will be needed to test, record, verify and recover data integrity in the broadest sense. There is plenty of research still to be done into achieving data integrity for complex content. New services and solutions will surely emerge in the marketplace as a result. It seems to me that ‘data integrity’ is a worthy, universal, interesting and challenging part of digital preservation that will surely keep many of us busy for years to come.