Government information relies heavily on the PDF format. Indeed, PDF files are used so widely that it is tempting for us to assume that PDF files are, by definition, a safe way of preserving information for the long term. Would it surprise you to learn that the Digital Preservation Coalition (DPC) lists PDF files as “Endangered” and that even PDF/A files are “Vulnerable”?
These judgements are in the DPC’s newest list of “digitally endangered species.”
- The ‘Bit List’ of Digitally Endangered Species. Digital Preservation Coalition (2019).
The Bit List includes 74 content types and groups them into 5 categories of risk (Lower Risk, Vulnerable, Endangered, Critically Endangered, and Practically Extinct). It provides specific comments on the risks associated with each content type and outlines the good practices needed to address the challenges of preserving them.
The comments say that external dependencies, significant diversity of data, poorly developed digitization specifications, lack of integrity checking, lack of virus control, and lack of validation at the point of creation can raise the risk to the level of “Critically Endangered.”
The comments note that the PDF/A format, which is designed to reduce dependencies and thus curtail preservation risks, “is not sufficient to ensure preservation and users are warned against complacency.” They continue:
PDF/A has sometimes been misunderstood or misrepresented as a generic solution to all digital preservation requirements, whereas in the eyes of the judges it can only offer a preservation solution when embedded within a wider preservation infrastructure.
Vulnerability also depends on if the PDF file conforms to the specific PDF/A standard or not. This is caused by a combination of 1) not conforming to the standard and 2) collection managers assuming that the file is resilient simply because it purports to be a PDF/A. This risk is less with format and more with the understanding and experience in data management.
Of particular concern is the often-overlooked option in the PDF/A standard (introduced with PDF/A-3) that allows a PDF/A file to have files in other formats embedded within it. Thus a PDF file that conforms to the PDF/A specification may include Word and Excel files, or other more obscure file formats, as attachments.
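For collection managers who want a rough first check for such attachments, the idea can be sketched in a few lines of Python. This is only a heuristic under stated assumptions, not a validator: it scans the raw bytes of a file for the name tokens the PDF specification uses for attachments (`/EmbeddedFiles`, `/EmbeddedFile`, and `/Filespec`). The function names and sample data are hypothetical and chosen for illustration.

```python
import re

# Name tokens that the PDF specification associates with file attachments.
ATTACHMENT_MARKERS = (b"/EmbeddedFiles", b"/EmbeddedFile", b"/Filespec")


def find_attachment_markers(pdf_bytes: bytes) -> list[str]:
    """Return the attachment-related name tokens present in raw PDF bytes.

    Heuristic only: tokens stored inside compressed object streams are
    invisible to a raw byte scan, so an empty result is not proof that
    the file has no attachments.
    """
    found = []
    for marker in ATTACHMENT_MARKERS:
        # Require whitespace, a PDF delimiter, or end of data after the name
        # so that "/EmbeddedFile" does not also match "/EmbeddedFiles".
        pattern = re.escape(marker) + rb"(?=[\s/<>\[\]()]|$)"
        if re.search(pattern, pdf_bytes):
            found.append(marker.decode("ascii"))
    return found


def may_contain_attachments(pdf_bytes: bytes) -> bool:
    """True if any attachment-related token appears in the raw bytes."""
    return bool(find_attachment_markers(pdf_bytes))


# Hypothetical, hand-written fragments standing in for real files.
sample_with_names_tree = (
    b"%PDF-1.7\n1 0 obj\n"
    b"<< /Type /Catalog /Names << /EmbeddedFiles 2 0 R >> >>\nendobj\n"
)
sample_plain = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

print(find_attachment_markers(sample_with_names_tree))
print(may_contain_attachments(sample_plain))
```

For a real file, the bytes could be read with `open(path, "rb").read()`. A positive result is a prompt to inspect the file with a full PDF parser or a PDF/A validator, and a negative result should not be treated as a guarantee.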
There is much in the Bit List of interest to those managing digital collections. Most of the items listed are not specific technologies but broad categories of content, such as Cloud Storage, Orphaned Works, and “Semi-Published Research Data.”
Of particular note to government information specialists is the listing of “Records of Local Government” in the CRITICALLY ENDANGERED category.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.