Abstract – Evaluating file formats is an integral component of risk assessment for digital content. Understanding a file from a technical standpoint is important, but equally so is understanding the institution’s ability to preserve it for the long term. In other words, all risk is local. But how can that risk be assessed in a meaningful and consistent way across widely diverse collections? There is no one-size-fits-all approach. Instead, institutions of all sizes can assess format risk in a manner that is appropriate for their organization. Two U.S. federal government agencies, the National Archives and Records Administration (NARA) and the Library of Congress, discuss their approaches to risk assessment through digital file format assessment, including the similarities and differences in their approaches, which were designed to meet the different goals of their institutions. The Library of Congress utilizes a number of mechanisms including the Recommended Formats Statement, which has a defined and publicly available evaluation criteria matrix, as well as research from the Sustainability of Digital Formats. NARA issues Transfer Guidance that identifies preferred and acceptable file formats for the transfer of permanent records for preservation, as well as guidance for federal records managers about specific categories of records, such as email or social media. NARA’s guidance is based on a risk assessment process for individual versions of formats, which is shared publicly through the NARA Digital Preservation Framework. The approaches of both NARA and the Library of Congress were born from a shared framework and diverged over time to meet each institution’s goals and needs.
Keywords – Digital Preservation, File Formats, Risk Assessment, Preservation Planning
This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Dr. Özhan Saglik, Chris Prom and 2 anonymous reviewers. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.
Evaluating file formats is an integral component of risk assessment for digital content. An institution must not only understand a file format from a technical standpoint but also evaluate its unique ability to preserve the format for the long term, given local constraints and resources. Institutions of all sizes can assess format risk in a manner that is appropriate for their organization. In this paper, two U.S. federal government agencies, the National Archives and Records Administration (NARA) and the Library of Congress will discuss their approaches to risk assessment through digital file format assessments, which share a common framework of seven sustainability factors. These two institutions, while both located in the U.S., differ in their regulatory contexts, collecting scopes, technical infrastructure, and organizational structure. This paper will present the similarities and differences in these approaches, which were designed to meet the different goals of their institutions, demonstrating how the assessment of all risk is local and context-dependent.
The Library of Congress and NARA recognize the value of developing and maintaining adaptable models for assessing file format risks. While both start from a common base, each institution has expanded upon and transformed the assessment model to meet its own institutional needs. The shared starting point is the seven sustainability factors defined in the Sustainability of Digital Formats [1], which frame file format evaluations within our differing local contexts:
Disclosure, the degree to which complete specifications and tools for validating technical integrity exist and are accessible to those creating and sustaining digital content
Adoption, the degree to which the format is already used by the primary creators, disseminators, or users of information resources
Transparency, the degree to which the digital representation is open to direct analysis with basic tools, including human readability using a text-only editor
Self-documentation, the degree to which digital objects contain basic descriptive, technical, and other administrative metadata
External Dependencies, the degree to which a particular format depends on particular hardware, operating system, or software for rendering or use and the predicted complexity of dealing with those dependencies in the future
Impact of Patents, the degree to which the ability of archival institutions to sustain content in a format will be inhibited by patents and licenses
Technical Protection Mechanisms, the implementation of mechanisms such as encryption that negatively impact and prevent the preservation of content by a trusted repository
In preparation for a major update of its Transfer Guidance to U.S. government agencies in 2014 [2], NARA examined the file formats in its holdings and gathered information about formats that agencies indicated would be transferred to NARA in the future. To undertake this examination, NARA required a structure for the comparisons, creating a quantified Transfer Format Suitability Matrix with 37 data points covering the sustainability of proposed “Preferred” and “Acceptable” file formats.
Similarly, the Library of Congress created a format evaluation template which comprises global and local factors to support a consistent and flexible analytical structure for clearer definitions of “Preferred” and “Acceptable” when categorizing digital file formats in the Recommended Formats Statement. Along with the shared sustainability factors, these rubrics have laid the foundation for format evaluation at each institution. Each institution’s approach to format analysis, local context guiding analysis, challenges, and future directions are further described below.
The Library of Congress introduced its Recommended Formats Statement (RFS) [3] in 2014 as part of its vision of future collection development, emphasizing digital content at least as much as physical content. By offering parameters for creative works that would encourage preservation and long-term access, the RFS gave staff at the Library of Congress helpful guidance in their acquisition of collection material and, ideally, offered similar guidance to creators, vendors, and distributors as well.
To ensure its ongoing usefulness, the RFS is updated yearly on a consistent schedule with “content teams” composed of experts [4] in each category of creative content [5]: textual works, still image works, moving image works, audio works, musical scores, datasets, GIS, geospatial and non-GIS cartographic, design and 3D, software and video games, web archives, and email. These content teams are the mainstay of the RFS because they interact with the communities of practice and work with the digital files on a regular basis. They see real-world implementations of the formats in use, and their input into the evaluation and risk assessment is invaluable. Each year, RFS content teams gather to discuss the previous year’s matrix and determine whether a current format should change status, whether a new format should be added, or whether a format should drop off completely. For example, FFV1 in Matroska moved from an “Acceptable” format to a “Preferred” one in 2023, as detailed in the blog post on The Signal, “Embracing FFV1 in Matroska Container as a ‘Preferred Format’ in the RFS” [6]; 3MF (3D Manufacturing Format) [7] was added as an “Acceptable” format for 2D and 3D Computer Aided Design vector formats; and the MrSID still image format was dropped in 2023 [8]. All updates are listed in a change log [8].
Beyond the annual review and revision, the Library of Congress initiated a major overhaul of the RFS in 2020 with the addition of the standard model for assessment criteria for formats. This update established a “Level of Service” model which clearly defined the differences between “Preferred” and “Acceptable” formats, in addition to outlining the significance of these differences in practical terms. A standard evaluation matrix for all content categories brought tighter global factors (modeled on the Library’s seven sustainability factors [1] used to evaluate digital formats: disclosure, adoption, transparency, self-documentation, external dependencies, impact of patents, and technical protection mechanisms), as well as local and institutional factors, such as staff and systems capacity for handling various formats, software/hardware/operating system availability, representation/extent in LC collections/storage and established workflows. These evaluations help estimate the level of resources at the Library of Congress available to preserve and manage the content over time. The evaluation matrix model [9] is publicly available so that other institutions can adapt it with their own local factors.
The questions in the matrix are not formalized or specifically weighted. They are formatted to be answered in a few words (“Yes,” “No,” “Maybe”) with additional context if needed to allow for nuanced discussion of issues by the content team experts. Each of these factors may have different emphasis or importance depending on the community of practice and content type. Some may not be applicable or essential for every format.
Both halves of the equation, the global/community factors and the local/institutional factors, receive equal consideration in the final decision on whether a format is “Preferred” or “Acceptable,” and both contribute equally to determining long-term risk levels.
The Library of Congress uses the matrix to categorize a format as “Preferred” or “Acceptable” with these assessments:
Preferred formats:
Global/community factors: Meets or exceeds benchmarks for all relevant sustainability factors
Local/institutional factors: The Library of Congress has the skills, experience, workflows, tools, and systems to manage and preserve these formats in current systems with confidence.
Acceptable formats:
Global/community factors: Meets minimum acceptability across benchmarks or does not meet all relevant sustainability factors.
Local/institutional factors: The Library of Congress can manage this format at a basic level of acquisition, management, and preservation; and a greater ability for management and preservation is within the Library’s capacity with further investment.
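The categorization described above can be sketched in a few lines of Python. This is only an illustration: the matrix questions are not formalized or weighted, so the summary labels (`"meets_all"`, `"partial"`, `"confident"`, `"basic"`, `"none"`) and the function name `categorize` are hypothetical simplifications, not the Library's actual procedure.

```python
from typing import Optional


def categorize(global_summary: str, local_summary: str) -> Optional[str]:
    """Return "Preferred", "Acceptable", or None for a format.

    global_summary (assumed labels):
        "meets_all" - meets or exceeds benchmarks for all relevant factors
        "partial"   - minimum acceptability, or not all relevant factors met
    local_summary (assumed labels):
        "confident" - skills, workflows, tools, and systems are in place
        "basic"     - basic acquisition, management, and preservation only
        "none"      - no current capacity (a case the text does not describe)
    """
    if global_summary not in ("meets_all", "partial"):
        raise ValueError(f"unknown global summary: {global_summary!r}")
    if local_summary not in ("confident", "basic", "none"):
        raise ValueError(f"unknown local summary: {local_summary!r}")

    # Both halves must be at the top level for a format to be "Preferred".
    if global_summary == "meets_all" and local_summary == "confident":
        return "Preferred"
    # Any remaining combination with at least basic local capacity.
    if local_summary in ("confident", "basic"):
        return "Acceptable"
    return None
```

In this sketch neither half can compensate for the other: a format that meets every global benchmark is still only “Acceptable” if local capacity is basic.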
The Library of Congress has to balance the competing needs of creating a universal collection of the creative works of the nation and the world and ensuring that collection is available and useful for generations to come. Given the nearly universal scope of creators this includes, the Library of Congress must be prepared to receive, manage, preserve, and serve materials in all formats and in an unknown number of file formats. Over the past twenty-five years, the Library of Congress has expanded its capacity in this regard, slowly but steadily, trying to meet its mission of building that universal collection without committing itself to attempting to maintain more content—and more types of content—than it can maintain and serve for the centuries to come. The Library of Congress has therefore tended to work programmatically, identifying and targeting specific types of digital material, developing means of including that in its collection development, establishing it as a routine method of acquisition, and using the lessons learned to build that capacity out more widely. By addressing digital serials acquired under legal deposit, the Library not only reached a point at which it was acquiring more digital than physical serial issues through this means annually, but began addressing questions of metadata and file formats more widely. Similarly, by using web harvesting to acquire the publications of the governments of the individual states of the U.S., the Library of Congress was able to maintain its unparalleled collection of these types of publications and begin to use web harvesting for other digital publications—governmental and other—being distributed via the web.
Now, the Library of Congress is at a key inflection point. Drawing on the experience, capacity, and confidence gained from building, preserving, and serving the digital content in its collection, it is moving to a more proactive stance with regard to digital acquisitions. Starting this year, the Library of Congress is moving away from treating the acquisition of physical material as its default in collection building and is beginning to prioritize digital content under a framework called “ePreferred” [10]. Given the scope and scale of the institution and its mission, this will proceed with the same prudent care that the Library has previously applied to digital content. It will primarily focus on commercial or mass-produced content and will, of course, apply only where both digital and physical versions of an item exist. Naturally, the Library will continue to acquire much in physical formats, especially unique items such as personal papers or ethnographic collections, for which the preference will remain the original format.
Yet it is a significant change in the vision for its universal collection, one in which leadership, management, and staff from across the institution weigh the benefits and drawbacks of acquiring different types of material in digital form first, in order to build a collection that is of maximal use to its users now and for decades to come. This migration to “digital first” or “ePreferred” will have a substantial impact on the Library of Congress over time. It is already the world’s largest library, with official statistics from 2022 claiming “175.77 million items in its collections” [11], but digital items bring their own set of challenges. The sheer scope and scale of managing extremely large volumes of wildly heterogeneous digital content has significant impacts on workflows, tools, storage, and much more. For staff in roles across the Library of Congress to do this successfully, greater emphasis will be needed on identifying the aspects and particulars of digital content that further both the building of the collection and its long-term accessibility, and a clear and comprehensible understanding of file format assessment will be essential.
Further improvements and enhancements to the evaluation matrix have been rolled out over the years. The 2023 version of the matrix included a new emphasis on access pathways including browser access via online catalogs. This covers a variety of platforms such as the main catalog at https://catalog.loc.gov/ [12]; PPOC, the Prints and Photographs online catalog [13]; Stacks, the primary system for accessing rights-restricted digital materials as well as internal and onsite resources [14]; and more. Planned work for 2024 will include expanding support for digital accessibility as part of the self-documentation sustainability factor. These include features such as tagged text for screen readers or captions and subtitles for audiovisual content.
One of the questions that the RFS has faced is that of specificity. How detailed should the RFS be in order to be useful? Is the file format preference enough or should it include more granular information about the technical structure and composition of the file, something akin to file creation guidance? For audio content, for example, should the RFS state a preference for specific bit depths and sampling rates? Will having more consistent files contribute to lower risk levels over time? There have been no decisions about this—just high-level ponderings about the usefulness and feasibility of this approach for an institution as varied as the Library of Congress to implement.
The Library does not have the legislative authority to demand content in specific digital formats outside of legal deposit for digital books and serials, which remains limited. The RFS is guidance for collections development staff to seek out content in “Preferred” or “Acceptable” formats but the Library cannot require that publishers use one format over another. The best that the Library can do is use its position of influence to encourage other stakeholders in the ecosystem of creative works to consult our guidance. Beyond that, the Library purchases large numbers of items for the collection every year and can present vendors and publishers with clear preferences on the part of the Library that can help determine which material should be prioritized and which vendors can best meet those needs.
The United States National Archives and Records Administration (NARA), with several decades of history accessioning and managing electronic records, faces an ongoing challenge in the multiplicity of file formats in its holdings, some of which may be decades old. This challenge required that NARA develop a methodology to analyze and visualize what it has in its holdings in order to assess and mitigate risk [15].
The Suitability Matrix, discussed in the Introduction, proved to be a platform on which to build a Risk Matrix: a formal risk instrument used to track the decisions underlying preservation action recommendations, with the addition of factors around internal prioritization for preservation actions. A data point was added to compare the number of files of a given format in the holdings against the entire corpus of holdings, that is, the file format’s proportion of the holdings. It also became clear that NARA’s current capability to process each format had to be incorporated into the risk assessment.
The Risk Matrix continues to use the core seven sustainability categories shared between the Library of Congress and NARA. Overall ratings from each of the categories are used to calculate a risk level for each format, which is added to the internal prioritization rating to calculate a NARA-specific risk rating. For each file format documented in the Risk Matrix, NARA develops File Format Action Plans (Plans) [16]. These Plans document the outcome of the Risk Matrix assessment (“Low,” “Moderate,” or “High Risk”), collate links to format specifications or documentation, identify the related “Preferred” and “Acceptable” formats from the NARA Transfer Guidance, and include the recommended preservation outcomes (transform records to new formats, procure/develop tools that enhance or extend NARA’s capability to manage records in that format, or to explore additional options), preferred transformation tool(s), and available viewer(s). The recommended preservation tools and actions for formats included in the Plans are always based on current NARA decisions and capabilities at the time. Additionally, NARA develops Record Category Action Plans to identify significant properties [17], [18] of record types (e.g. Email, Still Images) in NARA’s holdings that should be retained, if possible, in any format migration. Collectively, the Risk Matrix, File Format Action Plans, and Record Category Action Plans are known as NARA’s Digital Preservation Framework.
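As a rough illustration of the calculation described above, the following Python sketch sums per-category ratings into a risk level, adds an internal prioritization rating, and maps the result to a label. The category keys, the cut-off values, and the convention that higher totals mean lower risk are assumptions made for this example; NARA's published spreadsheets define the actual rubric.

```python
# The seven shared sustainability categories (key names are illustrative).
CATEGORIES = (
    "disclosure",
    "adoption",
    "transparency",
    "self_documentation",
    "external_dependencies",
    "patents",
    "technical_protection",
)


def nara_risk_rating(category_ratings: dict, prioritization: int) -> int:
    """Sum the category ratings, then add the internal prioritization rating."""
    missing = [c for c in CATEGORIES if c not in category_ratings]
    if missing:
        raise ValueError(f"missing category ratings: {missing}")
    risk_level = sum(category_ratings[c] for c in CATEGORIES)
    return risk_level + prioritization


def risk_label(rating: int, low_risk_at: int = 10, moderate_risk_at: int = 0) -> str:
    """Map a rating to a label; both thresholds are hypothetical."""
    if rating >= low_risk_at:
        return "Low Risk"
    if rating >= moderate_risk_at:
        return "Moderate Risk"
    return "High Risk"
```

An institution adapting this approach would substitute its own categories and thresholds, in keeping with the idea that risk is assessed locally.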
The first draft of the Framework was made available for public feedback in September 2019 through a NARA GitHub repository [19]. Suggestions were submitted from the digital preservation community about additional formats that should be included due to their ubiquity or high risk, and the level of granularity at which the formats were being described. Additionally, it was suggested that links to major community resources, such as PRONOM [20] and the Library of Congress Sustainability of Digital Formats [1] should be included, and that the Framework needed to be in a format for use in machine-actionable processes. These suggestions were accepted and integrated into the next version.
In mid-2020, the Digital Preservation Framework was officially released through GitHub, consisting of NARA’s File Format Action Plans for around 500 file formats and format variants common in NARA’s holdings (e.g. PDF 1.6 and 1.7 are separately analyzed), and 16 Record Category Action Plans. NARA maintains and updates the Framework on an ongoing basis in response to changing risks, new technologies, and newly analyzed formats. New releases are published on GitHub on a quarterly basis, accompanied by a Change Log. The Risk Matrix and the rubric used to determine the relative weights of each sustainability factor are available as spreadsheets for institutions to download and adapt to their own needs [21].
The creation of standardized instruments for digital preservation risk assessment was necessitated by the U.S. government’s broader and concerted move toward a fully digital government. NARA and the Office of Management and Budget (OMB) issued the “Managing Government Records Directive” (M-12-18) [22] in August 2012. The memorandum required federal agencies to eliminate paper and use electronic record keeping “to the fullest extent possible” by December 2019. Agencies were instructed to manage electronic records electronically, whether temporary or permanent, rather than printing them to preserve paper copies. This encompassed all born-digital formats (email, word processing, structured data, digital design, etc.) as well as SMS texts, encrypted communications, messaging apps, and social media platform content. It was this work that drove not only an update of the Format Transfer Guidance [2], but also the creation of NARA’s first Digital Preservation Strategy in 2017 [23] and the Digital Preservation Framework [19].
This directive was followed by the memorandum “Transition to Electronic Records” (M-19-21) [24], which further stipulated that agencies must, as of January 1, 2023, transfer only born-digital and digitized records to NARA in addition to managing them electronically, forgoing the transfer of analog records unless those records have inherent value in their physical nature, such as hand-signed treaties. In part due to the impacts of the COVID-19 pandemic, the memorandum “Update to Transition to Electronic Records” (M-23-07) [25] was issued to shift the date to June 30, 2024.
In addition to preserving the records of government agencies, NARA is the custodian of records for two other areas of the U.S. government: legislative records (the official offices of the U.S. Congress, commissions, and committees), and the Executive Office of the President. The records of each custodial area are governed by different regulations and cannot be intermingled; NARA therefore has a suite of preservation systems, with each Presidential administration having its own instance of the Presidential preservation system. Collectively, this suite of systems containing federal, legislative, and Presidential electronic records is known as the Electronic Records Archives, or ERA [26].
Assessing the relative risk posed to different file formats and scaling up processes for massive volumes of electronic records in ERA became urgent work due to this broader shift toward a fully digital government. To determine how many files of a given format are in NARA’s holdings, reports from the various ERA systems must be standardized, merged, and then analyzed. This provides a high-level view of file formats in NARA’s holdings, without regard to provenance. The Plans as described in the Digital Preservation Framework apply to files once they have been deemed permanent for NARA's holdings, recognizing that appraisal guidelines for permanent records differ between Congressional, Federal, and Presidential records [27]. For example, a software executable received in a transfer from a federal agency would likely not be considered a permanent record and would not be retained. In contrast, the same file received through a Presidential records transfer would be a permanent record, as all physical and electronic files received under the Presidential Records Act [28] are permanent.
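The standardize, merge, and analyze steps for building a holdings profile can be sketched as follows. The report structure (a list of `(format_name, file_count)` pairs per system) and the name normalization are assumptions for illustration; the actual ERA report layouts are not described here.

```python
from collections import Counter


def normalize(name: str) -> str:
    """Standardize a format name: collapse whitespace and ignore case."""
    return " ".join(name.split()).casefold()


def holdings_profile(reports):
    """Merge per-system reports into {format: (file_count, proportion)}.

    `reports` is an iterable of reports, each an iterable of
    (format_name, file_count) pairs. Provenance is deliberately ignored,
    matching the high-level view described in the text.
    """
    totals = Counter()
    for report in reports:
        for name, count in report:
            totals[normalize(name)] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {fmt: (n, n / grand_total) for fmt, n in totals.items()}
```

The proportion in each entry corresponds to the data point comparing a format's file count against the entire corpus of holdings.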
Initially, the Framework focused on analyzing formats with at least 1,000 individual files identified in NARA’s holdings, based on the Holdings Profile. This threshold was later raised to 2,000 files due to NARA’s large volume of file formats and a need for additional parameters to guide prioritization. This number is not a strict rule, however, and additional factors are considered:
What is being received by custodial units. The Holdings Profile provides insight into what formats are in preservation storage, but some records undergo format migrations prior to ingest. By considering what formats NARA’s custodial units are receiving, rather than what they are preserving, the Framework can better document actions that should be performed during processing.
Anticipated future growth for NARA. Government agencies’ permanent records may be transferred to NARA shortly after their creation, or sometimes not for decades, depending on the records schedule. By maintaining an awareness of what formats records creators are using to conduct government business, NARA can proactively conduct early risk assessments, and update the Transfer Guidance Table of File Formats [29] for “Preferred” or “Acceptable” formats, when warranted.
Format uniqueness. When a format is found in NARA’s holdings that is un- or under-documented in other community resources, NARA endeavors to make findings about the format available to the larger community.
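A minimal sketch of how the file-count threshold and the additional factors above might be combined when selecting formats for analysis is shown below. The field names are hypothetical, and treating any single factor as sufficient to override the threshold is an assumption; in practice these are judgment calls rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    file_count: int                             # from the Holdings Profile
    received_by_custodial_units: bool = False   # seen at transfer, pre-ingest
    anticipated_growth: bool = False            # in active use by records creators
    under_documented: bool = False              # little coverage in community resources


def should_analyze(candidate: Candidate, threshold: int = 2000) -> bool:
    """Return True if the format is a candidate for risk analysis."""
    return (
        candidate.file_count >= threshold
        or candidate.received_by_custodial_units
        or candidate.anticipated_growth
        or candidate.under_documented
    )
```

For example, a rarely held but under-documented format would still be selected, reflecting the goal of sharing findings with the larger community.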
As the Digital Preservation Framework has grown over time, maintaining the Risk Matrix and scoring newly-added formats in a consistent manner has become more challenging. Since its inception, the Risk Matrix has been maintained in an editable spreadsheet that is contributed to by staff members within and outside NARA’s Digital Preservation Unit. There are several pieces of collateral documentation that serve as scoring rubrics, guiding staff on how to assign numerical scores for each risk factor and how to research questions that are harder to answer. Over time, several recurring challenges have arisen:
Despite the collateral documentation, staff members still assign scores differently, based on different subject matter expertise and subjective interpretation of questions.
Some questions have implicit logic that is followed inconsistently. For example, one question within the Disclosure category asks if the format has a published specification, and the next question asks if the specification has been approved and published by an internationally recognized standards body. If the response to the first question is “No,” the format does not have a published specification, the response to the next question should be “Not Applicable.” In practice, however, this second question has often been given a response of “No,” which compounds an already negative score.
Translating “Yes”/“No”/“Unknown”/“Not Applicable” answers to numerical scores introduces a point of failure and confusion to the scoring process, particularly because the potential numerical scores vary from question to question. For example, a “No” could be scored positively (as a 1 or 2), neutrally (as a 0 for unknown or not applicable), or negatively (as a -1, or -2) depending on the question.
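Two of these challenges, per-question score mappings and implicit logic between questions, lend themselves to being encoded rather than left to each scorer. The sketch below assumes hypothetical question identifiers and score values; it shows one way a dependent question could be forced to “Not Applicable” before scoring, so that a missing specification is not penalized twice.

```python
# Per-question score maps. Identifiers and values are illustrative only.
SCORES = {
    "has_published_spec": {"Yes": 2, "No": -2, "Unknown": 0, "Not Applicable": 0},
    "spec_from_standards_body": {"Yes": 1, "No": -1, "Unknown": 0, "Not Applicable": 0},
    # Here a "No" is the favorable answer, so it scores positively.
    "requires_proprietary_software": {"Yes": -2, "No": 2, "Unknown": 0, "Not Applicable": 0},
}

# child question -> (parent question, parent answer that makes the child moot)
DEPENDS_ON = {
    "spec_from_standards_body": ("has_published_spec", "No"),
}


def apply_logic(responses: dict) -> dict:
    """Return a copy of the responses with dependent questions corrected."""
    fixed = dict(responses)
    for child, (parent, blocking_answer) in DEPENDS_ON.items():
        if child in fixed and fixed.get(parent) == blocking_answer:
            fixed[child] = "Not Applicable"
    return fixed


def score(responses: dict) -> int:
    """Score a set of responses after enforcing the logic pathways."""
    fixed = apply_logic(responses)
    return sum(SCORES[question][answer] for question, answer in fixed.items())
```

Keeping the mapping in one table means a scorer only records the answer, and the sign of a “No” is decided once per question rather than each time a format is assessed.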
These inconsistencies have made it difficult for the Digital Preservation Framework to “scale up,” because more and more time must be devoted to maintenance and correcting errors. While the number of file formats analyzed in the Framework has increased, the rate of growth has slowed. As of early 2024, the Framework contains 720 file formats.
Additionally, there are many formats for which there is very little public information. In these cases, the formats are given mostly scores of 0, corresponding to a response of “Unknown.” In many instances, this gives a relatively unknown format a more positive score than a format that is better known, with a high number of risk factors. After seeing several examples of this, the Digital Preservation Unit decided that, on a high level, formats that are more “Unknown” pose just as great a risk as formats that are known to have multiple risk factors, because it is not possible to plan for and mitigate unknown risks.
At the start of 2024, a number of years had passed since the scoring metrics and logic of the Risk Matrix were created, and it was time to revisit the assumptions implicit in the Risk Matrix and determine whether they still held. Some categories, such as Disclosure and Adoption, were weighted more heavily than others, such as Self-documentation, in the overall risk rating. Within those categories, some questions carried more weight than others. Furthermore, some questions in the Risk Matrix proved so difficult to answer in practice that more than half of the responses across all file formats were “Unknown.” In an effort to make the Risk Matrix more standardized, scalable, and better aligned with our current practices, the Digital Preservation Unit decided to perform a comprehensive review of all questions. The Digital Preservation Unit considered many different kinds of changes, including removing and condensing similar questions, updating the relative weights of questions, enforcing logic pathways for certain types of questions, and adding new questions to address additional risk factors that have emerged over the past decade since the initial Suitability Matrix work.
Ultimately, the Digital Preservation Unit made many small changes to the Risk Matrix that have a large impact overall. Several questions were removed, resulting in a smaller range of potential scores. Implementing the concept that formats with less information available about them are higher risk (i.e., “Unknown” is a response that has a negative impact) has translated to more formats with low scores across all categories in the Risk Matrix.
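The effect of treating “Unknown” as a negative response can be seen in a small worked example. The five-question set, the +1/-1 values, and the -1 penalty are invented for illustration and do not reflect the Risk Matrix's actual weights.

```python
def total(responses, unknown_score):
    """Sum scores for questions phrased so that "Yes" is the favorable answer."""
    values = {"Yes": 1, "No": -1, "Unknown": unknown_score}
    return sum(values[r] for r in responses)


# A format with almost no public information.
obscure = ["Unknown"] * 5
# A well-documented format with several known risk factors.
known_risky = ["Yes", "No", "No", "No", "No"]

# With "Unknown" scored as 0, the obscure format looks safer than the known one.
before = (total(obscure, 0), total(known_risky, 0))

# With "Unknown" scored as -1, the obscure format no longer outranks it.
after = (total(obscure, -1), total(known_risky, -1))
```

Under the neutral scoring the obscure format totals 0 against -3 for the known format; with the penalty it totals -5 against -3, so the lack of information itself registers as risk.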
Not all of the challenges described above could be addressed in this initiative; however, they will be targeted in future work. The Digital Preservation Unit has drafted functional requirements for a database environment for maintaining the Digital Preservation Framework (including the Risk Matrix, File Format Action Plans, and Holdings Profile). This will give the unit more control over the workflow and logic pathways and leave less room for human error than a spreadsheet. Future directions may also include more robust implementation of the Risk Matrix and File Format Action Plans in NARA’s preservation systems. One of the unit’s goals is to implement support for auditing and reporting on risk conditions within our preservation systems, so that preservation actions can be taken on groups of files. This process could be either manual or automated, depending on system capabilities and the needs of custodial units. The Digital Preservation Unit is also hoping to implement closer integration between the Holdings Profile and the Risk Matrix, which would automatically connect the formats that are in the holdings with their various levels of risk. This more robust reporting could in turn justify and help prioritize the development of additional digital preservation functions in the future, based on the volume of at-risk formats known to be in the holdings.
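The proposed integration between the Holdings Profile and the Risk Matrix amounts to joining file counts with risk levels. The following sketch assumes both are available as simple dictionaries keyed by format name, which is a simplification; the `"Unassessed"` bucket is also an assumption, included so that formats missing from the Risk Matrix remain visible in the report.

```python
from collections import defaultdict


def at_risk_volume(holdings: dict, risk_levels: dict) -> dict:
    """Total the number of files held at each risk level.

    holdings:    {format_name: file_count}
    risk_levels: {format_name: "Low Risk" | "Moderate Risk" | "High Risk"}
    """
    volume = defaultdict(int)
    for fmt, count in holdings.items():
        volume[risk_levels.get(fmt, "Unassessed")] += count
    return dict(volume)
```

A report of this kind could support the prioritization described above by showing how many files sit in high-risk or unassessed formats.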
The Library of Congress and NARA have both found that successful file format risk assessment models must allow for a consistent application while requiring flexibility to account for unique institutional contexts. File format risks, and our ability to assess and mitigate identified risks, are not static. Managing the ever-growing diversity of formats in our holdings requires maintaining awareness of evolving institutional capacity, new challenges, and emerging solutions.
While NARA and the Library of Congress have the scale of their collections in common, nothing about this assessment model is designed solely or explicitly for large institutions; it could be used by an institution of any size with an interest or mandate to assess its risks. Whether an organization stewards five or five hundred formats, the need to assess formats is the same. This approach could be used as is or adapted by substituting or removing assessment criteria as appropriate; each organization should always set its own thresholds for low, moderate, and high risk based on its own capabilities.
We hope that these two case studies serve as models for institutions of all sizes and regulatory contexts. While the approach to risk assessment in these two models has been hyper-local and context-dependent, the public products produced by both organizations have proven useful to the international community as authoritative references that have been incorporated into other resources and workflows.