
File Fixity in the Cloud: Policy, Business, and Technical Considerations

Published on Sep 05, 2024

Abstract – When shifting storage managed on-premises into the cloud, many digital preservation repository managers must reevaluate their approach to guaranteeing file integrity. Most cloud storage vendors offer guarantees of file durability that call into question the need for continuous fixity checking. At the same time, cloud vendors tend to charge fees for bringing computational resources to bear on individual files, turning continuous fixity checking into a potentially burdensome expense. With these technical and business considerations in mind, this paper explores one institution’s solution to verifying fixity in a repository of approximately seventeen million files after shifting on-site storage into the cloud. It evaluates broader considerations for the field of digital preservation when considering file integrity in the cloud and concludes with an explanation of a local policy for affordable ongoing fixity monitoring adapted to the affordances of a cloud-based architecture.

Keywords: digital preservation, preservation planning, integrity monitoring, fixity, digital preservation policy

This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Inge Hofsink, Stefania Di Maria, Leontien Talboom and 1 anonymous reviewer. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.

Background

Many digital preservation repository managers have adopted or are considering adopting cloud storage to replace on-premises storage for their digital assets. Cloud storage relies on the same hardware (file servers, tape libraries) as on-premises storage, but is managed at scale by large corporations, allowing them to sell access to it at a reduced cost. Time-consuming tasks like refreshing server hardware without losing data, monitoring and replacing disks, and patching and updating server operating systems are taken on by the vendor, and remain largely invisible to the consumer of cloud storage. The shift to the cloud, however, signals more than a subsidization of certain cost categories achieved through economies of scale. This becomes most evident when looking into the example of our library’s attempt to “lift and shift” our on-premises repository architecture into the cloud, and the consequences this had for our digital preservation practices, in particular fixity monitoring.

For some background, the PREMIS data dictionary defines fixity as “information used to verify whether an object has been altered in an undocumented or unauthorized way” [1]. The National Digital Stewardship Alliance defines it as “the property of a digital file or object being… unchanged,” with the intent of providing “evidence that one set of bits is identical to another” [2]. Digital preservation repository managers commonly accomplish this by running an algorithm such as md5 or sha1 on a file and saving the resultant hash value. If the same algorithm is run on the file at a later date it will produce an identical value, as long as the file’s zeroes and ones have not changed.
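In practice, a check of this kind takes only a few lines of code. The sketch below (Python, standard library only) streams a file through md5 in chunks and compares the result to a previously stored value; the function names are our own illustration rather than part of any particular repository's codebase.

```python
import hashlib

def compute_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through md5 in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(path: str, stored_hash: str) -> bool:
    """A file passes its fixity check when the freshly computed hash matches the stored one."""
    return compute_md5(path) == stored_hash
```

The same pattern works with sha1 or any other algorithm in `hashlib`; what matters is that the stored value and the later recomputation use the same algorithm.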

Our field’s most commonly referenced digital preservation guidelines and assessment tools all recommend that we track fixity, also known as file integrity, whenever transferring files from one storage point to another to protect against data loss, and to do so periodically when files are at rest to verify the health of our storage technology. Integrity is one of the five functional areas identified as essential to the National Digital Stewardship Alliance’s Levels of Digital Preservation (the others are Storage, Control, Metadata, and Content) [3]. Fixity Information is a required component of the Preservation Description Information specified by the Reference Model for an Open Archival Information System [4]. The Digital Preservation Coalition lists “processes to ensure the storage and integrity of digital content to be preserved” in its Rapid Assessment Model as an essential service capability for any organization with responsibility for digital curation [5]. Of 116 survey respondents in a 2021 study of fixity practices undertaken by the National Digital Stewardship Alliance, 97% confirmed that they utilize some form of fixity information in their organization, underscoring the importance of file integrity to the digital preservation community [6]. The same report also indicated a recent increase in the usage of cloud storage among its respondents, but lacked conclusive insights into how institutions are monitoring fixity in the cloud.

For the field of digital preservation, monitoring file integrity in cloud storage is not an exclusively technical challenge. In fact, the pursuit of a technical solution will inevitably call into question fixity management practices common in digital preservation repositories for storage managed on-site, not to mention considerations of repository architecture, file management policies, and financial and contractual commitments with cloud vendors. 

A Case Study

Local Digital Preservation Infrastructure

The University of Illinois at Urbana-Champaign Library is home to the locally managed Medusa digital preservation repository [7]. Medusa provides a web-accessible management interface to preservation microservices and storage for the Library's digital collections and publicly accessible repository platforms [8]. Presently, Medusa houses approximately 17 million files, comprising 261 terabytes on disk. Written in-house as open-source software in Ruby on Rails [9], Medusa has been in production since 2014. From 2014-2019 we hosted Medusa storage on-site either in the Main Library building or in a campus data center, with cloud backup in Amazon Glacier. In 2019, we shifted Medusa’s web application and storage fully into the Amazon Web Services cloud, completely eschewing storage on our premises [10].

We did not architect Medusa for the cloud. Rather, we intentionally constructed Medusa’s data model and the microservices it deploys to hew closely to the file system (specifically, Linux ext4) as a relevant and useful component of digital preservation infrastructure. Our reasons included:

  1. A file system hierarchy makes sense intuitively and intellectually.

  2. The file system is the native environment of electronic records. Storing them in a file system provides the simplest path to accurately representing archival concerns like original order (represented by the folder and file hierarchy) of acquired materials. This applies for files born natively on Windows or Mac operating systems, even when stored on a Linux file system.

  3. Most digital preservation tools designed to be run as repository microservices (e.g. FITS or checksum verification) have been written to run on a file system.

  4. A file system is portable. If repository managers need to pick things up and move them somewhere else, they can.

  5. While important differences exist between operating systems, a file system is largely software-independent, whereas object storage options tend to be vendor-specific with a risk of locking users into proprietary technologies.

Shifting to the Cloud

In 2017, our technical team began investigating cloud services as part of an enterprise-level IT strategy to move away from hardware maintenance wherever possible, and to simplify the effort needed to maintain infrastructure. Our goal was to reduce the time IT staff spent on operational overhead in order to have more time to work with stakeholders. We gave preference to Amazon Web Services (AWS) because our university had a contract in place with them featuring favorable terms of service and dedicated technical support. We also anticipated our repository services growing in scale in the near future, and liked that AWS offered convenient options for increasing storage and computational capacity on demand. We aspired to implement a “lift and shift” approach to picking up our Medusa infrastructure and depositing it into the cloud, with the intent of iteratively taking advantage of cloud-specific optimizations over time.

With this in mind, we investigated the AWS file system cloud storage option known as the Elastic File System (EFS). While EFS would have simplified our lift and shift, it also would have cost the Library $45,000 a month to house our repository, which then totaled 150 TB. The prospect of paying a minimum of $540,000 a year for file system cloud storage was not within budget, which jeopardized our ambition to retain our technical architecture as it stood.

As an alternative, we investigated a more affordable option on offer from AWS called the Simple Storage Service or S3. S3 is not a file system, but a flat object store of infinitely scalable containers called buckets. In these buckets, bitstreams are identified by unique key values. It turned out that S3 offered the only path to the cloud that would keep us within our budgetary constraints. We adopted it as a technical compromise, which required our developers to rewrite significant portions of Medusa code to ensure the continuity of workflows that interact with storage, especially those that assumed the presence of a file system. As a workaround, we assigned the file path for each object as its S3 key, which allows us to reconstruct the file hierarchy when needed. We do not know if there is precedent for this or if it is a recommended practice.
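Because each object's key is its original file path, the directory hierarchy can be rebuilt from a bucket listing alone. The sketch below illustrates the idea with a hard-coded key list standing in for a real S3 listing (which would require credentials and the vendor's SDK); the keys and function name are hypothetical.

```python
def build_tree(keys):
    """Rebuild a nested directory structure from path-like object keys."""
    tree = {}
    for key in keys:
        node = tree
        parts = key.split("/")
        for part in parts[:-1]:      # intermediate "directories"
            node = node.setdefault(part, {})
        node[parts[-1]] = None       # leaf: the file itself
    return tree

# Stand-in for the keys a bucket listing would return.
keys = [
    "collection_a/box_01/page_001.tif",
    "collection_a/box_01/page_002.tif",
    "collection_a/box_02/page_001.tif",
]
tree = build_tree(keys)
```

A microservice that expects a file system can then walk this tree, or materialize the objects under their key paths in a temporary directory.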

To ensure file replication across geographically diverse regions, we decided to house our content in two Amazon data centers in the United States, the primary located in Ohio, and the secondary in Oregon. The primary data store utilizes intelligent tiering to defray costs. With intelligent tiering, all files are readily accessible at the same speed, but frequently accessed files live in a more expensive storage tier than those that are rarely touched. The exact technologies underlying each tier are opaque, but Amazon does provide clear indicators of how much content exists in each tier at any given time. Medusa differs from many standalone digital preservation solutions in the way that our digital preservation services are connected in an interoperable manner to several access systems, among them an institutional repository, a data repository, and a digital collections portal. This allows us to keep preservation management in sync with access services and provides our curators with a holistic digital curation environment. Even with this active access component, most files in our repository are rarely accessed, with 92% living in the “Archive Instant Access tier” of infrequently touched digital content (see Figure 1).

Category | TB | Percentage
Archive Instant Access tier | 233.7 | 91.62%
Infrequent Access tier | 14.6 | 5.76%
Frequent Access tier | 6.5 | 2.55%
Standard Storage | 0.2 | 0.07%

Figure 1. Intelligent tiering when applied to locally managed repository content.

Our secondary data store, which replicates the primary, utilizes Amazon’s data archiving solution Glacier Deep Archive. Glacier doesn’t feature tiering because it is designed exclusively to house data that is rarely accessed. In both our primary and secondary storage, we utilize file versioning so that we can retrieve overwritten or deleted versions of files if needed through an administrative console. We have explored introducing a non-Amazon option for an additional file copy for greater assurance but have not yet done so. We have been advised that many other large repositories utilize Wasabi, which supports the S3 API, as a second provider.

Investigation of Fixity Monitoring in the Cloud

In the years that the Library hosted Medusa using on-campus server resources (2014-2019), Medusa featured continuous fixity monitoring as a background software feature, logging 3-4 fixity checks per file every year. On several occasions, failed fixity checks were beneficial in exposing errors in storage management in our data center, which we were able to recover from using file backups. While we verified fixity during our initial data transfer into S3, our continuous fixity monitoring feature depended on a file system. Given competing priorities, the introduction of emergency pandemic services [11], and staff departures from 2019-2022, we were not in a position to restore fixity monitoring in our repository until 2023.

Meanwhile, AWS storage publicizes 99.999999999% durability of data, suggesting that file fixity is carefully monitored and managed by the vendor [12]. Digital preservation professionals have speculated that on Amazon’s end they store three copies of each file, and when they discover a problem with a failing version they replace it with a durable one [13]. However, this process does not occur in a transparent manner, and digital preservation managers often feel a responsibility toward their stakeholders to verify that the word of their vendor is good. In our case, we were not willing to trust our cloud vendor’s claims of fixity before doing a comprehensive audit of current file integrity. In addition, we knew that we could not put off fixity monitoring any longer if we wanted to call ourselves a digital preservation repository.

We began our investigation by talking to our AWS account manager, who recommended an out-of-the-box solution using Lambda [14]. Lambda is an AWS service which allows for running code without provisioning servers. It provides computational processing on a large scale in a short period of time, but at a potentially high cost. In our case, we estimated that we could run the entire Medusa repository through a Lambda-driven fixity check in about a week, at a cost of $25,000.

Alternatively, we had a solution in place for instantiating files in a temporary file system, which we use for microservices like FITS run during repository ingest. This would have entailed pulling all items in our primary storage out of the Archive Instant Access tier into the more expensive Frequent Access tier, and bringing compute resources to bear on the items. We estimated that running fixity checking this way across the entire repository would cost approximately $30,000.

Dissatisfied with these prices, we then reached out to peer institutions which were known to have adopted cloud storage for repository services. We discovered a meaningful difference in scale between the data they housed and our repository. While we were managing hundreds of terabytes, many of them had not committed more than hundreds of gigabytes into cloud storage. Some of them had been willing to take advantage of a Lambda-based solution at a fraction of the price it would have cost us to do the same. We did, however, take inspiration from the APTrust, a large consortial digital preservation repository utilized by many North American academic institutions, which reported spinning up an Elastic Compute Cloud (EC2) instance and streaming objects through it in order to run continuous fixity checks in the same data center where they store their content [13].

Our solution

Looking to the APTrust, we decided to check the fixity of all files in our repository using an EC2-based solution. Having balked at the cost of verifying fixity in our primary storage, and uncertain whether we could afford to verify fixity on all versions of all our files, we decided as a preliminary step to at least verify fixity on our secondary Glacier Deep Archive storage. When it came to estimating the fees we would incur, we had to compare a variety of compute options on offer [15]. These included weighing whether it was more advantageous to take on lower data processing fees over an extended period of time in a T4g instance class incurring CPU credit charges, or higher processing fees within a shorter time window in a C6g class charged exclusively by the hour. Even though lower processing fees appeared attractive at first, in the end we decided we would save money by paying a higher hourly rate for increased computing capacity, with the goal of processing our files as quickly as possible. For this reason, we settled on a C6g instance optimized for high-performance computing and its pay-by-the-hour inclusive fee structure.
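The comparison above reduces to simple arithmetic: total cost is the hourly rate (plus any surcharges such as CPU credit charges) multiplied by the hours needed, and hours fall as throughput rises. The sketch below illustrates the trade-off with invented rates and throughputs; none of the numbers are actual AWS prices.

```python
def instance_cost(hourly_rate, total_tb, tb_per_hour, extra_per_hour=0.0):
    """Total compute cost = (hourly rate + per-hour surcharges) x hours of processing."""
    hours = total_tb / tb_per_hour
    return (hourly_rate + extra_per_hour) * hours

REPO_TB = 261  # repository size, from this paper

# Hypothetical figures for illustration only -- not real AWS pricing.
slow = instance_cost(hourly_rate=0.05, total_tb=REPO_TB, tb_per_hour=0.5,
                     extra_per_hour=0.02)   # cheap instance, credit surcharge
fast = instance_cost(hourly_rate=0.15, total_tb=REPO_TB, tb_per_hour=2.0)

# With these assumed numbers, the faster instance wins despite its higher rate.
assert fast < slow
```

Whether the faster instance actually wins depends entirely on the real rates and achievable throughput, which is why the estimate had to be done against the vendor's current price sheet.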

We developed a process to identify a batch of files from the Medusa database every night and send a restoration request through S3 Batch Operations to retrieve them from the Glacier Deep Archive bucket storage class [16]. The restoration of these objects triggered an SQS message once the files were ready for processing. We then utilized a DynamoDB global secondary index to establish a primary queue of these files, which we processed on a first-in, first-out basis. We tested checksums against originally stored md5 values, and used an SQS queue to send notifications back to the Medusa application reflecting the updated fixity verification information every morning. We monitored and refined the process, iteratively improving it from 50,000 files a day to 300,000 a day (see Figure 2). Having needed time to optimize and scale up our process, we verified fixity of our first files in August 2023 and completed the repository-wide audit in January 2024.

Figure 2. Diagram of locally implemented process for verifying fixity.
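The nightly cycle can be approximated in miniature. The sketch below replaces the AWS components (S3 Batch Operations, SQS, DynamoDB) with in-memory stand-ins so that only the core logic remains: select a batch, queue the restored objects first-in first-out, and compare fresh md5 values against the stored ones. All keys, contents, and the deliberately corrupted checksum are invented for illustration.

```python
import hashlib
from collections import deque

# In-memory stand-ins for the AWS pieces: the "bucket" maps object keys to
# bytes, and the deque plays the role of the first-in, first-out queue.
bucket = {
    "item/001.tif": b"first file",
    "item/002.tif": b"second file",
    "item/003.tif": b"third file",
}
stored_md5 = {k: hashlib.md5(v).hexdigest() for k, v in bucket.items()}
stored_md5["item/003.tif"] = "0" * 32    # simulate one bad stored checksum

def nightly_batch(keys, batch_size):
    """Select the next files due for verification (a stand-in for the database
    query that drives the restore request)."""
    return keys[:batch_size]

# Restored objects enter the queue and are processed first-in, first-out.
queue = deque(nightly_batch(sorted(bucket), batch_size=3))

failures = []
while queue:
    key = queue.popleft()
    if hashlib.md5(bucket[key]).hexdigest() != stored_md5[key]:
        failures.append(key)   # in production, reported back to Medusa via SQS
```

In the real pipeline each stage is asynchronous and message-driven, but the verification step itself is exactly this comparison of a freshly computed hash against the stored value.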

Results and Total Cost

Of our 17 million files, about 1,600 failed their checksum verification. All of them dated from a specific period of ingests during which our checksum generation feature, specifically its handling of files that curators had replaced on disk, was undergoing changes. Based on a quick analysis, we determined that the erroneous checksums were due to a process failure in this specific workflow, rather than storage failure. As Andrew Diamond of the APTrust has stated, “You may set up fixity checking thinking that it’s going to alert you to problems in your hardware, and then you find out it actually alerts you to some problems in your process. And either way, those are things you want to know” [13]. We were able to address this by verifying the correctness of the files and storing them with new checksum values.

All told, our comprehensive fixity audit cost us $1,770 in AWS fees (Figure 3), a fraction of earlier estimates that would have relied on different technical approaches. That being said, the financial planning, the comparison of technical approaches, the implementation of a technical solution, and the ongoing monitoring of that solution required a significant commitment of time and effort over an eight-month period from software developer and co-author of this paper Genevieve Schmitt. Were we to factor her time into our total costs, it would have been less expensive and quicker to run the costly Lambda solution initially proposed.

Service | Fee
S3 | $1,365.70
SQS | $41.44
EC2 | $228.08
DynamoDB | $127.42
CloudWatch | $5.40
Data transfer | $1.99
Total | $1,770.02

Figure 3. AWS fees incurred for fixity audit.

Local Fixity Monitoring Policy

We emerged from our fixity audit reassured that our secondary storage was living up to its highly touted durability guarantees. We also confirmed that we need to monitor files recently added to the repository to identify problems related to process failure in a timely manner. While we did not audit our primary storage, it is subject to the same durability guarantee as our secondary storage, and we decided to develop a policy that would both trust this and seek to verify its reliability on a modest ongoing basis. To this end, we decided to define two categories of files:

  1. New. This refers to files ingested or updated in the past two years.

    1. Newly ingested. We consider any file that has been deposited into Medusa in the past two years to be new.

    2. Updated. In certain rare instances, curators will update a file after discovering a problem in it (for example, an incorrectly ordered page image in a book). The process of updating a file entails removing it from Medusa and replacing it with a new file with the same name and the same position in the storage hierarchy. In this case, a new md5 value is created to reflect that the file has been updated. We consider any file that has been updated in the past two years to be new.

  2. Stable. All files stored in Medusa are considered fixed, in that they are not meant to change. All fixed, stable files should be able to pass an integrity check against the md5 value created on their initial deposit. We consider any file that has been in Medusa for more than two years, and that has passed at least two fixity checks, to be stable.

Based on this analysis, we are adopting the technical policy below:

  1. New content: Run two fixity checks on all new items in primary storage as defined above within two years of deposit or updating. We recommend that this occur three months after ingest or update, and one year after ingest or update. The purpose of this is to catch flaws in work processes and underlying storage management in a timely manner.

  2. Stable content: Run continuous random sampling on all stable content on secondary storage in a financially responsible manner. Our 2023/2024 fixity review did not uncover any areas of concern for stable content, and suggested that the underlying fixity management guaranteed by our current storage provider is functioning as promised. The goal here is to trust that this level of file durability will continue, but to verify this durability at a modest pace in order to uncover flaws in storage and file management, should these arise.

We have yet to write or deploy code to reflect the decisions implied by this policy. As such, the policy is subject to change as technologies evolve and colleagues in the field of digital preservation offer their input on our work.
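As a sketch of what such code might look like, the example below computes the check schedule for new content (three months and one year after ingest or update) and the stable classification (more than two years old with at least two passed checks) for a single file. The function and parameter names are hypothetical, not drawn from the Medusa codebase.

```python
from datetime import date, timedelta

CHECKS_AFTER = (timedelta(days=90), timedelta(days=365))  # ~3 months, 1 year
STABLE_AGE = timedelta(days=730)                          # 2 years

def scheduled_checks(ingested_or_updated: date):
    """Dates on which a new file is due for a fixity check under the policy."""
    return [ingested_or_updated + delta for delta in CHECKS_AFTER]

def is_stable(ingested_or_updated: date, checks_passed: int, today: date) -> bool:
    """A file is stable once it is over two years old and has passed two checks."""
    return (today - ingested_or_updated) >= STABLE_AGE and checks_passed >= 2
```

Stable files would then be drawn by ongoing random sampling rather than a fixed schedule, keeping the recurring verification cost bounded.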

Analysis and Future Considerations

While there are great benefits to moving entirely into the cloud, a full cloud implementation of repository architecture will necessitate technical changes and compromises. If a repository measures the data it manages in terabytes rather than gigabytes and is not already utilizing object storage, the current market in cloud storage will likely push it in this direction. This poses challenges when many digital preservation tools and practices presume the existence of a hierarchical file system. The adoption of cloud affordances or vendor-specific solutions to technical challenges comes with the inherent risk of making repository solutions less portable than those that rely on shared standards and broadly used open technologies. While digital repository managers must continuously monitor their field for emergent best practices, policies and procedures for cloud storage must take cost as well as technology into consideration.

The challenges involved in shifting to the cloud lack clear best practices, meaning there are many uncertainties involved in how to best establish pragmatic fixity monitoring regimes and policies. Chief considerations include:

  1. Cost categories in the cloud differ from those incurred on site. Some things that used to be expensive are now affordable, and some things that used to be affordable are expensive. When seeking to adopt cloud services, it is not immediately clear what these will be without detailed planning and analysis, causing technical decisions to have important financial consequences. To succeed, organizations need IT staff who can understand a new panoply of implementation options, often those unique to a specific vendor, and how to leverage them in a way that accomplishes one’s digital preservation goals without going over budget.

  2. The pricing of cloud services may steer organizations toward the adoption of new and sometimes proprietary technologies, leading to compromises and changes of practice. These end up having important effects on workflows and system architecture. While it is unwise to place the cart of technology before the horse of one’s goals, we must acknowledge the necessity of compromise when adapting to a shifting technical and financial landscape.

  3. The field of digital preservation lacks clearly defined best practices for managing storage in the cloud. This is complicated by the broad variability between cloud vendors and the specialized services they offer. As a result, many organizations find their way as they go when implementing cloud storage. Even while attempting to ground technical decisions in digital preservation principles, it is often difficult to know what approach is best.

The implementation of digital preservation infrastructure tends to emerge through discussion between digital preservation professionals and colleagues in information technology. When shifting on-site storage into the cloud, digital preservation professionals who have striven to justify what differentiates their preferred storage practices from those prevalent in enterprise-level archiving may find themselves reevaluating and renegotiating said practices anew. The decisions around digital preservation practices are as technical as they are financial; in fact, when favoring a cloud-first approach, these two go hand-in-hand. The pursuit of a technical solution to assuring file integrity in the cloud will inevitably call into question fixity practices common in digital preservation repositories for storage managed on-site, not to mention considerations of repository architecture, file management policies, and financial and contractual commitments with cloud vendors. For us this has meant cultivating a detailed understanding of our cloud architecture and the financial consequences of technical actions taken in it, in order to best align our approach to our budget and our digital preservation principles.
