
Managing the Continuous Growth of a Repository for over 14 Years

Problems and Solutions for an Ever Expanding Open Archival Information System

Published on Aug 29, 2024

Abstract – The National Library of France (BnF) has been running a multi-purpose preservation system since May 2010. Since then, it has had to adapt to internal growth, with the arrival of many different kinds of material and the management of hardware renewal.

In 2018, the merge with the audiovisual repository forced the team to find innovative ways to handle a whole new class of materials while keeping up with regular operations.

Two recurring themes arise across all these events: the need to abstract and to partition both the storage and the work to be done on the data. This kind of strategy allows for managing growth and change without losing too much of one’s mind.

Keywords – Migration, Replication, Repackaging, Scale up, OAIS.

This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Dr. Stephen Abrams, João Andrade and 2 anonymous reviewers. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.

Introduction

In 2004, the National Library of France (BnF) decided to implement a multi-purpose preservation system, called SPAR (Scalable Preservation and Archiving Repository) [1], anticipating that, beyond the digitization program already under way, much more material would need such a system. Indeed, even though today more than 10 million documents have been digitized and are preserved in the SPAR system, a good 6 million more have also been ingested, ranging from web-harvested collections to born-digital acquisitions or donations, to documents subject to digital legal deposit1. The system went live in May 2010 and has never stopped operating since. After 14 years of operations and more than 6PB of data ingested, some lessons can be drawn on how to deal with growth and change, both expected and unexpected.

Dealing with Internal Growth and Hardware Obsolescence

Designing the Storage Module around Abstraction

One of the initial understandings was that software and hardware have different lifecycles. Indeed, while the software is regularly improved, particularly if agile development practices are in place, the hardware is expected to be replaced only when it reaches its end of life. However, we have witnessed that replacement occurs much sooner than that, the causes almost always lying with the storage equipment manufacturer. One recurring theme is the lack, or the high cost, of maintenance, which makes the continued use of otherwise perfectly fit storage equipment (tapes as well as drives) a hazard in the medium term. Another is business issues, such as the acquisition of a manufacturer by another company that then drops support for strategic pieces of equipment. Ultimately, it is our experience that the hardware lifecycle is about 6 to 8 years, which is in line with industry expectations [2], even though tape technologies are supposed to last for decades [3].

Fortunately, the design of the SPAR system allows the separation of the software from the hardware. The use of the Integrated Rule-Oriented Data System (iRODS) [4] as a core part of the storage module allows a uniform management of the underlying hardware without disturbing the applications. We make use of the concept of composable resources [5] to provide the application with “Storage Units” (SU) that abstract the underlying hardware by giving access to storage associated with service level agreements. Each SU is associated with various properties, like the number and type of copies, uptime and access time, that expose the service level it provides. A Storage Unit, then, is an abstraction over practical Storage Elements, which are each either a pool of tapes or of disks located in one location. Thanks to iRODS, a Storage Unit automatically takes care of data replication across its Storage Elements, which must each provide the same amount of storage, regardless of the technology used. Furthermore, each piece of data coming from the application (in this case “Archival Information Packages”) is abstracted away in what we call a “record”, having multiple replicas according to the service level of the SU where it is located. Each record is associated with properties (like size, name and checksum) used to fulfill all the operations related to storage, especially auditing.
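As an illustration, the relationships between Storage Units, Storage Elements and records can be sketched as a minimal data model. All class and field names here are hypothetical stand-ins, not the actual SPAR or iRODS schema:

```python
from dataclasses import dataclass, field

@dataclass
class StorageElement:
    """One pool of tapes or disks at a single location."""
    location: str
    technology: str      # e.g. "LTO" or "Jaguar"
    capacity_tb: int

@dataclass
class StorageUnit:
    """Abstracts Storage Elements behind a service level agreement."""
    name: str
    copies: int          # number of replicas the SU guarantees
    access_time: str     # e.g. "nearline" for tape
    elements: list = field(default_factory=list)

@dataclass
class Record:
    """One AIP as seen by the Storage Abstraction System."""
    name: str
    size: int
    checksum: str        # kept alongside the record for auditing
    storage_unit: StorageUnit

    def expected_replicas(self) -> int:
        # The SU's service level drives how many copies must exist.
        return self.storage_unit.copies

# Two Storage Elements on different technologies in distant locations,
# composed into one Storage Unit that guarantees two copies.
lto = StorageElement("site A", "LTO", 500)
jaguar = StorageElement("site B", "Jaguar", 500)
su = StorageUnit("su-books-01", copies=2, access_time="nearline",
                 elements=[lto, jaguar])
rec = Record("aip-000123", size=42_000_000, checksum="sha256:ab12",
             storage_unit=su)
print(rec.expected_replicas())  # → 2
```

The key property of the sketch is that the application only ever sees the `StorageUnit` and its service level; the underlying elements can be swapped without touching application code.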

We call this layer of SPAR the Storage Abstraction System (SAS2).

As abstractions, Storage Units allow two things:

  1. The software is not impacted by the technicalities of having to connect to different kinds of technologies at the same time.

  2. The storage itself is partitioned (both virtually and physically), which allows separating documents that fall under different jurisdictions, and also working on separate sets of documents step by step.

This will prove useful as we will see below.

As a policy, every document is stored at least twice: two copies on tape, in two geographically distant locations, using different technologies (an open one, LTO, and a proprietary one, initially StorageTek and now Jaguar). In doing so, we follow the usual best practice of keeping multiple distant copies on different technologies, as recalled in [6] or [7].

Undertaking Replication Migrations, Lessons Learned from the Past

As stated in the OAIS reference model [8], a Replication Migration is “A Digital Migration where there is no change to the Packaging Information, the Content Information and the PDI3. The bits used to convey these information objects are preserved in the transfer to the same or new media-type instance.”

Over the years, we have already had to carry out two replication migrations in order to cope with excessive maintenance costs. Beyond the financial aspects, the additional benefits were the gain in performance and the densification of storage. All of these aspects were part of the original plans for SPAR and proved true over the years.

To give a quick overview, the different replications were defined as follows:

  1. In 2011, the migration covered 78TB of data (209 thousand records) across 5 storage units and lasted 5 months;

  2. In 2017-2018, it covered 1.5PB of data (3.6 million records) across 23 storage units and lasted 18 months.

As these numbers show, the increasing amount of data tends to increase the time needed to process the whole migration. However, the use of Storage Units keeps the replication manageable. Indeed, we define arbitrary limits on volume (100TB) as well as on the number of packages (200,000) for each storage unit. This way the overall migration is divided into manageable parts and reachable goals, by virtue of partitioning. The end of the migration could also be extrapolated as we went, and a plan could be elaborated to prioritize which content should be processed first.
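The partitioning policy above can be sketched as a simple chunking function. The 100TB and 200,000-package limits come from the text; the function itself is a hypothetical illustration, not SPAR's actual logic:

```python
# Caps on each storage-unit-sized chunk, as stated in the text.
MAX_VOLUME = 100 * 10**12      # 100 TB, in bytes
MAX_PACKAGES = 200_000

def partition(records):
    """Group (name, size_in_bytes) pairs into storage-unit-sized chunks."""
    chunks, current, vol = [], [], 0
    for name, size in records:
        # Start a new chunk when either cap would be exceeded.
        if current and (vol + size > MAX_VOLUME or len(current) >= MAX_PACKAGES):
            chunks.append(current)
            current, vol = [], 0
        current.append(name)
        vol += size
    if current:
        chunks.append(current)
    return chunks

# Demo: the second 60 TB record overflows the 100 TB cap and opens a new chunk.
demo = [("a", 60 * 10**12), ("b", 60 * 10**12), ("c", 10 * 10**12)]
print([len(c) for c in partition(demo)])  # → [1, 2]
```

Each resulting chunk is a manageable, independently schedulable part of the overall migration, which is what makes the end date extrapolable.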

Moreover, because of the linear nature of tapes, the only way to achieve a reasonable copying time is to retrieve the content in a linear way (driven by the tapes themselves), rather than migrating content from an abstract perspective, which would result in random accesses ill-suited to tape technologies. This is a clear limit of our Storage Abstraction System, since we lose all notion of physicality in the process. As a consequence, we had to update each record with its tape identifier, so we could operate one tape at a time, flushing all its content into a work area and copying the information from there. An additional benefit is that we can choose which replica is used for the migration; in general, we use the replica located in the tape library with the most tape readers.
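A minimal sketch of this tape-driven ordering, assuming each record carries its tape identifier (the field names and record IDs are hypothetical):

```python
from collections import defaultdict

def plan_by_tape(records):
    """Group records by tape so each tape is mounted once and read linearly.

    records: iterable of (record_id, tape_id) pairs.
    Returns one work item per tape: flush its whole content to the
    work area, then copy from there.
    """
    by_tape = defaultdict(list)
    for record_id, tape_id in records:
        by_tape[tape_id].append(record_id)
    return dict(by_tape)

plan = plan_by_tape([("r1", "T001"), ("r2", "T002"), ("r3", "T001")])
print(plan["T001"])  # → ['r1', 'r3']
```

Processing `plan` tape by tape avoids the random seeks that a record-by-record traversal of the abstract namespace would cause.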

Finally, the only way to ensure no information is lost is by computing the replicas’ checksums and checking them against those of their respective records, preserved in the SAS with iRODS. In case of a discrepancy, a good replica is used to restore the offending one. This process is called “auditing” and is an integral part of the services rendered by the SAS. Ideally, a complete audit should be performed on a regular or maybe even a continuous basis. Unfortunately, when dealing with this much data, this does not always prove practical or possible, due for instance to the limited number of drives available while current operations may be prioritized. The process of replicating the information on a regular basis is therefore also a way of auditing (and hence, if needed, recovering) all the AIPs in a systematic and complete manner.
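The auditing step can be sketched as follows. This is a hedged illustration only: the real SAS relies on iRODS, and the function names, replica names and payloads here are made up:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Fixity value for a replica (SHA-256 chosen for the example)."""
    return hashlib.sha256(data).hexdigest()

def audit(record_checksum: str, replicas: dict) -> list:
    """Return the names of replicas that no longer match the record."""
    return [name for name, data in replicas.items()
            if checksum(data) != record_checksum]

def repair(replicas: dict, bad: list, good_name: str) -> None:
    """Restore each offending replica from a known-good one."""
    for name in bad:
        replicas[name] = replicas[good_name]

good = b"archival package payload"
replicas = {"siteA": good, "siteB": b"corrupted bits"}

bad = audit(checksum(good), replicas)
print(bad)  # → ['siteB']
repair(replicas, bad, "siteA")
print(audit(checksum(good), replicas))  # → []
```

The record's checksum, kept independently of the replicas, is what makes both detection and recovery possible.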

With past replication migrations, the principles that proved essential were:

  1. Chunking operations around Storage Units, allowing visibility and planning.

  2. Allowing integrity checks and, when needed, restoration of corrupted data.

With these lessons in mind, we were able to redesign our replication process.

Designing a Massive Replication with Partitioning and Abstracting

In 2023, a new migration became necessary. It is a major one, since not only is the tape generation upgraded, but the tape libraries themselves need to be replaced, as we had to change manufacturer.

Moreover, the migration numbers are overwhelming: it now covers 4.5PB of data for 11.8 million records across 82 storage units. At the time of writing, we forecast a duration of 12 months; the migration began on 2023-07-06.

Over time, we were forced to improve the way the replication was operated. At first, one Excel file per storage unit was enough to ensure that each package was replicated, that the fixity checks were run, and that appropriate actions were taken if any failure on a tape or a drive was discovered.

But with the huge growth of our repository, we first had to write a set of scripts to automate the process. We finally developed a custom application, called Spardeck4, which includes a dedicated database and takes care of the whole replication, giving a clear view of the progress (see Fig. 1) and allowing easy interactions, such as choosing the storage unit to replicate, or quickly pausing and resuming.

Figure 1. User interface to manage the replication

Furthermore, an important aspect to keep in mind is that this migration is the moment when obsolete records are actually physically deleted. When a deletion or a replacement is requested during daily operations (refer to [9] for when such an operation occurs in our preservation system), the record is just dereferenced, but the data itself is not erased from the tapes. The red part of the pie chart on the right of Fig. 1 represents the proportion of records that will not be replicated because they were previously dereferenced.

The addition of a new application, Spardeck, and of specialized scripts to carry out the operations proved essential and has already allowed us to migrate 2PB of data in 6 months (as of March 2024). These numbers are encouraging, but we remain cautious since, for the moment, we have focused on the oldest records, stored on LTO6 tapes; the remaining records are stored on LTO7 tapes, which prove to be surprisingly less reliable5.
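As a back-of-the-envelope check, the figures quoted in this section (2PB migrated in 6 months, out of 4.5PB in total) let one extrapolate the remaining duration. The linear extrapolation below is our own illustration, not Spardeck's actual forecasting model:

```python
def forecast_months(done_pb: float, elapsed_months: float, total_pb: float) -> float:
    """Remaining months, assuming the observed throughput stays constant."""
    rate = done_pb / elapsed_months        # PB per month so far
    return (total_pb - done_pb) / rate     # months still needed

# 2 PB done in 6 months, 2.5 PB to go → about 7.5 more months.
print(round(forecast_months(2.0, 6, 4.5), 1))  # → 7.5
```

The caution expressed above is exactly why such a linear forecast must be treated with suspicion: if the remaining LTO7 tapes prove less reliable, the observed rate will not hold.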

We can see that replication migration is mainly a technical operation, focused on preserving the integrity of the data and allowing the continuous running of the storage. This is a virtue of abstracting away the data as records in our SAS. It allows this kind of migration to be performed essentially by the IT teams, without much involvement from the Archive’s stakeholders. This will not be the case for the next case study below, another kind of migration that takes place at the same time and involves a complete change of repository.

Migrating from One Archival Repository to Another

The Audiovisual Repository

At BnF, SPAR was not the only repository archiving digital documents. In the past, all audiovisual material was managed and archived in a dedicated system, due to its specificities and the kind of accesses it underwent. Indeed, twenty years ago, audiovisual material was all on physical carriers, either with an analog signal (magnetic tape cassettes, videocassettes, vinyls…) or a digital one (CDs, DVDs…). Thus, they were made available through specialized equipment that could be automated by robots handling CDs or videocassettes, or by staff who would insert the carrier into the appropriate reader at the user’s request. It rapidly became clear that many of these carriers were fragile and subject to degradation over time. Therefore, an important program to digitize (or dematerialize) them was undertaken, driven by these preservation purposes. The new policy was then to only communicate immaterial digital surrogates of these audiovisual documents. In the meantime, current production has become digital first, with the rise of online publications, which are now the norm and no longer the exception. At the same time, digital audio and video have been democratized, and the library began to handle this kind of content in a manner analogous to textual documents and images.

In 2018, the library made the decision to merge the handling of these various kinds of content, both to pool resources and to allow widespread use across the library’s various departments. Therefore the previous audiovisual repository needed to be migrated into the main preservation system: SPAR. Notably, this represents approximately 2PB of data, ranging from sound and video to games and applications, coming from a variety of contexts ranging from the digitization of periodicals to ROM extraction from old game cartridges, and much more.

It must be stressed that not only did these repositories not operate under the same packaging standards; they also did not use exactly the same storage technologies.

The first step was therefore to pool the hardware so that only one kind of tape library was used for all the content. Before that, our two systems each used their own storage infrastructure. After that, the audiovisual system (or SA, for “Système Audiovisuel” in French) would use the storage infrastructure of SPAR, virtually partitioned to handle the two separately.

Beginning with a Replication Migration of the SA, Anticipating Repackaging

This initial migration shares many aspects with the one described in the previous part:

  • It is a replication migration. The data is intended to be replicated in a new storage infrastructure in order to retire the old one.

  • It involves copying files from one place to another, while rerouting accesses to their new location.

Much more interesting for our discussion is the list of its essential differences:

  • While there was a list of packages present in the SA (Audiovisual System), these packages (called “boxes” in this context) were stored directly on the tapes. Technically, each file contained in the SA was an entry on the tape system. A box, then, is simply a directory containing files, stored as a TAR record on tape [10]. This renders accesses less efficient and more error prone.

  • As a consequence, there was no checksum to verify at the level of the package, contrary to those of SPAR. Checksums do exist at the level of files, but they are written in metadata files that are part of the packages, and are thus unusable for this migration. They will be used and verified in the next migration described below.

  • There was no notion approaching that of Storage Unit, so we could not use SPAR’s existing toolchains.

This list illustrates the main differences in philosophy between our two historic systems. By contrasting the two we can already point out ideas that render the work on migrations more manageable:

  • Grouping the content of packages (i.e. files) in an archive container is easier on the storage infrastructure. In SPAR we use tarballs; it could have been ZIPs.

  • Grouping content of packages in an archive container allows storing a checksum at the level of packages, allowing for automatic auditing in the process.

  • A consequence of the last point is that after a migration, we can prove that no file has been left behind. This will not be possible for this migration.

Two decisions were made as a consequence:

  1. The process would rearrange the boxes, storing the files of each box in a ZIP archive, without any compression. At the very least, this would render reads and writes more efficient, and a ZIP archive may be tested for fixity.

  2. The boxes would be arranged on the new storage using criteria already known at this time to batch the next migration. Essentially, using the call number of the copy used to digitize the data, we organized the packages by “collections”, mimicking the Storage Units. This would facilitate the next migration (see next section below).
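The first decision above can be sketched with Python's standard `zipfile` module: storing a box as an uncompressed ZIP (`ZIP_STORED`) keeps reads and writes sequential while making a package-level fixity check possible. The file names and contents below are invented for the example:

```python
import hashlib
import io
import zipfile

def pack_box(files: dict) -> bytes:
    """Store a box (a directory of files) as a single uncompressed ZIP."""
    buf = io.BytesIO()
    # ZIP_STORED means no compression, as in the migration described above.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in sorted(files.items()):
            zf.writestr(name, data)
    return buf.getvalue()

box = pack_box({"track01.wav": b"\x00" * 1024, "metadata.ini": b"date=1999"})

# A single package-level checksum now covers every file in the box...
package_checksum = hashlib.sha256(box).hexdigest()

# ...and the archive itself can be tested for fixity as a whole.
with zipfile.ZipFile(io.BytesIO(box)) as zf:
    print(zf.testzip())    # → None (no corrupted member)
    print(zf.namelist())   # → ['metadata.ini', 'track01.wav']
```

Contrast this with the previous situation, where each file was an individual entry on tape and no package-level verification was possible.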

As a recurring theme, we can see that abstraction and partitioning can make a huge difference in the way we may perform this kind of migration.

Repackaging the Audiovisual Material

At this point in time, although the migration of the SA with rearrangement is done and the storage hardware is shared between our two repositories, they remain two very distinct systems. As a reminder, the goal for the library is to manage one system instead of two. However, the choices of description, formats, structure and access types are different from those of SPAR. Therefore we need to understand the contents and their metadata format before being able to transform them into suitable SPAR packages.

This is what is called a “Repackaging Migration”, which is defined in the OAIS as a “Digital Migration in which there is an alteration in the Packaging Information of the AIP”.

In the audiovisual repository, all technical metadata are described in files that roughly follow the syntax of INI files [11], but with attribute names that have fluctuated over time and across stakeholders. We therefore first had to define a minimum set of mandatory metadata (e.g. digitization date) and associate several possible sources with each of them. Once this work was done for each batch of data that we identified, we could find a place for each of these data in a SPAR package, mapping them to a METS-PREMIS syntax (see [12] & [13]) in order to be aligned with the usual metadata standards. This exercise proved very useful by making the meaning of each piece of collected information much more explicit.
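The lookup of a mandatory field under several historical attribute names can be sketched with Python's standard `configparser`. The alias table, section name and sample file below are invented for illustration; they are not BnF's actual vocabulary:

```python
import configparser

# One mandatory target field, with the fluctuating names it may
# appear under in older metadata files (hypothetical aliases).
ALIASES = {
    "digitization_date": ["date_numerisation", "dateNum", "digit_date"],
}

def extract(ini_text: str, section: str = "technical") -> dict:
    """Resolve each mandatory field from the first matching alias."""
    cp = configparser.ConfigParser()
    cp.read_string(ini_text)
    out = {}
    for target, candidates in ALIASES.items():
        for name in candidates:
            if cp.has_option(section, name):
                out[target] = cp.get(section, name)
                break
    return out

old_style = "[technical]\ndateNum = 1999-04-12\n"
print(extract(old_style))  # → {'digitization_date': '1999-04-12'}
```

Once every mandatory field resolves for a batch, the normalized values can be written into the target package's METS-PREMIS metadata.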

This is why, in the case of a Repackaging Migration, it is of utmost importance that the content ingested in the new repository be adequately mastered by our collection managers, as illustrated in [14]. They are the people who carry the history of the collections, and therefore the knowledge about their quirks and idiosyncrasies. Furthermore, because such a repository is so huge (about 1.2 million documents), it is bound to contain data of a wide variety, as detailed earlier. And such variety implies that specific data and metadata may and will be encountered for each kind of document (and sometimes within the same type of document as well).

In order to implement this in a manageable manner, the idea is to define homogeneous batches of content that meet the required criteria, namely:

  • A bibliographical record exists.

  • All the components (in case of multiple carriers) are present.

  • The file formats are known to SPAR.

  • The metadata associated with each component are appropriately populated.

  • The structure of each component falls in an expected pattern.

Each batch is an abstraction of a homogeneous mass of content, which can be treated as a single unit of work.
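A sketch of how such criteria could gate entry into a batch; the predicate functions, field names and format list below are hypothetical stand-ins for the real checks:

```python
# Each criterion from the list above, as a named predicate over a
# document's descriptive dictionary (all names are illustrative).
REQUIRED_CHECKS = [
    ("bibliographic record exists",
     lambda d: d.get("record_id") is not None),
    ("all components present",
     lambda d: None not in d.get("components", [None])),
    ("formats known to SPAR",
     lambda d: set(d.get("formats", [])) <= {"wav", "mpeg2", "iso"}),
]

def eligible(document: dict) -> list:
    """Return the failed criteria; an empty list means 'fits the batch'."""
    return [name for name, check in REQUIRED_CHECKS if not check(document)]

doc = {"record_id": "ark:/12148/xyz",
       "components": ["face A", "face B"],
       "formats": ["wav"]}
print(eligible(doc))  # → []
```

Documents failing one or more criteria are simply excluded and left for a later, more specialized batch, which is what keeps each batch homogeneous.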

A tool, called SelectSABox, was designed to facilitate the definition of each batch, based on the kind of content (currently: content coming from the digitization of audio CDs, VHS, audio cassettes, and DVDs). Thanks to this tool, we were able to define a first set of batches with broad profiles; we then broke them down into more refined batches, each representing one specific goal. We saw that we could master a large portion of the content with a few batches that could be migrated immediately, a good example of the Pareto principle, which states that “for many outcomes, roughly 80% of consequences come from 20% of causes” [15]. This freed up time to focus on more problematic cases, moving from the most general cases to the most specific ones.

In order to actually migrate those batches, we take advantage of three other tools:

  • a first one to implement the workflow of processing: retrieve the content from the old system, update the bibliographic record, ingest in the new system, call to the access workflow, flag to notify the success or failure of the migration;

  • a second one, see Fig. 2, to monitor the different steps and manage the occasional errors;

    Figure 2. Interface to monitor repackaging

  • a third one to monitor the overall progress of the migration.

These multiple simultaneous perspectives allow for close monitoring and facilitate interventions to repair or extract problematic content, which will feed into new batches to be processed later.

This last point is something we learned along the way: once you have defined a rough but precise enough batch, let the system tell you what is wrong instead of trying to predict it beforehand. This lightens the preparation of each batch and informs future decisions with actual error types instead of guesses. For instance, with our first batch we tried to anticipate what could go wrong. We thought of several cases, such as the presence of files in formats not yet managed by our system. However, the system able to conduct such a survey in a systematic manner is precisely the one into which we want to migrate the data… Therefore this survey was made manually and, perhaps predictably, we did not catch all problematic formats. Fortunately SPAR did it for us and refused those information packages. However, the migration system was not yet designed to handle failure, and these packages simply stopped advancing in their migration process, cluttering the working space. This is why we changed our strategy: we added a migration status in SelectSABox that included “abandoned” as a possibility, and patched our migration tools to set this status when a package was rejected by SPAR. Now problematic data can be extracted from the current batch and added to a new batch to be processed later on, after upgrading the system, correcting the data, or both.
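The revised strategy can be sketched as follows: a package rejected by the target repository is marked “abandoned” and queued for a later batch instead of clogging the working space. The status names, the ingest stub and its error type are hypothetical, not SelectSABox's actual interface:

```python
def migrate(package, ingest, statuses, retry_later):
    """Attempt one migration; on rejection, abandon and queue for later."""
    try:
        ingest(package)
        statuses[package] = "migrated"
    except ValueError:                  # e.g. a format rejected by the target
        statuses[package] = "abandoned"
        retry_later.append(package)     # feeds a future, corrected batch

def ingest_stub(pkg):
    """Stand-in for the target repository's ingest, rejecting one package."""
    if pkg == "bad-format":
        raise ValueError("unknown file format")

statuses, retry = {}, []
for pkg in ["ok-1", "bad-format", "ok-2"]:
    migrate(pkg, ingest_stub, statuses, retry)

print(statuses)  # → {'ok-1': 'migrated', 'bad-format': 'abandoned', 'ok-2': 'migrated'}
print(retry)     # → ['bad-format']
```

The essential point is that a failure never blocks the rest of the batch: the rejected package is recorded, set aside, and the pipeline keeps moving.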

An important aspect to bear in mind is that this kind of migration (a Repackaging Migration) should be seen as a complete preservation project for a collection. Indeed, this migration can be compared to stocktaking6 operations: the benefit is the detection of missing items and the launch of additional digitization programs.

Once again, we can see that partitioning both the data and the work around this data is the key to even hoping to perform such a migration: progress is made in an iterative way, from the most general cases to the most specific ones.

Furthermore, it is our experience that, contrary to a Replication Migration, a Repackaging Migration cannot be adequately performed by the IT teams alone. Stakeholders across the library are necessary and welcome contributors to this kind of work. This is very important to fulfill one goal of the merge, which is to widen the use of audiovisual material and bring it into the usual handling of collections.

Conclusion

In order to deal with this massive flow of information, we tried to divide the operations into smaller chunks (partitioning) that have homogeneous features with regard to the operations to manage. This way we can handle the vast majority of the migrations in a highly automated way. We can then focus on the corner or edge cases that may require adjustments, evolutions or manual operations.

When handling this amount of data, it is no surprise that unexpected cases arise. Limiting their impact on the overall process through abstraction is a good way of ensuring that progress can be made nevertheless. Following the Pareto principle, focusing on the 20% of effort that contributes 80% of the work makes those migrations efficient and time-boxed.

Finally, monitoring is of utmost importance: it is an effective way of measuring progress as well as quickly identifying bottlenecks and ultimately getting rid of them.

Acknowledgments

The authors would like to thank all the members of the Digital Preservation team at BnF as well as the many colleagues from the Collection departments who have supported them throughout this adventure.
