Abstract – This paper presents a case study detailing the challenges and strategies adopted in the development of an information model for digital preservation at Bibliothèque et Archives nationales du Québec (BAnQ). The case study explains how we reconciled our specific needs for access and preservation with international standards and best practices. Our innovative contribution is the design of ready-to-use information packages that facilitate access without the need for additional processing. The result is a significant simplification of the management and accessibility of the digital collection. Concrete examples of information packages are provided.
Keywords – Case study, information model, digital preservation, access
This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Kyle R. Rimkus, Tricia Patterson, Tracy Seneca and 1 anonymous reviewer. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.
Bibliothèque et Archives nationales du Québec (BAnQ) is the result of two successive mergers. The first merger in 2002 brought together the Bibliothèque nationale du Québec and the Grande Bibliothèque du Québec. The second merger in 2006 integrated the Archives nationales du Québec.
Over the years, the task of preserving all digitized and born-digital files has been consolidated under the responsibility of the department of conservation and digitization, in collaboration with the information technologies department, which manages the access infrastructure.
In recent years, our orientations, achievements and knowledge-gathering have focused mainly on our mission of access.
Access to the digital collection originally began with the creation of a digital access repository that enabled online consultation of the national library's digitized collections of published documents. This repository, based on DSpace [1], gradually fed specialized distribution interfaces by document type: images, magazines and newspapers, maps and plans, etc. In 2009, this repository welcomed its first born-digital deposits from publishers under a voluntary digital legal deposit. At the same time, the national archives offered their own interfaces for disseminating digitized collections, such as photographs, notary archives and civil registers.
After agreeing on a common metadata schema, all published and archival documents were migrated to the digital access repository. From there, we consolidated more than fifty specialized interfaces into one, BAnQ numérique, which is available through our web portal at https://numerique.banq.qc.ca/. Since 2019, users have access to exhaustive search results on all our collections, whether it be published or archival content.
On the conservation side, three copies of files were stored on DVD media and multiple servers, without metadata or checksums. In 2015, the lack of storage space for conservation files, the obsolescence of DVD media and the prospect of born-digital archive deposits redirected our concerns to our conservation and preservation mission.
We then undertook the analysis and testing of preservation repository prototypes based on the OAIS [2] standard.
The first deposit of born-digital archives arrived in autumn 2019, with the deposit of the first entirely digital commission of inquiry, the Charbonneau Commission. This was our first experience in defining an information model, and our first with an external partner.
From 2019 to 2023, raising our preservation maturity level became a strategic objective of BAnQ's plan as part of its digital transformation.
To achieve this objective, we relied on the OAIS standard, the NDSA levels [3] and community best practices, which enabled us to define our information model. This article will detail the contextual elements and intellectual approach that guided the development of our information model.
First, we will explore the divergent aspects of our collection, which have emerged as significant challenges in our quest to develop a versatile model. Second, we will outline our strategies for reconciling these differences, drawing upon digital preservation guidelines. Finally, we will provide several examples to illustrate the practical application of our information model. These examples will clarify how the model operates in real-world settings, demonstrating its effectiveness and adaptability across our various document types.
Since the merger that led to BAnQ’s creation, the institution has aimed to harmonize or even standardize its practices and tools without distorting the specificities of both its library and archival mandates. To achieve this, establishing a common information model that reconciles all these specificities was crucial. In the following, we will present the elements we had to consider.
BAnQ's documentary heritage comes from the national library for published documents and from the national archives for archival documents, each entity having a different cataloguing approach.
1) Cataloging granularity: Published documents are cataloged in detail, mainly at the level of the complete work, with a differentiation considering the different editions. Archival documents are classified by fonds, with levels of description ranging from series, sub-series, or file down to the individual item. From now on, we will use the term “unit of description” to refer to the level of granularity at which documents are described.
2) Descriptive metadata schemas: Published documents are cataloged in MARC 21 [4] format, while archival documents are based on the Canadian rules for describing archival documents (RDDA) [5]. As these metadata represent a rich resource for identification and retrieval, we wanted them to be part of the information package.
BAnQ has a rich and varied collection of documents. Some types of documents, such as posters, postcards and maps and plans, can be found in both published and archival collections, while others, such as notarial and civil status records, fall exclusively into archival collections. This variety of document types leads to significant differences between the units of description.
1) Number of linked documents: In the case of published documents, a unit of description may correspond to a single document or several, such as multi-volume monographs or periodicals. In the case of archival documents, a unit of description may also correspond to a single document described individually or to hundreds of documents, whether structured or not. For example, a unit of description for an archival photographic collection may contain unstructured files, while a notary archive will bring together structured documents such as indexes, repertoires, and acts.
2) Granularity levels: Generally, a document can be represented by one or more files. In the case of published documents such as periodicals, the concept of an issue, itself represented by several files, is added. In the case of archival documents, such as notarial and civil status records, the notion of grouping pages by date or alphabetical order is added.
3) Completeness over time: In the case of published documents, there are several cases where a unit of description can be enhanced with new documents. This is the case for "living" documents such as multi-volume monographs or periodicals. In the case of archival documents, however, it is rarer to have additions, given the inactive state of the deposits.
4) Frequency of change: In the case of published documents, units of description rarely change. In the case of archival documents, the reorganization of a fonds can often lead to changes in the documents attached to a unit of description.
5) Variety of formats: In the case of digitized documents, we naturally have control over formats. In the case of born-digital files, we have published a guide listing the formats recommended by BAnQ [6]. However, we are likely to receive a multitude of formats, and possibly different formats associated with the same unit of description.
6) Need for altering documents: In some cases, we may need to redact information, such as adoption notices in civil status documents. To maintain the ability to remove the redaction, we keep both versions of the document: the original and the redacted version. Born-digital documents may need to be "corrected" to conform to their format. For sound recordings, editing could provide sharper sound and correct some imperfections for a better listening experience. In such cases, we retain both the original and the modified version of the file.
7) Complementary files resulting from processing: Depending on the type of document, our processes create complementary files, such as the files produced by the character recognition operation for text documents.
Since the conservation and dissemination of our files is not a new activity, we are inheriting pre-existing ways of doing things. Our capacity for change is directly linked to several factors: the scale of the modifications in terms of cost, the associated risk of error and the impact on our users. Despite these challenges, we aimed to integrate preservation and access within a single infrastructure. This strategy was all the more motivated by the fact that our preservation methods were not supported by an application system and that our access solution was based on an outdated system. In seeking to improve this infrastructure, our goal was also to establish a unified workflow for both preservation and access. This desire for consolidation led us to use our existing access model as the basis for our information model.
Having identified the main differences linked to the diversity of our collections and determined that the organization of our files should serve both the purposes of preservation and access, the challenge was to define an information model to support this goal. In what follows, we will present our choices regarding the main elements of an information package.
Our access model is based on the unit of description. Following PREMIS [7], which defines an intellectual entity as "a set of content that is considered a single intellectual unit for purposes of management and description: for example, a particular book, map, photograph, or database.", we explored the implications of considering each unit of description as an intellectual entity.
Because of the differences between units of description described above, we anticipated several problems with this approach, such as:
- The potential need to manage information packages with a significant number of files and a large volume. For instance, the unit of description of a newspaper like “La Patrie”, covering almost 80 years, is associated with approximately 650,000 high-resolution files and 26,500 files dedicated to access, for a total volume of 6 TB.
- The need to update information packages for units of description that are incomplete.
- The need to update information packages following a cataloging change.
- The possibility of having to update an information package several times if it contains multiple formats and we undertake format-specific migrations.
To limit extreme cases in terms of volume and to limit the frequency of updating information packages, we have instead considered the "smallest" intellectual entity approach, regardless of the granularity of the unit of description. For each type of document, we then established the nature of the intellectual entity. In most cases, the "smallest" intellectual entity refers to the document level, whether the document is represented by one or several files. Thus, an intellectual entity may be associated, for example, with a single file as in the case of photographs, with two files representing the front and back of a postcard, or with all the files representing the pages of a monograph. For periodicals, the intellectual entity is defined at the level of each issue. In the case of more complex collections such as directories or civil and notary records, we have linked the intellectual entity to the granularity established for access.
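As an illustration only, and not BAnQ's actual configuration, the granularity rules described above could be captured in a simple lookup used by packaging tools; the type names and levels below are hypothetical simplifications.

# Hypothetical sketch: the granularity at which the "smallest" intellectual
# entity is defined for each document type (names are illustrative only).
INTELLECTUAL_ENTITY_LEVEL = {
    "photograph": "document",          # one file per entity
    "postcard": "document",            # two files (front and back) per entity
    "monograph": "document",           # all page files of the volume
    "periodical": "issue",             # one entity per issue
    "civil_status_register": "access_grouping",  # granularity established for access
    "notarial_record": "access_grouping",
}

def entity_level(document_type: str) -> str:
    """Return the granularity used to build one information package."""
    return INTELLECTUAL_ENTITY_LEVEL.get(document_type, "document")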
For all conservation files for which access is planned, we generate some access files beforehand, such as OCR files. Since these files require significant computing time, they are preserved in the same way as the conservation files, and some of them are also deposited in the access infrastructure. In our approach, we chose to separate the conservation and access files into two distinct information packages, a strategy which offers several advantages. This separation ensures better security by minimizing the need to update information packages containing conservation files when modifications are only made to the access files, such as format corrections or migrations.
It also allows the safe exposure of access files via the web portal without risking the security of the conservation files. Performance-wise, this approach optimizes the volume of information packages and enables the access files to be placed on a higher-performance infrastructure, guaranteeing secure and efficient data management and availability.
To differentiate between information packages containing conservation files and those containing access files, both are assigned the same identifier (uuid), but are distinguished by appending the suffix _AIP for conservation files and _DIP for access files.
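A minimal sketch of this naming convention, assuming the identifier is a standard UUID assigned when the packages are created:

import uuid
from typing import Optional

def package_names(identifier: Optional[str] = None) -> tuple:
    """Return the paired folder names of the conservation (AIP) and access (DIP)
    packages, which share the same UUID and differ only by their suffix."""
    identifier = identifier or str(uuid.uuid4())
    return f"{identifier}_AIP", f"{identifier}_DIP"

# Example: the two packages of a single intellectual entity
aip_name, dip_name = package_names("0a14435d-4e49-4eaa-9d7d-c33f1583b26d")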
Our approach, focused on the "smallest" intellectual entity, raises the importance of precisely documenting the nature and, above all, the order of relationships between information packages. While the relationship is ensured by the identifier of the unit of description, the order is not always based on the same type of information. For example, although a date metadata can be easily used for a periodical, the same does not apply to documents whose order is determined after an intellectual classification.
Since we were already documenting the order of classification when we processed the files, we decided to exploit this information to document the order of the information packages for each unit of description made up of at least two information packages. This is implemented as a JSON file ingested into the preservation repository as an Archival Information Collection (AIC) package, as described in the OAIS standard. Since the information packages containing the conservation and access files have the same identifier at the naming level, this JSON file establishes the order of the information packages, independently of their content. As regards the order of files contained in an information package, we will see later that we use the METS schema, which enables us to record this information within the information packages themselves.
As we were already using the METS schema to record all metadata in our access infrastructure, it was natural for us to continue with this schema. For descriptive metadata, while we had agreed to use the Dublin Core schema for the access repository, we revised our choice to retain the richness of our distinct cataloging methods. Thus, metadata for published documents are in MARC 21 format and those for archival documents in EAD format. We already had technical metadata in the METS file, and we added, as a complement, all the metadata extracted with various tools such as JHOVE and Tika. To keep the METS file as light as possible, we generate a separate JSON file for each content file. Both the METS file and these JSON files are generated by a script developed in-house.
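Our in-house script is not reproduced here; the following is only a sketch of how a per-file technical metadata JSON might be assembled, using the Apache Tika Python binding (the JHOVE step, a separate Java command-line tool, is omitted) and an output layout that merely approximates our actual files.

import json
from pathlib import Path

from tika import parser  # Apache Tika Python binding; requires a Java runtime

def write_tech_metadata(content_file: Path, tech_dir: Path) -> Path:
    """Extract technical metadata for one content file and write it as a
    sidecar JSON, kept separate to keep the METS file light."""
    parsed = parser.from_file(str(content_file))
    record = {
        "file": content_file.name,
        "size_bytes": content_file.stat().st_size,
        "tika": parsed.get("metadata", {}),  # raw Tika key/value output
    }
    target = tech_dir / f"{content_file.name}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return target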
To ensure file integrity at various stages, but also in the event of packages being extracted from the repository, we have chosen the BagIt [8] convention. Not only are the checksums stored in the BagIt manifests, but they are also stored at the application level.
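As a sketch, bagging a prepared package folder and collecting its checksums for application-level storage could look like the following, using the Library of Congress bagit-python library:

import bagit  # Library of Congress bagit-python (pip install bagit)

def bag_package(package_dir: str) -> dict:
    """Turn a prepared package folder into a BagIt bag with SHA-256 manifests
    and return the payload checksums for storage at the application level."""
    bag = bagit.make_bag(package_dir, checksums=["sha256"])
    if not bag.is_valid():
        raise RuntimeError(f"Bag validation failed for {package_dir}")
    # bag.entries maps each file path in the bag to its checksum(s);
    # keep only the payload files under data/
    return {
        path: hashes["sha256"]
        for path, hashes in bag.entries.items()
        if path.startswith("data/")
    }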
In the following, we will present several examples of information packages that illustrate the chosen information model. We use the BagIt structure, and we will detail our customization of the folders and their specific roles.
This case involves an information package containing a single image. The "content" folder holds the file representing the document, in this case an image in TIF format. The "metadata" folder contains the METS file related to the unit of description, and the "content" folder under "tech" includes the JSON file with technical metadata for the TIF file in the "content" folder.
\---0a14435d-4e49-4eaa-9d7d-c33f1583b26d_AIP
| | bag-info.txt
| | bagit.txt
| | manifest-sha256.txt
| | tagmanifest-sha256.txt
| |
| \---data
| +---content
| | 464669.tif
| |
| \---metadata
| | mets-0a14435d-4e49-4eaa-9d7d-c33f1583b26d.xml
| |
| \---tech
| \---content
| 464669.tif.json
The information package containing the access file is structured in the same way, but includes the JPG access format.
\---0a14435d-4e49-4eaa-9d7d-c33f1583b26d_DIP
| bag-info.txt
| bagit.txt
| manifest-sha256.txt
| tagmanifest-sha256.txt
|
\---data
+---content
| 464669.jpg
|
\---metadata
| mets-0a14435d-4e49-4eaa-9d7d-c33f1583b26d.xml
|
\---tech
\---content
464669.jpg.json
This case involves an information package containing four pages of a periodical issue. The structure is the same as previously described, with the addition of character recognition files: a PDF, an ALTO and a text file for each page. These files are in a folder titled “staging” because they are intermediate files used to produce the access files.
\---2606372a-2ec9-4f16-b1a2-b8e8851df9ee_AIP
| | bag-info.txt
| | bagit.txt
| | manifest-sha256.txt
| | tagmanifest-sha256.txt
| |
| \---data
| +---content
| | 6454341_1969-05_0001.tif
| | 6454341_1969-05_0002.tif
| | 6454341_1969-05_0003.tif
| | 6454341_1969-05_0004.tif
| |
| +---metadata
| | | mets-2606372a-2ec9-4f16-b1a2-b8e8851df9ee.xml
| | |
| | +---alto
| | | \---staging
| | | 6454341_1969-05_0001.xml
| | | 6454341_1969-05_0002.xml
| | | 6454341_1969-05_0003.xml
| | | 6454341_1969-05_0004.xml
| | |
| | +---tech
| | | \---content
| | | 6454341_1969-05_0001.tif.json
| | | 6454341_1969-05_0002.tif.json
| | | 6454341_1969-05_0003.tif.json
| | | 6454341_1969-05_0004.tif.json
| | |
| | \---txt
| | \---staging
| | 6454341_1969-05_0001.txt
| | 6454341_1969-05_0002.txt
| | 6454341_1969-05_0003.txt
| | 6454341_1969-05_0004.txt
| |
| \---staging
| 6454341_1969-05_0001.pdf
| 6454341_1969-05_0002.pdf
| 6454341_1969-05_0003.pdf
| 6454341_1969-05_0004.pdf
The information package containing the access file, a unified PDF, also includes a unified text file to facilitate full-text indexing on the web portal.
\---2606372a-2ec9-4f16-b1a2-b8e8851df9ee_DIP
| | bag-info.txt
| | bagit.txt
| | manifest-sha256.txt
| | tagmanifest-sha256.txt
| |
| \---data
| +---content
| | 6454341_1969-05.pdf
| |
| \---metadata
| | mets-2606372a-2ec9-4f16-b1a2-b8e8851df9ee.xml
| |
| +---tech
| | \---content
| | 6454341_1969-05.pdf.json
| |
| \---txt
| \---content
| 6454341_1969-05.txt
The previous example shows an information package for a periodical. We have four issues for this periodical. Each issue is considered as an intellectual entity and is therefore wrapped as a separate information package. The relationship between information packages is recorded in a JSON file that acts as an AIC.
Depending on the case, this JSON file may simply list the order of the packages or contain classification-related information, such as the year and month of each issue.
{
"aic_identifiant": "0006454341",
"treeview": {
"fils": [
{
"type": "d",
"designation": "1969",
"fils": [
{
"type": "f",
"designation": "Mai",
"ref": "2606372a-2ec9-4f16-b1a2-b8e8851df9ee",
"chemin": "1969/05/"
},
{
"type": "f",
"designation": "Octobre",
"ref": "935df97b-105c-475b-94b8-526163b7c3bb",
"chemin": "1969/10/"
},
{
"type": "f",
"designation": "Décembre",
"ref": "446c194a-d416-45b5-a515-b48074c132af",
"chemin": "1969/12/"
}
]
},
{
"type": "d",
"designation": "1970",
"fils": [
{
"type": "f",
"designation": "Octobre",
"ref": "e5ec87c2-bd15-49ad-9c69-e2fb9b80be50",
"chemin": "1970/10/"
}
]
}
]
}
}
The structural information contained in this JSON file is essential for both access and extraction from the repository. In fact, it will enable us to control the display on the web portal and to extract all the information packages or files relating to this periodical in a structured manner.
+---1969
| +---05
| | \---2606372a-2ec9-4f16-b1a2-b8e8851df9ee_AIP
| +---10
| | \---935df97b-105c-475b-94b8-526163b7c3bb_AIP
| \---12
| \---446c194a-d416-45b5-a515-b48074c132af_AIP
\---1970
\---10
\---e5ec87c2-bd15-49ad-9c69-e2fb9b80be50_AIP
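As a sketch of this structured extraction, the tree above can be rebuilt by walking the AIC JSON and copying each referenced package into the path given in its "chemin" field; the key names follow the example above, while the repository layout and the copy step are assumptions.

import json
import shutil
from pathlib import Path

def extract_collection(aic_json: Path, repository: Path, target: Path, suffix="_AIP"):
    """Recreate the classification tree of a unit of description by placing
    each referenced information package under its 'chemin' path."""
    aic = json.loads(aic_json.read_text(encoding="utf-8"))

    def walk(node):
        for child in node.get("fils", []):
            if child.get("type") == "f":   # "f": a leaf referencing one package
                package = f"{child['ref']}{suffix}"
                destination = target / child["chemin"] / package
                destination.parent.mkdir(parents=True, exist_ok=True)
                shutil.copytree(repository / package, destination)
            else:                          # "d": an intermediate level (e.g. a year)
                walk(child)

    walk(aic["treeview"])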
If an original conservation file needs to be modified, as in the case of redaction, a redacted copy is stored in a folder called "edit", and the "staging" files are generated taking the edited file into account. Subsequently, the access files will contain the redacted version.
\---0a5c8562-8a2a-47aa-8728-80bcac7e25d4_AIP
| | bag-info.txt
| | bagit.txt
| | manifest-sha256.txt
| | tagmanifest-sha256.txt
| |
| \---data
| +---content
| | 6351755_2021-04-21_001.pdf
| | 6351755_2021-04-21_002.pdf
| | 6351755_2021-04-21_003.pdf
| +---edit
| | 6351755_2021-04-21_002.pdf
| +---metadata
| | | mets-0a5c8562-8a2a-47aa-8728-80bcac7e25d4.xml
| | |
| | +---alto
| | | \---staging
| | | 6351755_2021-04-21_001.xml
| | | 6351755_2021-04-21_002.xml
| | | 6351755_2021-04-21_003.xml
| | +---tech
| | | \---content
| | | 6351755_2021-04-21_001.pdf.json
| | | 6351755_2021-04-21_002.pdf.json
| | | 6351755_2021-04-21_003.pdf.json
| | \---txt
| | \---staging
| | 6351755_2021-04-21_001.txt
| | 6351755_2021-04-21_002.txt
| | 6351755_2021-04-21_003.txt
| \---staging
| 6351755_2021-04-21_001.pdf
| 6351755_2021-04-21_002.pdf
| 6351755_2021-04-21_003.pdf
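A sketch of the precedence rule implied by this layout: when producing access files, the redacted copy in "edit" is used in place of the original in "content". Folder names follow the example above; the function itself is illustrative.

from pathlib import Path

def source_for_access(aip_data: Path, filename: str) -> Path:
    """Return the file to use for producing access derivatives: a redacted copy
    in 'edit' takes precedence over the original in 'content'."""
    edited = aip_data / "edit" / filename
    return edited if edited.exists() else aip_data / "content" / filename

# Example: page 002 has a redacted copy, so it is taken from 'edit'
# source_for_access(Path(".../0a5c8562-8a2a-47aa-8728-80bcac7e25d4_AIP/data"),
#                   "6351755_2021-04-21_002.pdf")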
So far, our information model has satisfactorily covered all the scenarios arising from the diversity of our document types. The fact that we have been able to implement a consistent approach has several benefits, particularly in terms of the time it takes for our employees to adopt the model and the automation of our internal tools. The granularity of the information packages means that specific changes can be made without affecting large sets of files, reducing the risk of error. In terms of transfer, we also see an advantage in optimizing the volume of data transferred by enabling targeted queries. Moreover, our goal of optimizing our preservation and access infrastructure is achievable because we have all the elements in place to support our web portal.
The most important issue is the maintenance of structural information between information packages. We need to make sure that there are control mechanisms in place to ensure that it is kept up to date. As we migrate our files in the form of information packages, we have found that even though we have chosen the smallest intellectual entity, we are still encountering information packages that are over 30 GB in volume. Beyond this size, we began to see a drop in the performance of our ingestion process into the preservation repository. This has led us to revise the checksum control process, which is proving to be very time consuming.
The power of our model lies in the fact that our preservation files will be straightforward and transparent. They will be retrievable based on access file information, such as their public URL. This "what you see is what you get" approach will significantly simplify the work of our team in responding to preservation file requests, a task that currently occupies one full-time employee. We anticipate reducing the time spent on this task by up to a factor of three, and eventually we will introduce automated delivery.
Since March 2021, we have been gradually migrating all our files in the form of information packages and uploading them to our preservation repository, which will subsequently be linked to our web portal.
In fact, decoupling access files from conservation files gives us more flexibility. For example, we can give access to a separate set of information packages containing access files on distinct storage or even in the cloud, while taking advantage of application centralization. This significantly streamlines our systems and processes without compromising security.
Our work demonstrates how we have moved from theory to practice. The best practices available in the digital preservation community have been a great help in building our information model.
This article is intended as a token of our gratitude to the community and hopefully represents our modest contribution to it.