
Keepers Registry – A Two-Way Street for E-journal Preservation

Querying from and reporting to the ISSN KEEPERS Registry to improve e-journal preservation

Published on Aug 29, 2024

Abstract – This paper gives a short overview of Keepers registry and describes two use cases for interacting with the registry: (1) using it to query data and crosscheck it against institutional holdings to prioritize holdings for in-house digital preservation (2) reporting archival holdings back to Keepers Registry to make institutional preservation activities transparent. Though Keepers registry has been available since 2013, few use case descriptions of how to interact with the registry exist. The paper describes the use cases in the context of TIB’s digital preservation processes and outlines solutions found and challenges encountered. It concludes with an outlook to further work for TIB’s workflows as well as for an e-journal preservation registry in general.

Keywords: e-journals, registries, prioritizing holdings for preservation, international collaboration

This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Paul Stokes, Jack O'Sullivan and 2 anonymous reviewers. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.

Introduction

The preservation of e-journals has been a topic covered at iPRES since the very first instance of the conference in 2004 [1]. While the publishing landscape has changed significantly since then, with digital journal production by far surpassing analog today, key issues such as concerns about (post-cancellation) access still exist. Joint efforts and services for the preservation of e-journals started to be developed around the same time as the first iPRES. LOCKSS [2] was first released at Stanford in 1998 and ITHAKA started its service Portico in 2005 [3]. CLOCKSS started active operations in 2008, serving as a dark archive for several publishers [4]. Other players preserving e-journals are national and large research libraries, as part of legal deposit regulations or individual contracts. In such a widespread landscape, a registry covering who is preserving which titles was a natural step of evolution, and the Keepers registry first went live in 2013. Section II of this paper gives a short overview of the registry's background and current state. This is followed by TIB's use cases for the two different actor roles that exist when interacting with the Keepers registry: that of a user querying the registry and that of an archiving institution reporting to it. Section III describes how we structured our process to query Keepers and what challenges and lessons learned we encountered. Section IV looks at Keepers from the viewpoint of a content-holding archive reporting to Keepers and describes the process, challenges and lessons learned when generating the title and holdings list for the Keepers registry. We conclude this paper with a discussion and outlook in Section V, where we summarize what Keepers can do for digital preservation workflows and what we believe might be missing in the current registry implementation as well as in our own workflows.

Our overarching goal of this paper is a transparent description and discussion of the interaction with a (preservation) e-journal registry. We are hoping to encourage individuals and institutions to use the Keepers registry as querying users as well as reporting institutions, thus strengthening our community through the shared resources we use.

What Is The Keepers Registry

In simple terms, the Keepers registry [5] is an online database that allows searching for an e-journal based on an ISSN or title and returns the archival status of that journal amongst archives that have shared their preserved holdings information with the registry. Neither the idea nor the service is new. Building on early landscape analyses such as those by James et al. [6] or Kenney et al. [7], the need for a shared registry was first addressed by JISC in the funded PEPRS project (Pilot on E-journal Preservation Registry Services) between 2008 and 2013. After piloting the idea of a registry in the first phase, the cooperation between the University of Edinburgh's center of digital expertise EDINA and the ISSN International Center built the Keepers registry in the second phase of the project. The registry went online as a full service in 2013. While the registry went on to be a success, not all PEPRS results did – a notable example is ONIX-PH, an ONIX schema for preservation holdings. Intended as a standard communication format between publishers, archival agencies and registries, it unfortunately never reached v1.0 during the PEPRS project, as originally intended [8],[9].

EDINA continued to host and maintain the Keepers registry until 2019, at which point JISC announced the end of funding for the service. After interim funding from some of the main archival agencies reporting to the registry, Keepers was integrated into the ISSN portal on December 3rd, 2019 [10]. The ISSN International Center (ISSN IC), an intergovernmental institution founded in 1976 by UNESCO and France, is the central institution for the assignment of ISSNs and is therefore closely connected to journal title-level metadata processes. This makes it an ideal host for a registry that aims to bring two types of metadata together: descriptive metadata on the level of the journal title and metadata on archival holdings reported regularly by digital preservation institutions.

The Keepers search takes a title or an ISSN as search criterion, though all titles contained must have an ISSN assigned. What makes Keepers special is the granularity of the holdings information for archived materials of a title. Holding institutions are required not only to report on the title level, but to report down to the issue level of their holdings. Only through this level of granularity can an end-to-end preservation of a journal title be checked and assured.

The registry is free to use in its general form. Additional paid-for services that ISSN offers include API access to Linked Open Data and advanced search options [11]. Institutions that are interested in becoming an official “Keeper”, an archive reporting preserved e-journal holdings to the portal, have to first submit a description of their e-archiving activity to ISSN IC. The Keepers Technical Advisory Committee then decides on the acceptance of the institution as a Keeper. Finally, a Memorandum of Understanding is signed between the archiving organization and ISSN. All archiving agencies have to give a brief description of their ingest and preservation workflows, library access to content, auditing of content, policies and procedures. The availability of this description on the webpage gives users a high-level understanding of the Keepers in the context of digital preservation [12].

As of March 2024, 19 Keepers from 10 countries are archiving 92,239 journal titles. 22,132 of those are archived by 3 or more Keepers [13].

TIB’s Use Case 1: Querying From Keepers

Our initial motivation to query the Keepers Registry was to use it as a risk analysis tool for TIB’s e-journal holdings concerning the availability and accessibility to our users. The question we wanted to answer was “Which e-journals in our holdings are already archived by other archival institutions?”

We ran our first larger-scale query of the Keepers registry in 2017, when Keepers was still hosted at EDINA and contained 34,357 journal titles. Around that time, an API for the registry first became available [14]. The query was repeated on the EDINA-hosted Keepers registry in 2019, shortly before the system was taken offline there. In 2023, we wrote a new script to check against ISSN Keepers and reran the query there. We intend to re-run the query about every 2 years, using the results to check what percentage of our e-journal holdings are not preserved anywhere yet. This helps us answer the question concerning the risk of availability. These statistics help us make a case for in-house e-journal preservation, especially when faced with internal and external stakeholder claims that “Everything is being preserved by Portico, LOCKSS and CLOCKSS!”. In addition, we use the data to prioritize which e-journals we want to preserve.

Who would give us access?

Our first query in 2017 returned 8 Keepers who archived parts of our holdings (Portico, CLOCKSS, Scholars Portal, KB e-Depot, Global LOCKSS Network, National Science Library / Chinese Academy of Sciences, British Library and Library of Congress). While we had created different views indicating the number of agencies that archived each title, we started to think about the access implications connected to each archival agency. Content is usually provided by the publisher, but a contract may allow the archiving agency to make the content available if a "trigger event" takes place, such as a downtime of the publisher portal for a specified number of days [15]. Conversations with some of the national libraries in that list as well as with publishers brought to light that in many cases, triggered content can only be made available within the respective library’s country. We therefore limited our subsequent queries to those archival agencies that could give us and our customers access to triggered content, addressing the risk of accessibility. This leaves us with Portico, as TIB is part of the German Portico consortium; CLOCKSS, as they trigger content with a Creative Commons open access license [16]; and the Global LOCKSS Network (GLN) – while TIB is not a current member, GLN is an open network that international libraries can join. Out of those three archival agencies, Portico is the one that aligns closest with our own functional digital preservation approach. LOCKSS and CLOCKSS focus on bitstream preservation only, with no functional preservation processes in place [17].

Table 1 summarizes the results of the most recent 2023 query of TIB journal holdings against the Keepers registry. It shows that the majority of our titles (62.35%) are currently not archived by any of our 3 preferred Keepers. Portico, alone or in combination with the other two, covers 34.79% of the titles; LOCKSS and CLOCKSS without Portico cover only 2.86%.

| Keepers                    | % of TIB journals covered |
|----------------------------|---------------------------|
| None                       | 62.35%                    |
| Portico only               | 10.14%                    |
| CLOCKSS only               | 2.44%                     |
| LOCKSS only                | 0.38%                     |
| Portico + CLOCKSS + LOCKSS | 8.95%                     |
| Portico + CLOCKSS          | 10.70%                    |
| Portico + LOCKSS           | 5.00%                     |
| CLOCKSS + LOCKSS           | 0.04%                     |

Table 1: Results of the 2023 query

Script-based query

In 2020 ISSN added Keepers metadata to its Linked Data Application Profile [18], paving the way to machine-readable archival status data. In addition to the API that ISSN offers as a paid-for service, it is possible to obtain an XML, JSON or Turtle output of every result by following the regular link syntax of a web-interface query and adding the suffix ?format={outputformat}. This enabled us to write a Python script that queries the Keepers website against a list of our holdings’ ISSNs and then scrapes the relevant values “ISSN”, “ISSN Status”, “ISSN Record Status”, “ISSN-L”, “ISSN-L Status”, “Cancelled in Favor of”, “mainTitle”, “keyTitle” and “Holding Archives”. The script, which is publicly available on GitHub [19], creates a .csv file with the above listed values as well as a .txt file with all encountered holding institutions, a .txt file with all ISSNs successfully queried, and a log file. The log file keeps track of every individual query and reports errors if something went wrong. Examples of errors include an unknown ISSN, incomplete holdingArchive information (missing archive name while holding coverage is present), or a result record that is too small to be complete. A number of additional issues are captured via the ISSN status values, such as “cancelled” (the ISSN was cancelled and replaced by a different one), “suppressed” (the journal was never published) or “unreported” (a legitimate ISSN for which no further information has been reported yet).
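The core of this querying approach can be sketched as follows. The URL pattern with the ?format= suffix follows the portal behaviour described above; the JSON-LD field names (@graph, heldBy) are assumptions about the record structure for illustration only and may differ from what the actual script parses:

```python
import json
import urllib.request

# ISSN portal record URL; appending "?format=json" requests
# machine-readable output instead of the HTML page.
PORTAL = "https://portal.issn.org/resource/ISSN/"

def record_url(issn: str, fmt: str = "json") -> str:
    """Build the portal URL for one ISSN with a format suffix."""
    return f"{PORTAL}{issn}?format={fmt}"

def extract_archives(record: dict) -> list:
    """Collect holding-archive names from a JSON-LD record.
    The "@graph"/"heldBy" keys are hypothetical placeholders for
    the real record schema."""
    archives = []
    for node in record.get("@graph", []):
        holder = node.get("heldBy")
        if holder:
            archives.extend(holder if isinstance(holder, list) else [holder])
    return archives

def query_issn(issn: str) -> dict:
    """Fetch and decode one record from the portal."""
    with urllib.request.urlopen(record_url(issn), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

In the actual workflow, the extracted values are appended row by row to the .csv result file, with failures routed to the log file.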

Challenges and Lessons Learned

The script captures all holding archives, which we can then post-process using standard data analysis tools such as OpenRefine to generate statistics answering questions such as “Which titles are not archived by Portico?”, “Which titles are not archived by Portico, CLOCKSS or LOCKSS?”, or “Which titles are not archived by a European national library?”.
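A minimal sketch of this post-processing step, operating on the “Holding Archives” column of the script's output; the exact archive name strings and the semicolon separator are assumptions for illustration:

```python
from collections import Counter

# The three preferred Keepers discussed in the text.
PREFERRED = {"Portico", "CLOCKSS", "Global LOCKSS Network"}

def coverage_label(holding_archives: str) -> str:
    """Map a semicolon-separated archive list to a coverage category,
    e.g. "Portico; CLOCKSS" -> "CLOCKSS + Portico"."""
    archives = {a.strip() for a in holding_archives.split(";") if a.strip()}
    hits = sorted(archives & PREFERRED)
    return " + ".join(hits) if hits else "None"

def summarize(rows) -> dict:
    """Tally coverage categories over all queried titles, as percentages."""
    counts = Counter(coverage_label(r) for r in rows)
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}
```

Applied to the full result set, this yields the category percentages shown in Table 1.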

The TIB ISSN input list was generated by TIB’s serials library team. They created the list based on TIB holdings listed in two national German serial holding portals: EZB Elektronische Zeitschriftenbibliothek [20] and ZDB Zeitschriftendatenbank [21]. Deviations between “valid” ISSNs in EZB and ZDB as compared to the ISSN Keepers registry can therefore reflect the quality of any of the three sources. Around 0.17% of the ISSNs queried (for TIB holdings: n=31 out of 18,518; for TIB/LUH University Library holdings: n=71 out of 40,596) were returned as “Cancelled”, with a second ISSN (“Cancelled in Favor of”) being reported back. This can be indicative of a lack of updates to the ISSN records in EZB and ZDB. Less than 0.03% of ISSNs (for TIB holdings: n=0 out of 18,518; for TIB/LUH University Library holdings: n=16 out of 40,596) were returned as provisional ISSNs, which had been registered but for which no further information had been provided. Rechecking them now still returns the same results in the ISSN portal, while they do have records within ZDB (e.g., ISSN 6394-2925, “Brandschutz in öffentlichen und privatwirtschaftlichen Gebäuden”). A similarly low number of cases was found for “Suppressed” ISSNs, which, according to the ISSN portal, “correspond to an ISSN as a related resource (that) has never been published or appears not to be a continuing resource”. These suppressed ISSNs (for TIB holdings: n=6 out of 18,518) need further investigation, as they have valid ZDB entries as well as a functioning journal website that lists the ISSN (e.g., ISSN 2221-0997 for the International Journal of Applied Science and Technology) [22].

We split larger input sets into batches of 10,000 ISSNs, since we discovered that with larger batches, pauses or even timeouts can occur. Adding a --delay parameter to the script appeared to have no impact on these timeouts. If a timeout occurs, up to 5 retries are run – after that, the script terminates with an exception. Since writing the result set is the last step of the script, the entire process has to be re-run if it terminates unexpectedly. While we do realize that there is room for improvement, the script scaled fine for our given input set.
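The batching and retry behaviour described above can be sketched as follows; the fetch callable stands in for the actual portal request, and the back-off scheme is an illustrative assumption rather than the script's exact logic:

```python
import time

BATCH_SIZE = 10_000   # batches of 10,000 ISSNs, as described in the text
MAX_RETRIES = 5       # up to 5 retries before terminating with an exception

def batches(issns, size=BATCH_SIZE):
    """Yield successive fixed-size slices of the input list."""
    for i in range(0, len(issns), size):
        yield issns[i:i + size]

def fetch_with_retries(issn, fetch, delay=1.0):
    """Retry a flaky query up to MAX_RETRIES times before giving up."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return fetch(issn)
        except OSError:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(delay * attempt)  # simple linear back-off between retries
```

Writing partial results after each batch, rather than only at the very end, would remove the need to re-run the whole process after an unexpected termination.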

TIB’s Use Case 2: Reporting to Keepers

In January 2024, TIB officially became the 18th Keeper [23]. While the strategic motivation behind the cooperation was to make our preservation activities more visible and to contribute to a platform we had been using as a research tool for many years, there were very practical benefits to the cooperation as well. One, no publicly accessible central overview of our e-journal archival holdings existed at that point, and this would force us to create one. Two, along the way we would be able to cross-check our metadata requirements and homogenize the data we capture on the different hierarchical levels of e-journals, i.e., on title, volume, issue and article level.

The last benefit might come as a surprise, as one would expect all metadata within an archive to already be homogenized, at least within content/publication type groups such as e-journals, but life is, as always, a little more complex. Some differences in e-journal metadata are due to requirements that have grown over the 10 years that TIB’s archive has been in production; another reason lies in different metadata sources. For some workflows, metadata is passed through an external system, such as a publisher’s OAI-PMH interface. For other workflows, article packages are delivered into TIB’s infrastructure, where we create SIPs with descriptive metadata we extract from the publisher’s article-level metadata schema, e.g. from WileyML 3G records for articles from the publisher Wiley.

Selecting what to report

When reporting holdings to the registry, Keepers can set the archival status of each line item to one of three values: “In Progress”, “Preserved” and “Triggered”. While we currently have no “Triggered” content in our archive, this value is something we will take into consideration during our regular updates. However, as a first step, we needed to decide whether we wanted to report at the “In Progress” or at the “Preserved” stage. Due to the different delivery pipelines and metadata structures mentioned above, reporting prior to (pre-)ingest would add an extra layer of complexity, whereas reporting only those holdings in archival storage would allow building one robust reporting pipeline. We therefore decided early on to only report titles that have reached our archival storage and are hence in the status “Preserved”.

As a national subject library, TIB does not have a national legal deposit for e-journals. Collections are therefore much smaller than those of large national libraries. An exception to this is the Wiley DEAL Dark Archive: as part of the transformative publish & read agreement between the German DEAL consortium and the publisher Wiley, TIB functions as the national dark archive entity for back and current issues of over 2,000 Wiley e-journal titles [24]. A similar agreement has been reached with Springer Nature, where the back issue delivery is expected to start in the second half of 2024 [25]. Smaller e-journal collections included in the Keepers report comprise science & technology related titles from TIB’s own TIB Open Publishing service [26] and titles that are part of OLAM – the Oberwolfach Leibniz Archive for Mathematics [27].

In addition to all e-journals without an ISSN, two e-journal collections are currently excluded from TIB’s Keepers report: those e-journals which our digital-preservation-as-a-service customers are archiving with us – in that case, the control over trigger event access and archival rights lies with the customer and not with us – and the ChemZent database collection. The latter is an interesting case. While “Chemisches Zentralblatt” was a journal of abstracts that has been digitized and is available as an ISSN entry in the Keepers registry, ChemZent, for which TIB and CAS have negotiated a dark archive agreement, is a database derivative of the journal. In addition to the digitized abstracts, ChemZent includes translations of the abstracts into English. Furthermore, not all abstracts of the Chemisches Zentralblatt are included in ChemZent. Though based on a digitized journal, ChemZent is in our opinion a derivative database product and therefore not suited for Keepers.

Extracting information from Digital Preservation System

ISSN requires the Keepers report to be sent as a .csv file with one line per online ISSN, containing the fields onlineISSN, printISSN, title, publisher, archive starting date, archive ending date, status, holdings, address and date of update.
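A minimal sketch of writing such a report with Python's csv module; the header spellings below paraphrase the fields listed above and are assumptions about the exact column names ISSN expects:

```python
import csv
import io

# One column per field required in the Keepers report (names assumed).
FIELDS = ["onlineISSN", "printISSN", "title", "publisher",
          "archiveStartDate", "archiveEndDate", "status",
          "holdings", "address", "dateOfUpdate"]

def write_report(rows) -> str:
    """Serialize report rows (dicts keyed by FIELDS) to CSV text.
    Missing fields are left empty rather than raising an error."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In production, the row dicts would be populated from the digital preservation system as described in the next section.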

As mentioned above, we chose to generate the report from our digital preservation system. TIB uses Rosetta by Ex Libris / Clarivate as its core preservation software. Rosetta contains a “collection” functionality, which is the option to organize Intellectual Entities (IEs) within a hierarchical collection structure [28]. Each level of the collection can contain metadata. While Rosetta supports many different descriptive metadata schemas as source metadata, the main metadata schema is Dublin Core. The Dublin Core record of an IE is fully indexed, used in the data management component and made available via APIs. Collection metadata and collection members can be queried via Web Services [29]. Thankfully, we have used the collection tree to structure our e-journal deposits from the start, treating each article as an IE and building a hierarchical collection structure from publisher to article: Publisher → Journal Title → Volume/Year → Issue → Article. Since each hierarchical level contains descriptive metadata in the form of dc:title, dc:date and dcterms:isPartOf, we can populate the data required for the Keepers report from a combination of our collection and IE metadata.

For this, we created a Python script which takes a list of the respective collections by name and ID as input and then queries Rosetta by those IDs for further metadata, building the required holdings logic as well as the start and end year of each title collection along the way.
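The holdings logic can be sketched as follows; the input here is a simplified list of volume-level dc:date values rather than a live Rosetta Web Services response, and the year-range rendering is an illustrative assumption about the report's holdings format:

```python
def title_coverage(volume_dates):
    """Return (start_year, end_year) derived from the dc:date values
    found on the volume/year level of a title's collection tree."""
    years = sorted(int(d[:4]) for d in volume_dates if d)
    return years[0], years[-1]

def holdings_statement(volume_dates):
    """Render a simple year-range holdings string for the report,
    collapsing a single year to just that year."""
    start, end = title_coverage(volume_dates)
    return str(start) if start == end else f"{start}-{end}"
```

The real script walks the collection tree via the Rosetta Web Services before applying this kind of aggregation per journal title.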

Challenges and Lessons Learned

Unfortunately, every web service output is only as good as the metadata in the system. As mentioned in the introduction, one of the key benefits of our reporting use case was an overall improvement of our metadata. Even though we had used the collection hierarchy for all our e-journal holdings, we had only leveraged rules to check for mandatory metadata fields on the level of the actual articles, as those are our Intellectual Entities (IEs). This resulted in cases where the online ISSN, for example, was available at the article level but not at the top hierarchical level of the e-journal title. Other cases included the top level having an online ISSN but missing the journal title or a dc:date. For identified error cases, we updated and corrected the collection metadata accordingly. To prevent these errors from happening in the future, we extended our descriptive metadata policy for e-journals to include mandatory collection-level fields. As part of this policy change, we now crosscheck metadata deliveries against the newly identified required fields and ask producers to include missing fields, where applicable. We also updated our SIP-generating scripts to match the policy – this ensures that collections are automatically populated with the correct metadata when new journal titles, volumes or issues are deposited into the system. Other errors in our collection structure that we detected as part of this process included extra hierarchical levels that had been created erroneously: a forward slash included in a journal title, which the system interprets as an indicator of a new hierarchical level during the creation process, resulted in the title being split across two levels.
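The collection-level checks introduced with the extended policy can be sketched as follows; the field names and the slash-replacement character are assumptions for illustration, not our actual policy values:

```python
# Mandatory collection-level fields per the extended policy (names assumed).
MANDATORY = {"dc:title", "dc:date", "dc:identifier:issn"}

def missing_fields(collection_md: dict) -> set:
    """Return the mandatory fields that are absent or empty
    on a journal-title collection record."""
    return {f for f in MANDATORY if not collection_md.get(f)}

def safe_collection_name(title: str) -> str:
    """Replace forward slashes before collection creation, since the
    system would otherwise interpret them as hierarchy separators.
    The replacement character is a placeholder choice."""
    return title.replace("/", "-")
```

Checks of this kind run against metadata deliveries before SIP generation, so that gaps are flagged to producers instead of propagating into the collection tree.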

Overall, the initial process helped us to significantly improve both our collection metadata and the processes that generate it. An additional benefit is the ability to easily create reports of title holdings in which our library teams can sanity-check gaps, thus improving our archival coverage. And, of course, the initial requirement, i.e., a straightforward process to generate the Keepers holdings lists, has also been implemented. We are currently fine-tuning the script and will make it publicly available via our GitHub later in 2024.

Discussion and Outlook

The previous sections on our two use cases already highlighted some of the benefits we perceived through the interaction with the Keepers registry. While we had initially seen reporting to Keepers mainly as a transparency and visibility benefit, developing the process brought significant metadata quality improvements along with it, resulting in metadata enrichment as well as an elaborated internal policy for descriptive metadata. The ability to report against a set of metadata that other major e-journal archives can provide also builds confidence in our processes. As this was our first use case of reporting out based on collection metadata, it also served as a nice demonstration of the general functionality of our digital preservation system. Since we are only starting to report to Keepers, more benefits but, most likely, also more challenges are expected along the way.

The situation for querying from Keepers is different – here we now have over 7 years of experience and have gone through three large query runs against our entire e-journal holdings. While the initial idea to use the data to prioritize which e-journals we need to archive is still valid, the data is currently used more for case-by-case decisions than for large-scale prioritization lists based on risk analysis outcomes. As so often, the reason for this is resource limitations and shifting priorities, e.g. to the aforementioned large DEAL e-journal deposits. However, the query outcomes have had a significant impact on the digital preservation strategy and processes at an institution-wide level. The fact that 62.35% of our journal holdings are currently not preserved by a preferred archival agency has underlined the urgency of including preservation in our licensing processes for e-journals. E-publication license negotiations now follow a cascading model: at the first level, a TIB right to preserve is negotiated. If this fails, preservation through Portico can be substituted as an alternative. Only if this fails as well is LOCKSS/CLOCKSS an acceptable last option. Negotiating archival rights is especially crucial for titles published outside of Germany – for those published and licensed under German law, the uses permitted by law for teaching, science and institutions under the act on copyright and related rights (Urheberrechtswissensgesellschaftsgesetz, §60e) apply [30] – unless, of course, the publisher explicitly forbade this in the license contract. For Open Access materials, library teams are urged to follow a similar procedure, clarifying the license status and the archival status at Portico and LOCKSS/CLOCKSS, especially for licenses outside of Creative Commons.

It is safe to say that a visual presentation in the form of a Venn diagram of the figures shown in Table 1 has served as an excellent communication tool on the preservation status of TIB’s holdings, raising awareness of the problem across the entire institution as well as among external stakeholders. The waterfall negotiation model mentioned above is now included in TIB’s internal content and archival strategy document. In addition, the query results have shown that a return on investment for a LOCKSS or CLOCKSS membership currently does not exist for TIB, as the coverage by those two archival agencies alone is very low.

While we will continue to improve and update our scripts for both use cases, they rely on the stability of the interfaces we have with the Keepers registry. Especially the web scraping approach for our query has a direct dependency on the JSON structure reported by the ISSN portal and on the link logic of the URLs we use. We currently consider feature-breaking updates to the JSON data unlikely, as it is based on the ISSN linked data application profile, which has been stable since 2020 [31]. A higher risk exists in the form of ISSN no longer making JSON or alternative machine-readable output publicly available, especially since the paid-for service “Submit.Retrieve.Reuse.” was kicked off in 2023. Much like TIB’s approach, ISSN’s Submit.Retrieve.Reuse. service allows institutions to check the archival status of a library’s journal collections; however, it does so via the convenient ISSN portal interface and also offers an aggregation of ISSN inconsistencies such as duplicates [32]. Should ISSN decide to discontinue the publicly available JSON data, TIB could alternatively use the API access to the linked data via our ISSN portal license.

Besides these concrete risks to our existing workflows and mitigation strategies for them, we also see general shortcomings of the Keepers registry. The core problem of the scale that the registry uses to measure holdings completeness is that it stops at the issue level. As Peter Burnhill, one of the initial architects of Keepers, already pointed out in 2013 [33]:

“There is a presumption in The Keepers Registry that all articles for a given issue and volume are safely gathered by each archiving organization. It might be helpful to have re-assurance from each Keeper that what was gathered corresponded to the table of contents for each issue and volume.”

To some degree this is mitigated by the description of workflows that the Keepers have to provide. However, the publishing landscape has changed significantly since Burnhill’s 2013 statement. An e-journal is no longer simply a digital version of a print journal in extent and media richness. Issues can now contain hundreds of articles, with the table of contents spanning multiple webpages that one has to click through. In addition, an article itself can contain supplementary materials, which can be either hosted on the publisher’s site or linked out to external repositories. While browser-based reading of an article relies on the XML/HTML structures of a publication including embedded figures, PDF derivatives are in some cases no longer offered. Relying on publishers to say how many articles an issue contains is sometimes futile, as they themselves stop counting and reporting at the issue level. Completeness checking of e-journal deliveries is far from a trivial task [34]. Another issue at the article level concerns updates such as errata or retractions. How can we know whether archives receive and preserve these as well? And how do they link these versions to each other?

Natural candidates for tracking holdings at a finer granularity than the issue level are persistent identifiers such as DOIs. Recent work by Crossref’s Martin Eve on DOI preservation and stability [35],[36] only cemented in numbers what the digital preservation community has long known: DOIs are only as persistent as the archive taking care of the object. But are registries like Keepers ready to take on that level of granularity? The Keepers registry’s success is in no small part due to the simplicity of its structure and the responsibility that the maintainer – ISSN IC – holds for the identifier, the ISSN itself. The situation for persistent identifiers like Handle, DOI, URN, ARK and others is much more complex. Should we not instead try to improve the data that is in Keepers right now, both in quality and in extent?

Again, Burnhill had already described the same issues we encountered around ISSNs over 10 years ago. He pointed out incorrect use of the print ISSN, totally incorrect use of an ISSN, a missing ISSN where one should have been, and an ISSN where none had been assigned as existing error cases [37]. However, while quality issues do exist, the error rates we saw were very low (<0.2%). A lack of extent within the registry, especially in the form of participating Keepers, is a different topic. It is especially unfortunate that not all national libraries with legal deposit and digital preservation capabilities report their holdings to the Keepers registry.

The fact that existing registries are important but often overlooked resources in digital preservation practice is highlighted by the recent kick-off of the “Registries of Good Practice” project by the DPC (Digital Preservation Coalition). In the spirit of this project, which has the high-level goal “to help the whole digital preservation community get more out of the registries and records we have, in order to move the practice of digital preservation forward” [38], we presented two use cases for the Keepers registry in this paper. The first process described interacting with Keepers via a query, something that Tavernier, Westervelt and Carlson described as “AIM” for Audit, Identify and Mandate preservation [39]. As a true digital preservation pun, the second process we highlighted, that of libraries reporting to Keepers, must then be called “AIP” for Archive, Identify, Present. Our hope is that the more institutions interact with Keepers via AIM or AIP, the better the registry will become. We hope to have shown the benefit that registries like Keepers can bring to digital preservation workflows.
