Abstract – In the absence, as yet, of a digital archive compliant with the OAIS model requirements, Stockholm University (SU) is nevertheless continually developing a "Harvest Combine" tool for harvesting and transformation of metadata and data files from data repositories that are used by SU researchers, such as Figshare, Dryad and Zenodo. The metadata are collected from these repositories, enriched with metadata from other sources, and then transformed to accord with the Swedish National Archives' recent implementation (aka FGS 2.0) of the European Common Specification for Information Packages (E-ARK CSIP) and Specification for Submission Information Packages (E-ARK SIP), versions 2.1.0. The metadata records are then stored in the SU local (temporary) archive together with the associated data files, which are harvested simultaneously. Part of the motivation for this preparatory digital preservation work is the perceived risk of entrusting the digital preservation of research data files produced by SU researchers to external repositories that we do not fully control locally. The end products of this harvest and transformation processing are SIPs (Submission Information Packages), still awaiting future transformation to AIPs (Archival Information Packages) and eventually DIPs (Dissemination Information Packages). The associated data files are not transformed or converted in this first step towards long-term preservation and archiving, but we keep track of the file formats ingested, partly by mapping file extensions to mime types in a special registry XML file, used in the transformation processing and continuously updated whenever previously unseen file formats are encountered.
The software scripts that we developed for this "harvest combine" are written in BASH (Unix shell), XQuery and XSLT. The processing of metadata, data and scripts occurs locally using Git Bash (for Windows), BaseX and Oxygen XML Editor, but is essentially software-tool agnostic. Metadata input sources are e.g. OAI-PMH feeds (DataCite or METS), repository-specific APIs (to get the necessary file metadata) or generic command-line scripts (e.g. for checksums).
Only recently have we also created a cross-sectional Digital Preservation Group at Stockholm University Library, whose first task is to create an inventory of external and internal source information systems (e.g. data repositories), together with their local destination archival or storage systems. This inventory is intended to serve as a basis for a digital preservation plan for future long-term preservation.
Keywords – harvesting, metadata, transformation, research data, repositories, long-term preservation.
This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Sam Alloing, Eld Zierau, Dan Noonan and Nathan Tallman. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.
Recent studies have shown that scholarly articles cannot always be trusted to be archived and preserved long-term, despite having a persistent identifier such as a DOI.[1][2][3] If that is the case with published articles, the backbone of the record of scientific results, one might ask how large a share of the raw research datasets underpinning these results is similarly left without proper archiving and long-term preservation measures.
Part of the motivation for our preparatory digital preservation work, by means of harvesting and transformation of research data into SIPs (Submission Information Packages), stems from a perceived risk in entrusting the digital preservation of research data files produced by Stockholm University (SU) researchers to external repositories that we do not fully control locally. External repositories like Dryad, Figshare and Zenodo, regardless of their present business models, might undergo future changes of ownership or usage conditions. Further, as far as we know, it is currently not part of their services to actively monitor the file formats of uploaded material in order to identify risks of obsolescence and pending needs for conversion. Regardless of this, we are also bound by Swedish legislation and local archival regulation to retain, on premises, research data and other research information, whether published or not, for at least ten years. It is also for this purpose that we started developing our "harvest combine" four years ago. During this process it has become a rather complex piece of equipment, using "seeds" from several metadata sources and extracting the different pieces of information that are needed to put together these SIPs in accordance with relevant archival standards. However, it is still only semi-automatic, in the sense that it requires a human driver to run it – just like most real harvest combine machines.
One of the challenges for long-term digital preservation is to ensure that metadata will continue to be intelligible and possible to interpret accurately, with preserved quality and FAIR-ness – all this in an ever-changing environment of metadata standards and file formats. As the Red Queen tells Alice in Through the Looking-Glass: "Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!"[4] So, apparently, we will have to keep on running in order to preserve both metadata and data files, and make sure that they will be fit for potential re-use for decades ahead.[5]
One of the main tools for adapting to and keeping pace with the evolution of new standards, formats – and new versions of standards – in this ever-changing environment is validation schemas. Validation schemas are mainly seen as a means of checking data quality and fitness for use, but they are also important for long-term preservation.[5] They are keys to the interpretation of metadata standards and terms. Therefore, validation schemas should ideally be kept together with the information packages that they define.[6][7] This is something that we implemented only recently in our "harvest combine", which was first developed for our Figshare for Institutions instance (at su.figshare.com), and then only with links to the relevant validation schemas in the schemaLocation attributes of the transformed metadata output. The recent changes also involved a shift from a national Swedish archival standard (FGS 1.2)[8] to the European Common Specification for Information Packages (E-ARK CSIP) and Specification for Submission Information Packages (E-ARK SIP), versions 2.1.0, adopted by the Swedish National Archives in 2023 as package structure FGS 2.0.[9]
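As an illustration of what keeping the schemas inside each package enables, a minimal validation sketch in BASH follows. It assumes that xmllint (libxml2) is available and that the XSD file referenced below has already been downloaded into the package's schemas subfolder; the file names are illustrative only:

#!/usr/bin/env bash
# Minimal validation sketch; file names are illustrative.
# Assumes xmllint (libxml2) is installed and that the XSD has been stored
# in the package's schemas/ subfolder, as described in the text.
PACKAGE_DIR="$1"                             # path to one item package folder

xmllint --noout "$PACKAGE_DIR/METS.xml" || exit 1                # well-formedness
xmllint --noout --schema "$PACKAGE_DIR/schemas/mets.xsd" \
        "$PACKAGE_DIR/METS.xml"                                  # schema validity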
We first started developing our harvesting and transformation tool for our institutional Figshare instance at su.figshare.com, as indicated above partly motivated by a perceived risk in trusting a commercial agent, Digital Science, as the sole proprietor responsible for future access to the records and files, let alone for long-term preservation measures. Initially, it also seemed that Figshare would make for a good start, since it provided, as one of several metadata export standards, OAI-PMH feeds in METS format. Apparently, this is still a unique feature of Figshare among the data repositories that we have encountered. This appeared to be a good fit, with METS as the preferred "wrapper" archival format recommended and used by the Swedish National Archives. Thus, in the beginning we expected, perhaps somewhat naively, the transformation to the Swedish national archival standard, then still FGS 1.2 [8], to be easily achieved – from METS to METS. However, we soon discovered that the original METS feeds from Figshare, retrieved via OAI-PMH, were far from sufficient as the only metadata source. To begin with, they did not even constitute well-formed XML until namespace prefixes had been removed from attributes. Furthermore, the mandatory structMap element was missing in the output.[5] These early faults were soon fixed after a dialogue with Figshare support staff and developers. More important was the fact that the METS feeds still did not provide the necessary file metadata, such as file sizes, mime types or original file names. For this information we had to find other metadata sources, such as the Figshare API, by means of which we could also capture the custom metadata field content (e.g. department at SU) that we had added to enrich the item records in our local Figshare instance.
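To make the two metadata sources concrete, the following hedged sketch retrieves a METS feed via OAI-PMH and the file metadata for one item via the Figshare public API; the OAI-PMH endpoint, metadataPrefix value, article id and JSON field names shown here are assumptions or placeholders rather than the exact calls used by our scripts, and jq is assumed to be available:

#!/usr/bin/env bash
# Hedged sketch of the two Figshare metadata sources discussed above.
# The OAI-PMH endpoint, metadataPrefix, article id and JSON field names
# are assumptions or placeholders, not the exact calls in our scripts.

# 1) OAI-PMH feed in METS format (descriptive metadata, but no file details)
curl -s "https://api.figshare.com/v2/oai?verb=ListRecords&metadataPrefix=mets" \
     -o figsMETSfeed.xml

# 2) Figshare article API for file metadata (sizes, checksums, original names)
ARTICLE_ID="1234567"                                   # placeholder
curl -s "https://api.figshare.com/v2/articles/${ARTICLE_ID}" \
  | jq '.files[] | {name, size, computed_md5}'         # field names assumed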
To get hold of the mime types, we constructed a map from file extensions to mime types, which is continuously growing as we encounter new file formats used by our researchers. This is thus a partly manual process, where we regularly consult external sources such as datatypes.net, fileinfo.com and the PRONOM format registry, hoping to find a suitable match, well aware of the fact that there is often not a one-to-one correspondence between file extension and mime type, as several different formats sometimes use the same file extension abbreviation. Despite this ambiguity, we prefer using mime types rather than PRONOM PUIDs as file format identifiers, since many research data file formats are still missing from PRONOM. This is sometimes a time-consuming endeavour, but we still consider it worth the effort, since it allows us to keep track of the file formats we have in store. (An early warning and trigger for this effort was, in fact, a poster presented by Sheffield University at IDCC 2017 in Edinburgh, showing the results of an internal file format "audit" by means of DROID, which revealed more than 70% of the file formats in storage to be "unrecognized".[10] We certainly want to avoid a similar situation at Stockholm University in the near future!)
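As a hedged illustration of how such a registry can be consulted during processing, the following sketch looks up the mime type for a given file extension; the element and attribute names of filext2mimetypeMapMAIN.xml are hypothetical here, and xmllint is assumed to be available:

#!/usr/bin/env bash
# Illustrative lookup in the extension-to-mime-type registry.
# The registry layout <map><entry ext="csv" mime="text/csv"/></map> is
# hypothetical; the real filext2mimetypeMapMAIN.xml may differ.
MAPFILE="filext2mimetypeMapMAIN.xml"
EXT="${1##*.}"                               # e.g. "csv" from data.csv

MIME=$(xmllint --xpath "string(//entry[@ext='${EXT}']/@mime)" "$MAPFILE")
if [ -z "$MIME" ]; then
    echo "unknown extension: ${EXT} - consult PRONOM etc. and add it" >&2
else
    echo "$MIME"
fi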
Earlier versions of the harvest combine [11] had just two basic parts or modules (a hedged sketch of how the modules chain together follows the list below):
1) An XQuery module: a script run in the free BaseX processor for extracting file metadata via the Figshare API, with the original OAI-PMH METS feed metadata as the first input argument, from which the DOI is extracted to become the input argument for the API. The XQuery script is also used for splitting up METS feeds (usually comprising ten item records) into individual, distinct records, which then serve as the base for further processing into information packages. After this split-up, the individual item records are moved to new individual directories/folders of their own, where they end up together with the associated data files, which are later fetched through the same script by means of the BaseX fetch module.
2) An XSLT style sheet, used for the final transformation of original metadata to comply with the FGS (then still version 1.2).
Only later in the development was a third component added [12]:
3) BASH shell scripts: used first to "automatically" retrieve (by means of curl) and name the original OAI-PMH METS feed, and later, after the split-up of feeds performed in BaseX, to move the individual item records to new individual directories (or folders) of their own, created through the same script.
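A compact sketch of how these modules chain together for one feed is shown below. The script names are those mentioned in the text, while the command-line flags, bindings and file names are assumptions (the BaseX and Saxon invocations in particular may differ from our actual setup):

#!/usr/bin/env bash
# Hedged orchestration sketch for one feed. Script names are those given in
# the text; command-line flags, file names and argument passing are assumed.

./figFeedFirst.sh                                    # module 3: fetch and name the feed
FEED="figsMETSfeed85-api20240215until20240318.xml"   # name echoed by that script

basex -b feed="$FEED" extractFigsFileInfo.xq         # module 1: split the feed into items
./dir-mvOrigMDfig.sh "$FEED"                         # module 3: one folder per item

basex -b item=item.xml extractFigsFileInfo.xq        # module 1 again: file metadata + data files (binding assumed)
java -jar saxon-he.jar -s:item.xml -xsl:figMETS2fgs.xsl -o:METS.xml   # module 2: FGS METS.xml (jar/file names assumed)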
The latest version of the harvest combine published on Zenodo is from 2020 and still covers only our Figshare instance. Since then, further development and "automation" have been introduced, e.g. the inclusion of another BASH shell script, figFeedFirst.sh, for retrieval and consistent naming of XML feed files by the time interval covered, that is, the first and last publishing dates of the items included in the feed. A typical example looks like this:
figsMETSfeed85-api20240215until20240318.xml
An overview of the scripts that make up the harvest combine for Figshare at Stockholm University is shown in Fig. 1 below. As the file name suggests, the plain text file figsMETSfeedsURLnrList.txt is simply an ordered list of URLs, with the latest on top, used for retrieval of METS feeds holding, as a rule, ten items each, published during a certain time interval at our SU Figshare portal (su.figshare.com).
The figFeedFirst.sh BASH shell script then operates on this list, taking the latest URL as an 'argument' for retrieval of the corresponding feed by means of curl, and at the same time creates a named local folder that will hold the feed and all the subsequently created individual item folders, with both metadata and data files, contained in the feed. The script itself is rather short, fifteen operative lines without comments, and a simplified sketch of it follows the example output below. The output on screen is even shorter, with the last two lines echoing the names of the newly created folder and the retrieved METS feed XML file, e.g.:
figsMETSfeed85pacs
figsMETSfeed85-api20240215until20240318.xml
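In outline, and heavily simplified, the script does something like the following; the derivation of the folder and file names and the variable handling are assumptions, not the actual fifteen lines:

#!/usr/bin/env bash
# Simplified sketch of figFeedFirst.sh: take the latest URL from the list,
# retrieve the feed with curl, create a folder for it and echo the names.
# The naming logic shown here is an assumption, not the actual script.
URL=$(head -n 1 figsMETSfeedsURLnrList.txt)          # latest URL on top of the list

FEEDDIR="figsMETSfeed85pacs"                         # in reality derived from the URL
FEEDFILE="figsMETSfeed85-api20240215until20240318.xml"   # and from the feed's date interval

mkdir -p "$FEEDDIR"
curl -s "$URL" -o "$FEEDDIR/$FEEDFILE"

echo "$FEEDDIR"
echo "$FEEDFILE"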
Next, the XQuery script extractFigsFileInfo.xq is used in the BaseX processor, first simply to split up the feed into ten individual items. After this we return to Git BASH and, reusing the variables from the last call that are still on screen, invoke the dir-mvOrigMDfig.sh script in order to put each of the individual items split off from the feed, each named by its corresponding DOI suffix, in a similarly named folder of its own. At the same time, the required validation schemas are retrieved and stored in a subfolder named schemas in each item folder.
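A hedged sketch of this step in BASH might look as follows; the item file naming pattern is a placeholder, and only one of the required schemas is shown:

#!/usr/bin/env bash
# Sketch of dir-mvOrigMDfig.sh-like behaviour: move each split-off item
# record into a folder of its own (named by DOI suffix) and fetch the
# validation schemas into a schemas/ subfolder. The naming pattern is a
# placeholder; only one schema is shown.
FEEDDIR="$1"                                         # e.g. figsMETSfeed85pacs

for ITEM in "$FEEDDIR"/*.xml; do
    case "$ITEM" in *figsMETSfeed*) continue ;; esac # skip the feed file itself
    SUFFIX=$(basename "$ITEM" .xml)                  # DOI suffix used as folder name
    mkdir -p "$FEEDDIR/$SUFFIX/schemas"
    mv "$ITEM" "$FEEDDIR/$SUFFIX/"
    curl -s -o "$FEEDDIR/$SUFFIX/schemas/mets.xsd" \
         "https://www.loc.gov/standards/mets/mets.xsd"
done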
The last steps in this processing (represented by numbers 5 and 6 in Fig. 2 above) involve using the XQuery script extractFigsFileInfo.xq again for each item, now to obtain the required file metadata (e.g. checksums and original file names) and local custom metadata (such as the researcher's department at SU) simultaneously, by means of the Figshare API. Most importantly, this is also when the actual data files are fetched. The final step of this process is applying the XSLT stylesheet figMETS2fgs.xsl to each item, together with its newly created file metadata XML file, file_info.xml, which is used as a parameter document in the transformation. This, then, is what transforms the metadata of each item package to the current FGS standard and creates the new METS.xml metadata file (formerly named sip.xml). The transformation also makes use of the above-mentioned mapping from file extensions to mime types, which formerly was part of the XSLT script proper but now resides in a master XML file serving as another parameter document in the process, a copy of which is also included in each package as part of the provenance documentation. This ever-evolving mapping XML file, filext2mimetypeMapMAIN.xml, has thereby also become "common property" of all subsequent new "modules" of the harvest combine, with those developed for Zenodo and Dryad next in line.
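As a hedged illustration of how this final transformation could be invoked from the command line (here with Saxon-HE as an example processor; the stylesheet parameter names fileinfo and extmap are assumptions, and Saxon's plus-prefix convention is used to pass the parameter files as parsed documents):

#!/usr/bin/env bash
# Hedged sketch of the final transformation step for one item package.
# Parameter names (fileinfo, extmap) and file names (origMETS.xml,
# saxon-he.jar) are assumptions; Saxon's "+" prefix passes documents.
ITEMDIR="$1"                                         # one item package folder

java -jar saxon-he.jar \
     -s:"$ITEMDIR/origMETS.xml" \
     -xsl:figMETS2fgs.xsl \
     -o:"$ITEMDIR/METS.xml" \
     +fileinfo="$ITEMDIR/file_info.xml" \
     +extmap=filext2mimetypeMapMAIN.xml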
It seemed like a natural second step after Figshare to develop our harvest combine also for Zenodo, where we have a "community" named Stockholm University Library, which is, like our Figshare instance, open to all researchers at SU. However, unlike Figshare, until recently there was no possibility of actively curating records in Zenodo before they were published. This meant there was actually no quality control of metadata from our side; all we could do was either accept or reject records for inclusion in our community. It was still desirable to have as many records as possible from our SU users of Zenodo as part of our community, where we can more easily keep track of them and their scientific output.
Another difference between Figshare and Zenodo, of particular importance for the possible adaptation of the harvest combine, was the fact that Zenodo did not offer any OAI-PMH feeds in METS format. We then settled instead for DataCite (metadataPrefix=datacite), one of the most universal export formats of data repositories. This also involved a substantial shift in the XSLT stylesheet, notably in the descriptive metadata section (dmdSec) of the output METS.xml file: from the default metadata type Dublin Core (@MDTYPE="DC") in the Figshare METS feeds to DataCite (@MDTYPE="OTHER"), which still lacks a representation of its own in the METS specification. However, this in itself did not constitute any problem. In fact, just as for Figshare METS feeds with Dublin Core in the dmdSec, it still allowed large parts of the original Zenodo OAI-PMH feeds with DataCite metadata simply to be copied (by means of <xsl:copy-of>) to the resulting dmdSec in the output METS.xml file. In other respects, the Zenodo module of the harvest combine was fashioned after its Figshare predecessor, in particular for the fileSec and structMap elements in the resulting METS.xml. The use of the scripts for Zenodo is also similar to that for Figshare, except that there is (as yet) no automatic "feed fetcher" like figFeedFirst.sh, and instead of a dual-purpose extractFileInfo script there is a separate miniSplit.xq script for splitting items out of a downloaded feed. The most recent public version of the harvest combine, including the scripts for Zenodo, is now on GitHub.[13]
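For Zenodo, the corresponding feed retrieval and split-up can be sketched like this; the community set name and the BaseX variable binding for miniSplit.xq are placeholders/assumptions, while the base URL and metadataPrefix are those of Zenodo's OAI-PMH interface as used above:

#!/usr/bin/env bash
# Hedged sketch of retrieving a Zenodo OAI-PMH feed in DataCite format for
# one community and splitting it into items. The set name and the variable
# binding for miniSplit.xq are placeholders/assumptions.
SET="user-xxxxx"                                     # placeholder for our community set

curl -s "https://zenodo.org/oai2d?verb=ListRecords&metadataPrefix=datacite&set=${SET}" \
     -o zenodoDataCiteFeed.xml

basex -b feed=zenodoDataCiteFeed.xml miniSplit.xq    # split the feed into individual items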
Only recently did we start to develop a corresponding module also for Dryad (datadryad.org), as we discovered that many of our SU researchers, particularly within ecology, environment, plant science and zoology, use this repository quite extensively. In fact, it seems that the number of research datasets from SU researchers in Dryad is actually larger than the corresponding number in our local Figshare instance. Dryad also offers a convenient way of retrieving all dataset records in its database with at least one SU-affiliated author or creator, by means of an API call with our SU ROR ID as an argument.
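The ROR-based retrieval can be sketched as follows; the API path, parameter names and JSON field names are assumptions on our part, the ROR identifier is left as a placeholder, and jq is assumed to be available:

#!/usr/bin/env bash
# Hedged sketch of querying the Dryad API for all datasets with at least one
# SU-affiliated creator. API path, parameter names and JSON field names are
# assumptions; the ROR identifier is a placeholder.
ROR="https://ror.org/xxxxxxxxx"                      # SU ROR id (placeholder)

curl -s -G "https://datadryad.org/api/v2/search" \
     --data-urlencode "affiliation=${ROR}" \
     --data-urlencode "per_page=100" \
  | jq '.total'                                      # total number of hits (field name assumed)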
However, there are other difficulties and problems with Dryad, some of which we have not met before. First, since SU is not a member organization of Dryad, we have, as was the case with Zenodo until recently, no possibility of curating datasets before they are published. This may not be such a big issue here, though, given that Dryad staff perform their own curation, which seems to vouch for fairly good quality anyway. Secondly, however, Dryad records do not seem to follow any common metadata standard, nor are any metadata standards offered as export formats, as far as we have found. This necessitated an extensive mapping effort when creating the XSLT stylesheet for transformation to METS – again with DataCite in the dmdSec.
At the time of writing, when we had only just started the development of the harvest combine for Dryad, more than 370 datasets had already been deposited and published in Dryad by SU-affiliated researchers over the years. This implies that we have a substantial backlog of records and datasets to harvest, download and transform, and we have only just begun to work our way through it. The processing is still slow, as we keep encountering new, unforeseen stumbling blocks on the way. Several of the Dryad datasets have proved to contain a very large number of separate data files; as many as 180 files (tif images) were found in one dataset recently. The problem in these cases seemed to be that the file metadata "feeds" retrieved by means of the API only returned a limited number of files at a time. Nevertheless, by means of further development of our scripts and a deeper insight into the Dryad API, we now get all the necessary file metadata attributes into the fileSec of the METS.xml output file, even in cases with a large number of files. However, the file fetching process for larger datasets (in the GB range, not necessarily those having many files) is still time-consuming and in some cases has to be carried out "manually".
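To cope with datasets that have many files, the file metadata therefore has to be collected page by page. A generic, hedged sketch of such a loop follows; the endpoint, parameter names, page size and JSON field names are assumptions, not the actual Dryad calls in our scripts, and jq is assumed to be available:

#!/usr/bin/env bash
# Hedged, generic pagination sketch for collecting file metadata for one
# Dryad dataset version. Endpoint, parameter names, page size and JSON
# field names are assumptions; the version id is a placeholder.
VERSION_ID="123456"
PAGE=1
: > file_info_raw.json                               # collect all pages here

while true; do
    RESP=$(curl -s "https://datadryad.org/api/v2/versions/${VERSION_ID}/files?page=${PAGE}&per_page=100")
    COUNT=$(echo "$RESP" | jq '.count // 0')         # files on this page (field name assumed)
    [ "$COUNT" -eq 0 ] && break
    echo "$RESP" >> file_info_raw.json
    PAGE=$((PAGE + 1))
done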
The complex structure of the Dryad metadata records, without any common metadata standard and lacking, at least until recently, required file metadata attributes such as checksums, has made it necessary to add yet another shell script to the harvest combine stack before the final XSLT transformation can take place. Yet we are confident that we will finally get to the point where all of the now more than 370 Dryad datasets have been harvested, their metadata records transformed to valid METS.xml files, and everything safely deposited as preparatory SIPs in our local archive, to be monitored for long-term preservation and later further transformed into AIPs and DIPs.
Naturally, digital preservation is not only about the sometimes tedious daily work of collecting, recording, transforming, validating and storing information items. It is perhaps most of all about looking into the foreseeable future, preparing for what may come sooner than we might think, doing risk assessment and closely monitoring which file formats might soon be subject to obsolescence.
For this purpose, too, we have only recently formed a cross-sectional Digital Preservation Group at Stockholm University Library, whose first task is to create an inventory of all (external or internal) source information systems, such as data repositories, together with their local destination archival or storage systems. Here we will also register information about common file formats used, metadata standards employed, identifier types (for persons, organizations and information items), collection methods, processing tools, validation schemas and transformations that are associated with each source system and target storage or archival system. This inventory, then, is intended to serve as a basis for a digital preservation plan for future long-term archiving.
The still evolving harvest combine at Stockholm University harvests data files and transforms original metadata from several sources and repositories. It is a semi-automatic software tool, a collection of scripts for retrieval and transformation of research datasets (and some other research outputs, such as reports, software scripts and presentations), that still requires a human driver to run it. It prepares for long-term preservation in a local archive by enriching and transforming original metadata to accord with Swedish and European archival standards for SIPs – Submission Information Packages. It contains a continuously evolving mapping XML file that helps us keep track of which file formats we have in store. It uses only open (non-proprietary) file formats for scripts and in processing (JSON, XML, XSLT, XQuery). It is implemented locally using Oxygen XML Editor (licensed software), Git BASH for Windows (free) and BaseX (free). The newly formed Digital Preservation Group at Stockholm University Library has begun an inventory of "source systems", i.e. systems that might produce outputs in need of digital preservation. Together, following the Red Queen, all this will hopefully help us keep pace with the evolution (and degradation) of file formats and metadata standards, in order to, at least, "stay in the same place" while we keep on running!