
Cloudy Data with a Chance of Transfer: Towards SharePoint Transfer at UK Parliament

Published on Sep 09, 2024

Abstract – Since 2020 the day-to-day business of UK Parliament has moved to predominantly cloud-based ways of working and collaboration. It has moved away from an on-premises Electronic Document and Records Management System (EDRMS) with data stored on network and personal drives, to a cloud-based Microsoft 365 environment in which teams from both Houses store and share data in OneDrive and SharePoint. This move created a need for a solution for transferring information of archival value out of its cloud-based collaborative platform and into long-term storage and preservation whilst retaining the context within which it was created.

Transfer of information is complicated by the nature of cloud-based collaboration and storage in several ways. Firstly, when files can be viewed, edited, stored and shared by multiple users concurrently, it can be difficult to establish what the authoritative version of a file is. Secondly, with the proliferation of user-applied metadata tags and labels, the context in which the files were created forms as much a part of the archival record as the files themselves. Thirdly, there is the question of how to validate and authenticate files extracted from the cloud.

The preservation of this data moves away from preserving just the individual file to capturing the file and its contextual metadata. This paper describes the Parliamentary Archives' efforts to explore and test the transfer and authentication of archival data from the cloud and into their digital repository.

Keywords: cloud-based collaboration, SharePoint, transfer, workflows and processes, validation

This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Sofie Ruysseveldt, Eliane Ninfa Blumer, Sam Alloing and 1 anonymous reviewer. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.

Background

The Digital Preservation Coalition’s (DPC) Digital Preservation Handbook’s most recent edition calls digital preservation “the challenge of a generation” and goes on to say that “[d]igital collections can derive from laptops or desktops or smart phones; from tablets, souped-up servers or hulking great mainframes. They can be snapped at the end of a selfie stick or beamed from sensors deep in space; they can be generated by tills and cash machines, by satellites and scanners, by tiny sensitive chips and massive arrays [1].” Implicit in this statement is a single point of creation for digital collections using local infrastructure. But the pace at which digital technologies evolve means that this is often no longer the case.

More and more organizations are moving away from using and maintaining on-premises servers, systems and local networks towards cloud infrastructure for storage, sharing and collaboration. Digital preservation practitioners need to catch up with this changing digital landscape, which creates new challenges for organizations and their archives.

While this has made some things easier, more secure and more accessible, it has also created a set of unique challenges for archives and digital preservation. Infrastructure and software that is no longer maintained, hosted or managed locally has moved the archivist further away from the point of creation. Data is dispersed in the cloud, with infrastructure procured as a service from third-party providers. Additionally, the data itself is being created, shared, used and organized differently.

Much of the digital preservation advice and guidance currently available is focused on the preservation of the individual file or set of files originating from a local device or system. These files are copied and transferred using hash values to establish their fixity, data integrity and the chain of custody from depositor to repository. Hash values are digital “fingerprints” generated by hashing algorithms. Good hashing algorithms reduce the likelihood that a fingerprint is replicated. According to the Digital Preservation Coalition they can be used for:

  • Detection of data corruption or loss when data is stored, for example when keeping data on disks, tapes, USB drives or in the cloud.

  • Detection of data corruption or loss when data is transferred, for example when sending files over the internet or uploading/downloading them from a server.

  • Detection of deliberate and malicious attempts to alter the contents of data, for example detecting if someone has deliberately tried to tamper with a document.

  • Detection of identical copies of the same data in different files or objects, for example looking for the same contents in multiple files that have different names or locations.

  • Confirmation that handover of one or more files between people or organisations has been done successfully, for example verification that a set of files has been successfully transferred to an archive by a depositor.

  • Making an inventory of data in preparation for future preservation or archiving activities, for example when making a record of the contents of a hard drive before the contents are extracted and ingested into an archive.

  • Making strong and long-term assertions on the integrity and authenticity of data, for example as part of guarantees or testaments that data hasn’t changed over time [1].

While non-cryptographic hashing algorithms are suitable for many of the criteria listed above, for “detecting malicious tampering of data, or when checksums are used to unambiguously locate multiple files that have exactly the same contents [1]” a strong cryptographic hashing algorithm such as SHA256 or SHA512 is recommended. SHA256 is available in DROID [2], a file format identification tool used by many in the digital preservation sector and integrated into the workflows of Preservica, a popular commercial digital preservation software solution.
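To illustrate what such a fingerprint looks like in practice, the following is a minimal Python sketch (our own workflow uses DROID rather than custom code, and the filename is purely illustrative) that generates a SHA256 digest for a single file:

```python
import hashlib
from pathlib import Path

def sha256_fingerprint(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA256 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # Any byte-level change to the file produces a completely different digest.
    print(sha256_fingerprint(Path("example.docx")))
```

Comparing the digest of a copy against the digest of the original is enough to confirm that the copy is byte-for-byte identical.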

With organizations using cloud infrastructure—and increasingly cloud-based collaborative platforms—long-term preservation of data becomes more difficult, as data needs not only to be extracted from these platforms but also to have its fixity established.

Context

At UK Parliament, the initial roll-out of Office 365 in 2013/14 allowed teams to opt in to using SharePoint 2013, an on-premises collaborative workspace for organizations. However, most files were still stored on traditional local network and personal drives and managed through an Electronic Document and Records Management System (EDRMS), SPIRE, implemented the year before. It wasn’t until SPIRE was reaching end of life in 2016 that the decision was made to decommission it and fully replace it with the Microsoft 365 (M365) ecosystem. M365 was chosen for its ability to meet Parliament’s requirements for auditing, custom metadata and versioning. Additionally, it was hoped that its records management tool, Compliance Centre (now Purview), could be used for future disposal and export needs.

The decommissioning of SPIRE saw the mass export of files from the EDRMS, some half a million of which needed cataloguing and ingest into Preservica, the Parliamentary Archives’ digital repository system. It also kicked off a multi-year project (2017-2020) to move Parliament away from network sharing and personal drives hosted on aging on-premises servers to a predominantly cloud-based way of working and collaboration in M365 and SharePoint Online.

The Information and Records Management Service (IRMS) at Parliament developed, and keeps up to date, a robust classification scheme and disposal policy in the Authorised Retention and Disposal Policy (ARDP) [3].

The ARDP outlines retention schedules and disposal instructions for all information held at Parliament. “Information [is] retained only for as long as it is required to support the Houses in meeting their business requirements and legal obligations, for reference or accountability purposes, or to protect legal and other rights and interests [4].”

The ARDP was used to structure and organize the file and folder structures in SPIRE, and IRMS worked closely with Parliamentary Digital Services (PDS) to map these legacy file and folder structures to new SharePoint site libraries with user-defined metadata columns. A network of Records Officers (ROs) embedded in teams provides targeted support and guidance on the application of the classification scheme and appropriate retention labels in site libraries.

Data in SharePoint that reach the end of their retention period are automatically marked for disposal. However, Parliament’s low risk appetite means that all data are reviewed before disposal, with a decision made either to save them for permanent preservation and transfer them to the archives (and consequently the digital repository) or to delete them. Manual review means that deletion happens slowly and, before this project, transfer to the archives did not happen at all.

Upon transfer to the archives, all information should be prepared, catalogued and appropriately stored. If digital in format, it should be ingested into the digital repository.

Why Not Wait For A Solution?

Business Need

The Parliamentary Archives were increasingly being approached by teams asking about information that they had ready to transfer to the archives. Having this data sitting in their site libraries was making it harder for teams to find the current business information they needed amidst old and out-of-date information.

Increasing Backlog

There are approximately 150 transfer instructions in the retention policy and 1000 SharePoint libraries with a transfer or joint disposal instruction. This is about a third of all Parliament’s SharePoint libraries. This building backlog of data has reached the point where it is adversely affecting the searching and filtering functionality in SharePoint.

Proactive vs. Reactive

Dealing with the previous EDRMS’ decommissioning took hundreds of working hours for just the export of the records. Cataloguing and ingesting them into the digital repository was a multi-year project that is only now nearing its end. This was for an EDRMS that was in place from 2012/13-2018/19. The M365 ecosystem (and SharePoint) has been in place for nearly as long and the shift to fully remote working during the Covid-19 pandemic has meant that data in our cloud systems are growing at a pace that needs to be dealt with sooner rather than later.

All these drivers pointed to the need for us to tackle transferring data of archival value out of SharePoint.

In 2022, when we looked at what others in the archives sector were doing, we found that very few had carried out any transfer from cloud-based collaborative systems.

We were aware of work by colleagues at the National Archives on transferring documents from Google Workspace; however, discussions with S. Daly and R. Gardner in December 2022 found that their approach could not be applied to SharePoint.

Preservica released a commercial solution, Preserve365 [5], in January 2024, but the demos we saw before we embarked on our project did not seem to offer the functionality we needed. Preserve365’s implementation moves files marked for the archives directly from SharePoint into Preservica whilst they remain searchable and accessible within SharePoint. This did not fit our collections management model, which holds our “master” catalogue independently of our digital repository.

We were also in the process of moving our collections over to the National Archives and reassessing the digital preservation and cataloguing requirements and functions that would remain in house. Because of this future uncertainty, we wanted to pursue a transfer process that was as systems-agnostic as possible.

The Digital Preservation (DigiPres) team and IRMS decided to collaborate on a project to explore using SharePoint’s native functionality to export data from its ecosystem and build working processes that could lay the foundation for future scalability.

Project: SharePoint Transfer

We saw the need for an in-house solution to transfer data of archival value out of the cloud and into long-term storage and preservation, reducing the bloat in our system. The traditional process that we use for transferring and ingesting files into our digital repository is not fit for this purpose. Our previous born-digital record transfers had meaningful folder structures arranged to follow well-defined team, function, dating and naming conventions. The ability in SharePoint to tag and arrange files through user-applied metadata labels and instructions means that the contextual metadata around our files is as important for the archival record as the files themselves, and we need to preserve this data.

For this project we aimed to build a trusted process for transferring data out of SharePoint and into our digital repository, Preservica, that we could then scale up. Our goal was first to find a method by which we could validate extracted data and show that it retained its contextual metadata from SharePoint. We would then begin integrating SharePoint transfer into our business-as-usual practices by adapting and refining our existing workflows and processes.

Method

The biggest challenges we face with data in a cloud-based collaborative ecosystem are: 1) defining what the authoritative version of a record is, when the point of cloud-based collaboration is that files can be shared and edited widely and concurrently; 2) extracting data from the cloud with metadata intact; and 3) being able to validate that the data we put in the repository is identical to what is in the system we extracted it from.

Defining the “Authoritative” Record

The design of cloud-based collaborative systems like SharePoint encourages open and concurrent sharing and editing of files. This is one of SharePoint’s key benefits as an EDRMS: colleagues can easily co-author files, and users have a great deal of flexibility and control over the organization of information and the application of user-defined metadata.

This is great from a collaborative and business point of view, but challenging when it comes to identifying what information needs to be transferred to the archives and where it is, and, once it is found, which version is the authoritative one.

Some of this is addressed by good information management practices at the organizational level. With our previous EDRMS, IRMS worked with business units to design fileplan structures, agree on access permissions, and tidy up shared drives. Comprehensive guidance around categorizing information, filesharing and general good data hygiene was provided, continually updated and disseminated. IRMS and Productivity and Collaboration (P&C) also implemented a disposal process with clearly defined disposal and transfer instructions and organization-wide retention labels built into site libraries and embedded into SharePoint’s design. This disposal process is supported by the Authorised Retention and Disposal Policy (ARDP), which provides instructions on how long to keep information, when information can be destroyed, and when information should be transferred to the Parliamentary Archives. It is split into areas representing key functions of Parliament, with further subdivisions by activities and descriptions.

All information in SharePoint has a retention schedule decided at the beginning of its lifecycle by teams with the help of IRMS and the ARDP. There is a one-to-one correspondence between libraries and retention instructions. Teams also have a nominated RO who receives regular training, communication and assistance from IRMS to make sure that retention schedules and classifications are followed. Version control is in place for data held in the libraries, and teams are encouraged to share files through viewing and editing rather than sending out multiple copies to proliferate in the wild.

When files reach the end of their retention period, they are flagged for transfer. After a consultation with teams, the site library is locked down and files are reviewed for disposal: either permanent deletion or transfer to the archives.

Close collaboration and management of relationships with PDS, P&C and the site owners ensures that site structures, retention schedules and information management practices are adhered to. While not perfect, we can at least know that what we are transferring to the archives is what the information asset owners (IAO) have agreed is the authoritative version, according to a set of well-defined and established principles.

Initial Pilot Design and Criteria

What criteria do we use to technically validate data in a cloud-based environment and how do we extract that data from it?

For the pilot we looked at ways to validate data in SharePoint and extract files with their contextual metadata intact. To do this we defined our criteria and then tested a set of validation and extraction methods using SharePoint’s native functionality. The decision to use native functionality was driven by a desire to keep things as simple as possible in the initial testing.

First we defined the properties and metadata we wanted to check in order to meet our criteria. This involved generating hash values for files, checking commonly altered metadata (e.g. creation date and last modified), and finally a visual check of the extracted data.
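As a sketch of the metadata check, the snippet below compares the last-modified value recorded for each file in a SharePoint library export against the timestamp carried by the extracted copy. The CSV layout and the 'Name' and 'Modified' column names are assumptions made for the example rather than a description of our actual export.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

def check_modified_dates(metadata_csv: Path, extracted_dir: Path) -> list[str]:
    """Flag extracted files whose last-modified time differs from the value recorded in SharePoint.

    Assumes a CSV export of the library with 'Name' and 'Modified' columns, the latter
    holding ISO 8601 timestamps; these column names are hypothetical.
    """
    problems = []
    with metadata_csv.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            local = extracted_dir / row["Name"]
            if not local.exists():
                problems.append(f"missing from extraction: {row['Name']}")
                continue
            recorded = datetime.fromisoformat(row["Modified"])
            if recorded.tzinfo is None:
                recorded = recorded.replace(tzinfo=timezone.utc)  # assume UTC if not stated
            actual = datetime.fromtimestamp(local.stat().st_mtime, tz=timezone.utc)
            if abs((actual - recorded).total_seconds()) > 2:  # small tolerance for rounding
                problems.append(f"modified date mismatch: {row['Name']}")
    return problems
```

Creation dates can be checked in the same way where the filesystem exposes them reliably.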

We then looked at how we could test these methods qualitatively and quantitatively. For each chosen method we looked to see whether it could give a reproducible result against the criteria defined previously. This would demonstrate the fixity of the data we were extracting. Finally, we would import the data back into its native environment to see whether the files retained their contextual information intact (e.g. whether they carried their user-applied views, labels and site library permissions with them in their metadata).

Validating the Transfer Process

We started with a mini-pilot to test our core transfer process on four folders from our own team’s SharePoint site. The first property we needed to check was our data’s hash values. Typically, to quantitatively validate data we run a hashing algorithm against the original files, which generates a hash value that acts as a unique alphanumeric “fingerprint”; any change to the data would change its fingerprint. When data is ingested into our digital repository, we use that hash value to ensure that the data transferred is the same as the original and for continued validation of data integrity within the repository. One of the issues we encountered when transferring data was the difficulty of authenticating data in this way against that hosted in a cloud-based environment.

With local files or files on removable media, we run a DROID report to capture the hierarchical folder structure, identify file types and generate hash values to verify that copies made are identical to the original. DROID generates these hash values with either SHA256 or MD5; the DPC recommends using SHA256, a strong cryptographic hashing algorithm, to guard against deliberate data tampering [1]. Files stored in the SharePoint environment are not easily checked in this way. SharePoint and the wider M365 environment use a non-cryptographic hashing algorithm designed by Microsoft for checking file integrity [6] [7]. Designed specifically for the needs of the M365 environment, it is built for speed, efficiency and low computational cost. Because of this design it is less unique, more prone to collisions and more vulnerable to tampering than what the DPC recommends, and it does not offer the level of data integrity assurance we need for long-term preservation. It is also not supported by our digital repository, which uses DROID for integrity and fixity checking. For our pilot we chose SHA256, a commonly accepted standard in the digital preservation sector supported by DROID, as the hashing algorithm we would use to validate our files. However, SharePoint does not allow executables to be run in its environment, so the hashes could not be generated in place.

As we could not validate what is in the SharePoint ecosystem directly, we had to find a way to validate it indirectly. This was done by repeating extractions of the same files in different environments and seeing whether the hash values changed. If they did not, we could quantitatively demonstrate that the same files were being extracted each time. Once we were able to do that, we could check the files against our other criteria to ensure that the data had been extracted in its original form with its contextual metadata intact.
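A minimal sketch of this indirect check, assuming two independent extractions have been placed in separate staging folders (the folder names below are hypothetical), reduces each extraction to a manifest of relative paths and SHA256 values and compares the two:

```python
import hashlib
from pathlib import Path

def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA256 digest."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Files are read whole for simplicity; chunked reading suits larger files.
            result[str(path.relative_to(root))] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result

def compare(first: Path, second: Path) -> None:
    """Report files whose digests differ between two extractions of the same library."""
    a, b = manifest(first), manifest(second)
    for rel in sorted(set(a) | set(b)):
        if rel not in a or rel not in b:
            print(f"only in one extraction: {rel}")
        elif a[rel] != b[rel]:
            print(f"hash mismatch: {rel}")

if __name__ == "__main__":
    # Hypothetical staging folders holding two independent extractions of the same files.
    compare(Path("extraction_run1"), Path("extraction_run2"))
```

In practice the manifests would come from DROID reports, but the comparison logic is the same.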

If this could be done, we would then develop a process and workflows that used our current disposal processes as a springboard. This meant that we would have a process that was already somewhat familiar to the wider organization, while also giving us the opportunity to streamline and refine processes that had developed ad hoc over the years. The aim was to start small and develop and refine the process iteratively, incorporating lessons learned from each stage of the project.

Results

Initially, we tested the regular process of downloading files from SharePoint onto a local device. We ran DROID reports on the files in our staging area, downloaded them a second time from SharePoint and found that the hash values did not match across the downloads. We also compared their metadata for creation date and last modified against what was shown in the SharePoint environment and found that these did not match either.

Table 1: Comparison of SHA256 hash values from different accounts on different local devices

Once we had an extraction method whose hash values matched across repeated extractions from different accounts and devices (Table 1), we were able to move on to our other validation criteria. We saw that the metadata for creation date and last modified matched that in SharePoint. Visual checking of files against those in SharePoint showed the same layout and formatting; the files were visually identical. With files on our local devices we could only assess the individual file and its technical metadata in our DROID reports, so our last check was to import the files back into the SharePoint cloud environment and see whether they retained their user-applied metadata labels, views and retention instructions. These were imported back in successfully, and we could be assured that the files’ contextual metadata was saved as part of the file.

Ongoing Work

With our core transfer process in place, we have moved on to larger site libraries, testing a transfer workflow adapted from the one already used for born-digital records transferred via local devices and removable media. Our largest test to date has been on the Parliamentary Archives “Control and Disposal” library, which held circa 800 files. This phase of the project aimed to test the full process, including communications, reporting, transfer, cataloguing into our collections management system CALM, and ingest into Preservica. This would exercise our process on a live library to see how it worked at a larger scale, with the greater variety of file formats we would be likely to encounter in a regular site library.

Figure 1: Transfer Workflow

Working Process

Figure 1 shows the steps:

  • IRMS engages with teams on disposal/transfer of information and assists in identifying information for transfer

  • IRMS sends communications to the embedded RO to complete a Transfer Authorisation Report (TAR) (see Figure 2)

  • IRMS restricts access to the site once the TAR is submitted

  • TAR processed and assigned an intake number

  • Data exported from SharePoint

  • Data and TAR transferred to DigiPres

  • IRMS updates the intake log and informs the IAO/RO

  • DigiPres creates an accession record in the catalogue

  • DigiPres prepares data for ingest

  • Metadata mapped to schema (a sketch of this step follows Figure 2)

  • Catalogue record stubs created

  • Data mapped to record hierarchy

  • Data ingested into digital repository

  • Data deleted after ingest

  • DigiPres informs IRMS that data has been successfully ingested

  • IRMS deletes data from the site library

Figure 2: Transfer Authorisation Report
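To give a flavour of the “metadata mapped to schema” and “catalogue record stubs created” steps, the sketch below turns rows of an exported SharePoint metadata file into minimal stub records. The column names and target fields are hypothetical and do not reflect the actual CALM schema or our full mapping.

```python
import csv
from pathlib import Path

# Hypothetical mapping from SharePoint export columns to catalogue stub fields;
# the real mapping to our CALM schema is more extensive.
FIELD_MAP = {
    "Name": "Title",
    "Created": "DateCreated",
    "Modified": "DateModified",
    "Retention Label": "RetentionLabel",
}

def build_stubs(export_csv: Path) -> list[dict[str, str]]:
    """Create one catalogue record stub per row of the exported library metadata."""
    stubs = []
    with export_csv.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            stubs.append({target: row.get(source, "") for source, target in FIELD_MAP.items()})
    return stubs

if __name__ == "__main__":
    # Hypothetical filename for a library metadata export.
    for stub in build_stubs(Path("library_metadata_export.csv")):
        print(stub)
```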

Conclusion

This paper covers the pre-project planning and initial phase of a longer-term project in Parliament to tackle the transfer of data of archival value out of the cloud and into long-term storage and preservation. With our mini-pilot we built a core transfer process that we could use to validate extracted data quantitatively and qualitatively, and we tested it on a small scale. We then mapped out and have begun a five-phase project to embed SharePoint transfer into our business-as-usual (BAU) work.

Phase 1, which this paper covers, saw us adapt current processes and workflows for transferring born-digital records from local storage and removable media and test them on larger site libraries with more varied files. Testing was focused on libraries that were not in active use. This testing was accomplished using only the native functionality in SharePoint and provides a method of transfer from a cloud-based collaborative platform that can be carried out without additional commercial tools or products. It does rely on a well-established culture of information governance and on engagement, communication and collaboration with IAOs, and it is very much geared towards an archive and information management team that is embedded in its organization. It is also still a very manual process, though we see a great deal of potential for scalability with the addition of simple scripting tools, which we are looking to develop in later phases.

The future direction of our research and areas we want to develop will be explored in the next phases of our project as follows:

Phase 2, due to begin in mid-to-late 2024, will see DigiPres and IRMS work with teams on live sites and libraries. The plan is to work with them while refining communication lines, guidance and workflows. The focus will be on analyzing processing times, challenges encountered in the process, lessons learnt and opportunities for improvement. Concurrently, we would like to explore the possibility of automation and scaling up, with custom scripts for populating record templates with metadata and by leveraging Power Apps to simplify metadata extraction and OneDrive Sync.

Phase 3 will target teams with more complex metadata labelling. We will work with them to see how they are using these metadata labels and how best to capture them meaningfully in catalogue records, and we will refine processes and explore how the information in metadata labels can be linked.

Phases 4 and 5 will see us approach teams proactively, work on communication and processes, and embed them into our business-as-usual activities.
