Training AI Models From Within A Digital Preservation System

Ensuring Machine Learning approaches can learn from users

Published on Aug 30, 2024

Abstract – Artificial Intelligence and Machine Learning tools are highly dependent on the datasets that are used to train them. Building datasets for general-purpose use is something that Commercial-Off-The-Shelf tools and services have done well, but they are not necessarily tuned to domain-specific needs. Creating custom datasets for this purpose is often possible with these services, but doing so can be complex or impose significant demands on an organization seeking to do it.

This paper describes work undertaken to allow the refinement of datasets and re-training of custom models to be performed through simplified user interfaces that can be directly embedded within an existing digital preservation system.

Keywords: Artificial Intelligence, Machine Learning, Datasets, User Interfaces

This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Dr. Sven Lieber, Klaus Rechert, Leontien Talboom and Panagiotis Papageorgiou. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.

Introduction

Artificial Intelligence (AI) refers to the broad field of technologies that enable machines and computers to mimic functions and behaviors that are associated with human intelligence. Machine Learning (ML) is a subset of this field, referring to technologies that specifically change how they perform over time as they are exposed to more data [1]. There are many good primers, educational resources and introductions to AI, ML and related concepts (e.g. [2], [3] and [4] among others), and we do not attempt to add to that body of work in this paper.

Recent years have seen significant advancements in the field of ML and AI. This has produced an emergent ecosystem of open-source tooling as well as a competitive AI-as-a-service and AI product marketplace.

Archival practitioners are keen to make use of AI technologies to complement their activities [5]. There are opportunities to exploit AI throughout the archival lifecycle, including but not limited to appraisal, selection and sensitivity review, metadata enrichment, Personally Identifiable Information (PII) detection, handwritten text recognition and optical character recognition, AI-assisted search and much more.

As a theme, metadata enrichment alone has many potential outlets for established AI technologies including computer vision approaches for identifying and classifying entities within images and video, natural language processing for named entity recognition from textual works, the use of large language models to generate narrative summaries of data, AI-enhanced transcription of video and audio, AI-assisted translation services, and much more. Outputs from the above will enhance finding aids and help unlock the ability to infer relationships between disparate collections of records.

However, despite major advancements in the democratization of AI technologies, barriers to entry remain.

Many open-source tools for ML and AI can be complicated to set up, and Information Security policies that forbid the use of open-source tools within institutions remain all too common.

These base AI tools typically also require the provision of large-scale training data sets to train models that can then be used to evaluate new data, and producing such training data sets is a mammoth task.

Factors such as data quality, data diversity, data privacy and data scale all need to be considered and make producing such data sets a laborious endeavor. This is typically beyond the scope of an archival role and usually beyond even the heritage organization in which those roles exist.

Meanwhile, the various AI products and services that come pre-trained have different problems.

These are often “black-box” implementations, where users cannot see or know what data has been used to train the model, or the details of the models used. Although implementers may provide metrics about the model’s performance against established “benchmark” datasets, these can show very good overall performance even if the model has systematic flaws. These flaws might be as simple as underperforming under particular or sub-optimal lighting conditions, which are likely to just be inconveniences or caveats to widespread use.

However, these flaws can be substantially more sensitive, particularly when trying to detect humans and human features rather than just inanimate objects. Misclassification based on skin tone is one such well-documented flaw [6] [7] that has affected multiple products and services. These flaws may be much more fundamental to questions around whether the service or product can or should be used at all.

A further issue is that, whilst these models have been trained on huge data sets, this has necessarily been with generalized classifiers that may not be precise enough to optimally surface the information found within domain-specific record collections.

Consider a business archive with an extensive collection of visual assets with identifiable entities, including graphical images featuring historic branding, or photographs featuring key locations and people. Unless these brand identities and other entities have mainstream recognition, these entities may not have been precisely classified, nor may relevant assets even have existed within the training data consumed by the general-purpose tools.

Similarly, although a general-purpose model may be able to successfully identify a common object such as a car, it may not have sufficient specificity to distinguish brands, models or model years, which may be relevant information to a car manufacturer’s business archives.

In this work we have focused on the latter issue of whether a commercial-off-the-shelf (COTS) AI image analysis service can provide value to even a small archive with a collection of images of “uncommon” domain-specific objects. An optimal outcome in this case is one that enables archival practitioners to leverage the power of tools pre-trained on large-scale data, while additionally enabling them to enhance the training data in an intuitive and relatively low-effort manner, without needing to become experts in preparing data for training within such systems.

In this paper we present an approach for doing precisely this, entirely within the context of our established digital preservation system (DPS), Preservica (hereafter simply referred to as “the DPS”).

Concepts To Prove

We set about integrating the DPS with a COTS ML-based tool; in this case we chose Microsoft’s Azure AI Vision [8] (hereafter simply referred to as “AI Vision”).

This has a pre-trained base model (a “Transformer Model”) created for performing object detection (recognizing common objects in an image). The tool also performs caption generation (creating a human-readable sentence describing the contents of an image), which uses a combination of Convolutional Neural Networks to generate various tags and a language model to generate comprehensible sentences from these [9].

The base models underlying this are controlled by Microsoft, meaning end-users cannot improve performance or train them to recognize new types of objects. However, Microsoft do enable users to provide their own custom object detection model, which can be run at the same time. In order to do this, users must have a Microsoft Azure account, with the relevant permissions, and use Microsoft’s web-based Graphical User Interface (GUI), Vision Studio [10]. This GUI is a feature-rich application that presents a steep learning curve to new users. Coupled with the administrative load of managing Azure services, users and permissions, this is not a viable approach for allowing end-users to provide feedback or corrections to the output. For that, we wanted to demonstrate that the main steps involved in that process could be accomplished within the DPS itself.

We approached the problem in two phases. In the first we aimed to provide completely automated descriptive metadata generation for images based on common objects, allowing end-users to search images based on their contents. This would demonstrate that the DPS and the AI tool had sufficient Application Programming Interface (API) functionality to integrate.

In the second, we aimed to extend this approach by replicating a potential use case of a corporate archive wanting to flag instances of its corporate branding in an image collection. We did this by overlaying a custom domain model, trained to detect the Preservica company logo, and, crucially, by providing an end-user interface for supplying additional training data to improve model performance.

Automated Metadata Generation

To perform the integration, we created a new standalone web application, which could interact with both the DPS and AI Vision via each system’s public APIs. The role of this application was to retrieve content from the DPS, run AI Vision against it, and update the DPS with descriptive metadata based on the response.

We limited the application to only consider content as it arrived in the DPS, i.e. it did not have to deal with any backlog of historic content. This was achieved by configuring it as a listener/subscriber to Preservica’s Webhooks API [11], meaning it receives real-time “push” notifications as content is ingested into the repository. Using this notification, it was configured to query the DPS to determine the type of content ingested, and to act only if the content was an image.
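
A minimal sketch of what this listener might look like is shown below, assuming a Flask application; the webhook payload field name and the two helper functions are illustrative placeholders rather than Preservica’s actual API surface.

```python
# Minimal sketch of the webhook listener, assuming a Flask application.
# The payload field name and the two helpers are illustrative placeholders,
# not Preservica's actual API surface.
from flask import Flask, jsonify, request

app = Flask(__name__)


def lookup_content_type(entity_ref: str) -> str:
    """Placeholder: query the DPS for the MIME type of the ingested content."""
    return "image/jpeg"  # stub value for illustration only


def analyse_image(entity_ref: str) -> None:
    """Placeholder: download the image, call AI Vision and write metadata back."""
    print(f"queueing {entity_ref} for analysis")


@app.route("/webhook", methods=["POST"])
def on_ingest():
    event = request.get_json(force=True)
    entity_ref = event.get("entityRef")  # illustrative field name
    # Only act if the newly ingested content is an image.
    if entity_ref and lookup_content_type(entity_ref).startswith("image/"):
        analyse_image(entity_ref)
    return jsonify({"status": "accepted"}), 202
```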

The application would then download a copy of the image so that it could forward it to AI Vision, requesting both object detection and caption generation. The response that AI Vision provides is a JavaScript Object Notation (JSON) document containing a “caption”, and a list of detected objects (Figure 1). Each object contained metadata describing the type of object, the co-ordinates of the area of the image in which it was detected, and the AI system’s confidence that it had correctly identified the object (expressed as a numerical value between 0 and 1).

Figure 1 - Sample JSON response from Caption generation and object detection
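
As an illustration of this step, the request below uses the AI Vision REST interface directly via the `requests` library; the endpoint path, `api-version` and feature names reflect our reading of the Azure Image Analysis documentation and should be treated as indicative rather than definitive.

```python
# Sketch of the call to AI Vision, requesting caption generation and
# object detection. Endpoint path, api-version and feature names are
# indicative and may need adjusting for the API version in use.
import requests

AZURE_ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
AZURE_KEY = "<subscription-key>"                                         # placeholder


def analyse_with_ai_vision(image_bytes: bytes) -> dict:
    response = requests.post(
        f"{AZURE_ENDPOINT}/computervision/imageanalysis:analyze",
        params={"api-version": "2023-10-01", "features": "caption,objects"},
        headers={
            "Ocp-Apim-Subscription-Key": AZURE_KEY,
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
    )
    response.raise_for_status()
    # The JSON includes a caption plus a list of detected objects, each with
    # a label, a bounding box and a confidence between 0 and 1 (cf. Figure 1).
    return response.json()
```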

The application would then transform this JSON into Extensible Markup Language (XML) in a custom schema (as Preservica’s metadata handling requires XML documents). For this proof-of-concept, we performed minimal processing of the document. This XML document was then posted back to the DPS using its metadata update API [12].
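
The transformation itself can be very small. The sketch below shows one way it might look; the schema namespace and element names are invented for illustration (any schema registered with the DPS would do), and the response field names follow the shape of the AI Vision result (cf. Figure 1) as we recall it.

```python
# Sketch of the JSON-to-XML transformation. Namespace and element names are
# invented for illustration; response field names are indicative only.
import xml.etree.ElementTree as ET

NS = "https://example.org/ai-vision-metadata/v1"  # hypothetical schema namespace


def to_custom_xml(analysis: dict) -> bytes:
    root = ET.Element(f"{{{NS}}}ImageAnalysis")

    caption = analysis.get("captionResult", {}).get("text", "")
    ET.SubElement(root, f"{{{NS}}}Caption").text = caption

    for obj in analysis.get("objectsResult", {}).get("values", []):
        tag = (obj.get("tags") or [{}])[0]
        box = obj.get("boundingBox", {})
        element = ET.SubElement(root, f"{{{NS}}}Object", {
            "confidence": str(tag.get("confidence", "")),
            "x": str(box.get("x", "")), "y": str(box.get("y", "")),
            "w": str(box.get("w", "")), "h": str(box.get("h", "")),
        })
        element.text = tag.get("name", "")

    return ET.tostring(root, encoding="utf-8")


# The resulting document is then POSTed against the originating entity via
# the DPS metadata update API (call not shown here).
```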

On the DPS side, we registered the custom XML schema, which enabled us to define custom search indexing of metadata in that schema. This means that the caption itself and each of the detected objects can be indexed and used for free-text, filtered or faceted searching (Figure 2).

Figure 2 - Screenshot showing a search for a caption returning results for images

This provided an end-to-end demonstration of the automated generation of descriptive metadata for images that could be used as finding aids.

Custom Model And Feedback

As stated above, AI Vision allows a custom model to be supplied and exposes an API to trigger analysis by this custom model. Vision Studio provides a GUI to create an annotated training dataset and to initiate training of the model; AI Vision also exposes API endpoints to trigger all of these actions.

The dataset comprises a JSON object which lists Uniform Resource Identifiers (URIs) for a number of “annotation files”. Each of these uses the Common Objects in Context (COCO) [13] data format to encode details of the image sources and the locations of objects within them. Each annotation file may contain annotations for multiple images.
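
For reference, a COCO annotation file for a single image containing one labelled logo has roughly the shape sketched below (expressed here as a Python dictionary); the URL, ids and dimensions are placeholders.

```python
# Rough shape of a COCO annotation file for one image containing one logo.
# URLs, ids and dimensions are placeholders; "bbox" is [x, y, width, height]
# in pixels, and "area" is width * height.
import json

coco_annotation = {
    "images": [
        {"id": 1, "width": 1024, "height": 768,
         "file_name": "conference_photo_001.jpg",
         "coco_url": "https://<blob-account>.blob.core.windows.net/training/conference_photo_001.jpg"},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [412, 130, 96, 48], "area": 96 * 48},
    ],
    "categories": [
        {"id": 1, "name": "preservica_logo"},
    ],
}

print(json.dumps(coco_annotation, indent=2))
```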

We created an initial training dataset and custom model via the GUI, using a small sample (~5 images, cf. the suggested minimum of “2-5 images per category” from the documentation [14]) of photographs from our marketing and social media accounts, depicting Preservica attendance at various conferences and meetings. These contained a number of instances of the Preservica company logo under various lighting conditions and at various display angles. We then trained a custom Transformer Model using this dataset, and modified our application to invoke this custom model in addition to the default one.

The AI Vision response is a slightly extended form of the same JSON response from the previous section. This time, the results from the custom model are provided in the same way as from the default model, but in a separate JSON node, making it relatively trivial to distinguish between results from the two models.

Again, this was processed into an XML form and fed back to the DPS, meaning that we could now index results from the two models.

Because the metadata provides not only the labels but also enough co-ordinate information to locate each detected object within the image, we could incorporate this into the DPS’s in-built rendering, highlighting what AI Vision had detected. The separation in the metadata also meant the renderer could distinguish the two models’ results by drawing their bounding boxes in different colours (Figure 3).

Figure 3 - A screenshot of an image rendered in the DPS with objects detected by the general-purpose model drawn in red, and "Preservica logos" detected by the custom model in blue
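
The sketch below illustrates the colour-coding idea using Pillow as a stand-in for the DPS renderer; the node names used to separate the two result sets (`objectsResult` for the base model, `customModelResult` for the custom one) are our assumption about the response layout rather than a documented guarantee.

```python
# Illustration of the colour-coded overlay idea using Pillow. In the proof
# of concept the boxes were drawn by the DPS's own renderer; this standalone
# sketch only shows how the two result sets can be treated differently.
from PIL import Image, ImageDraw


def draw_detections(image_path: str, analysis: dict, out_path: str) -> None:
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)

    def draw_boxes(objects: list, colour: str) -> None:
        for obj in objects:
            box = obj.get("boundingBox", {})
            x, y = box.get("x", 0), box.get("y", 0)
            w, h = box.get("w", 0), box.get("h", 0)
            draw.rectangle([x, y, x + w, y + h], outline=colour, width=3)

    # General-purpose model in red, custom "Preservica logo" model in blue.
    draw_boxes(analysis.get("objectsResult", {}).get("values", []), "red")
    draw_boxes(analysis.get("customModelResult", {})
                       .get("objectsResult", {})
                       .get("values", []), "blue")
    image.save(out_path)
```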

Our initial training data set was very small, and so the model performance was, as is to be expected, relatively poor. For the most part, it was strongly over-matching, identifying other institutional logos, wall sockets, faces and shoes as being Preservica logos (see Figure 3). However, it should be noted that the focus of this investigation was not the actual performance of the model, but whether that performance could be improved through simple-to-use feedback from users.

We extended our application to enable a modified renderer, this time only highlighting results from the custom model. Further, we made this overlay user-editable, so that users could remove boxes and draw new ones to “re-annotate” the image. Once the user is happy with the number and positioning of these bounding boxes, they can generate a COCO annotation file for the image (Figure 4).

Figure 4 - Screenshot showing the annotation creation GUI, alongside the COCO annotation that it creates

Using this to update our custom dataset involves writing the image itself and the annotation file to a location for which we have a URI that can be accessed by AI Vision; for the purposes of this work, we simply used Azure Blob Storage under the same Azure account. We could then update the dataset to add in references to this annotation file.
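
A sketch of that upload step is shown below, using the `azure-storage-blob` client library; the connection string, container and blob names are placeholders, and the subsequent dataset update call is only indicated in a comment since its exact form depends on the API version in use.

```python
# Sketch of writing the re-annotated image and its COCO file to Azure Blob
# Storage so that AI Vision can reach them by URI. Account, container and
# blob names are placeholders.
import json

from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<storage-connection-string>"  # placeholder
CONTAINER = "training-data"                        # placeholder


def upload_training_pair(image_name: str, image_bytes: bytes, annotation: dict) -> str:
    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container = service.get_container_client(CONTAINER)

    container.upload_blob(image_name, image_bytes, overwrite=True)
    annotation_name = image_name.rsplit(".", 1)[0] + ".coco.json"
    container.upload_blob(annotation_name, json.dumps(annotation), overwrite=True)

    # The annotation file's URI is then added to the dataset's list of
    # annotation file URIs via the AI Vision dataset API (call not shown).
    return container.get_blob_client(annotation_name).url
```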

Updating the dataset is not on its own sufficient to improve performance. To achieve this, we also need to retrain the model using the updated dataset; in AI Vision, this actually means training a new model. This created an extra complication: to make use of the new model, our application had to track the status of the training (a lengthy process, taking close to half an hour even for our very limited dataset) and, once the new model was ready, switch over to using it.
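
The sketch below illustrates the polling loop this implies; the endpoint path and status values are paraphrased from our use of the AI Vision REST API and may differ between API versions, and a production implementation would also need timeouts and more careful failure handling.

```python
# Sketch of the "retrain and switch over" loop. Endpoint path and status
# values are indicative only. Training took on the order of tens of minutes
# even for our tiny dataset, so this has to run asynchronously to the rest
# of the application.
import time

import requests

AZURE_ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
AZURE_KEY = "<subscription-key>"                                         # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": AZURE_KEY}


def wait_for_model(model_name: str, poll_seconds: int = 60) -> None:
    status_url = f"{AZURE_ENDPOINT}/computervision/models/{model_name}"
    while True:
        status = requests.get(
            status_url, params={"api-version": "2023-10-01"}, headers=HEADERS
        ).json().get("status", "")
        if status.lower() in ("succeeded", "trained"):
            break  # new model ready: the application can switch over to it
        if status.lower() in ("failed", "cancelled"):
            raise RuntimeError(f"training ended with status {status}")
        time.sleep(poll_seconds)  # training is slow; poll infrequently
```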

In executing this process, we found that we could effect marginal improvements in the model’s performance by providing additional annotations from a small number of files.

This provided an end-to-end demonstration of being able to train, use and refine custom models from within the context of a digital preservation system.

Conclusion, Discussion And Future Work

In the course of this work, we successfully proved the concepts that we set out to demonstrate:

  1. that by making use of the APIs of a digital preservation system and an AI-backed metadata enrichment tool, we could fully automate the generation of descriptive metadata for content that can enable discovery and presentation;

  2. that by using a COTS tool which enables layering of models, we could leverage the benefits of both large-scale pre-trained models and custom, domain-specific models within a single AI tool;

  3. that by making use of APIs, we could enable the digital preservation system to provide users with a simple interface for refining the custom model.

However, in doing this we have encountered a number of areas requiring further investigation and work before this could become a generally available product. The main area of interest here is controlling the creation of a new custom model.

Training a new model was found to be expensive both in terms of the elapsed time required and the relative financial cost. AI Vision is a commercial tool, and as such there are costs associated with all aspects of using it; but whereas running an analysis on an individual image costs fractions of a cent, training a model against even our very small dataset costs in the tens of dollars.

If the dataset is only receiving updates that represent a small percentage of the total size of the dataset, we can expect the improvements in performance to be marginal. This also affects how frequently re-training should be performed.

It is clear to us that re-training the model with every submission of content would not be efficient in a cost-benefit analysis. It is not, however, clear where the threshold for that decision lies, whether the decision can even be automated, or whether it will remain a judgment call made by an administrative user deciding when to re-train.
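
One possible automated heuristic, shown purely as an illustration and not something implemented in this work, would be to re-train only once the new annotations reach both a minimum count and a minimum fraction of the existing dataset:

```python
# Purely illustrative heuristic (not implemented in this work) for deciding
# whether an automated re-train is worth triggering: only re-train once the
# new annotations amount to a meaningful fraction of the existing dataset.
def should_retrain(new_annotations: int, dataset_size: int,
                   min_fraction: float = 0.2, min_new: int = 10) -> bool:
    if dataset_size == 0:
        return new_annotations >= min_new
    return (new_annotations >= min_new
            and new_annotations / dataset_size >= min_fraction)
```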

The core of this work has been the development of an application that sits between the DPS and the AI service provider, and although we have necessarily required actual implementations of each of these systems, the ideas should be applicable to any DPS and any AI tool with sufficiently featured APIs. Indeed, the AI tool can be abstracted into a broader “metadata-provider” that could take many other forms such as existing metadata catalogues and data sources; of course, the user-feedback loop on these is unlikely to be needed.

In the future we also intend to investigate other AI service providers and/or products that could build on the exciting possibilities hinted at by this work.

ACKNOWLEDGMENTS

This work was accelerated through participation in Microsoft’s Partner Hackathon series; the authors would like to thank the Microsoft staff for their input.
