Preserving Users’ Knowledge of Contemporary and Legacy Computer Systems: Automated Analysis of Recorded User Activity
Abstract – Preserving and ensuring access to operational knowledge is essential for future users of emulation technologies. Emulation allows for the collection of a vast array of computer systems and software products, each requiring specific operational knowledge for effective use. Beyond manuals and documentation, an important source of targeted operational knowledge comes from observing experienced users in action as they interact with these systems. In this article, we propose an automated approach to document user interactions during an emulation session. We then apply computer vision (OCR) and modern large language models (LLMs) to automatically analyze and distill knowledge from the captured data. This process allows the extracted (text-based) information to be easily retrievable via standard search utilities; furthermore, the information is prepared to become actionable for both humans and machines.
This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Elizabeth Kata, Sharon McMeekin, Nathan Tallman and 1 anonymous reviewer. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.
Introduction
A lot of effort has been invested in the past years to make emulation as a preservation tool more accessible for a wider range of stakeholders, including non-technical users. Ongoing work continues to focus on providing integrated emulation workflows, such that digital objects from a repository can be seamlessly accessed and rendered, and software performance can be reproduced. For example, the Emulation-as-a-Service Infrastructure (EaaSI) program of work, which is built upon the Emulation-as-a-Service (EaaS) framework, has made strides since 2018 to make emulation “easier” for digital preservation practitioners and end-users in research and memory institutions [1]. The EaaSI platform provides a browser-based interface that enables practitioners to access multiple emulators and configure emulation environments that meet the needs of their specific collections and researchers. The shared expertise among digital stewardship practitioners, cross-institutional collaboration, and inter-generational knowledge management are already essential for digital preservation and will only become more critical in the future. Previous research by the Software Preservation Network [2] has revealed both the need for, and the challenge of, repeatable workflows that busy practitioners can use to preserve, document, and provide access to highly heterogeneous software and software-dependent collections—including through emulation.
The field can already look back on more than 40 years of lively and culturally relevant computer history. In a software preservation context this history creates usability challenges for the current generation of users. Even if practitioners have the time, resources, and skills to engage with emulation, they cannot possibly be expected to possess fluency in navigating every operating system or software landscape of the past. For example, a digital preservation librarian who successfully renders an important architectural drawing collection from their institution’s holdings in emulation may not have the time to cultivate expertise in using the historical computer-aided design software that was emulated to produce the rendering [3]. The librarian’s public services and reference colleagues may have even less capacity to guide end-user researchers in wayfinding through the software or the emulation environment. Because significant resources are required to surmount these obstacles and provide a productive research and access setting for end-users, there is a serious risk that emulation will remain out of reach as a preservation and access method for many institutions.
Over time, we can observe not only the growing complexity of objects and their system dependencies but also an ever-growing number of different, eventually obsoleted computer systems entering collections. With new software and hardware, new user interface concepts are introduced, and new skills and knowledge become required to operate these systems. For contemporary users, this is usually only a minor issue as they can adapt to a slow pace of incremental changes. However, over time, these changes accumulate, and knowledge on how to operate obsoleted systems falls out of use and ultimately becomes unavailable. Hence, documentation is and will be necessary to operate and access objects of interest. The main difficulty is to make documentation available in a way that supports users without asking them to read manuals of the full software stack.
In this article, we are investigating how the knowledge of experienced users can be captured, e.g., to support typical workflows and tasks such as maintaining computer systems, installing and configuring software, and common usage patterns like rendering and using preserved objects. Our goal is to analyze, generalize, and organize this knowledge, such that it can be used in a context-aware, simple, composable, and actionable manner, providing guidance, and eventually automation, as needed.
Capturing Knowledge of Contemporary Users
Knowledge from users who possess substantial experience in operating certain computer and software systems needs to be systematically documented in a manner that is both abstract and reusable to ensure that future users can comprehend and utilize it. Ideally, the documented knowledge should be actionable for users as well as automated systems that can support them in carrying out their tasks. This poses two challenges: articulating the contexts of legacy user interfaces in a manner that is comprehensible and searchable, i.e., discoverable by future users, and capturing user interactions such that these can be manually mimicked, automatically replayed, or reused in some other form.
Describing User Interface Interaction
User interfaces are typically designed not to require instruction manuals but rather to work with users’ existing expectations and mental models about how to use the system to accomplish their objectives. This industry-wide approach to user interface design was already evident in Apple’s 1981 goal to reduce users’ learning time to 20 minutes [4] and in IBM’s late-1980s “Common User Access” project, which worked on unifying user interfaces across different graphical and text-based systems [5], and it found its expression in leading user interface design literature [6]. As a result, computing devices have become “easy to use” exactly because there is no overarching stable vocabulary or visual style that users would need to learn first, even for common user interface elements and interaction patterns. For instance, the version of Google’s Material Design guidelines [7] current at the time of writing calls most elements “buttons,” even if they rather look like what previously would have been interpreted as a “hyperlink.” The document describes a “top app bar” that reveals otherwise invisible elements when users scroll content in a particular direction—an interaction pattern that would not have been deemed acceptable for desktop computing in the 1990s. What Apple’s Human Interface Guidelines [8] currently name “toggle” is called “switch” in Material Design. Contemporaneous users do not need to know these terms to use software, and they do not need to remember the semantics and behavior of elements they used to interact with on a previous version of a system; they just have to deal with small incremental changes on each update. A current example is the section on Augmented Reality in Apple’s Human Interface Guidelines, which at the time of writing references desktop elements like application windows. Assuming Augmented Reality interfaces will continue to be produced, it seems likely that the use of desktop-like windows and the associated vocabulary is merely transitional.
In summary, while writing documentation about their user interface interactions, users might not be aware of their vocabulary and reference model for describing what they are doing. They are likely to omit information that seems obvious at the time of creation. Hence, future users will increasingly have difficulties following instructions the more time passes since they were written.
Capturing and Replaying User Interactions
Within existing emulation frameworks (e.g., [1]), overarching tasks enacted on an emulated environment can be broken into discrete interactive sessions. For instance, the preparation of a specific software setup can be split into multiple installation and configuration steps. The states of an emulation environment before and after such a step physically express the changes made, e.g., as snapshots of the system’s disks. But these states do not contain information on how these changes were achieved. This information can be added manually by labeling intermediate states and adding notes describing the actions that were executed during a session [9].
It is technically trivial to collect and store this text-based information and associate it with session data such as snapshots of disks. Assuming that good and comprehensible descriptions are created and reasonably indexed, future users will have the option to follow the noted steps and bring relevant components of the system into a desired state. However, crafting such descriptions involves effort and dedication, and runs the risk of becoming incomprehensible after some time (cf. prev. section). This current state of knowledge collection matches Stage 1 in Table 1.
To complement unstructured session notes, user-generated events, such as mouse movements, key presses, touch, etc. could be captured (e.g., [10]). Such event recordings are technically easy to create and require no additional effort from the user. The collected events can theoretically be played back automatically, but the results are not necessarily reproducible: even when re-triggering the exact recorded inputs in the same software environment they were recorded in, the system’s reactions as well as the look and position of user interface elements are not necessarily deterministic and can turn out differently per session. For instance, systems can make on-screen elements appear at different points in time, and the position and appearance of elements can vary from the setting expected by the recording depending on time of day (“dark mode”), or previously executed actions. In summary, a fully automated recording of user activity is difficult to re-use as it is not comprehensible by humans and doesn’t reliably produce results in non-deterministic settings when simply replayed.
To improve both reliability of successful playback, and, more importantly, human comprehensibility, the user should annotate important interactions, by visually marking the important aspects (context) of the user interface and describing the exact intent of individual actions, such as “pick third item in dropdown menu.” Ideally a screenshot of the manipulated element would be included [11]. This annotation transforms a mere recording into recorded and executable documentation and greatly improves re-usability. However, this approach requires considerable effort and dedication by the user and therefore probably only makes sense for select important cases.
Deriving Requirements for Re-Usable User Interactions
Based on the previous discussion, we can conclude that combining different strategies is essential to eliminate extra work at the time of recording, at least for common and frequent cases, while still retaining the option to integrate further automation and fully reproduce interactions. This leads to the first requirement (R1): capturing the user’s interactions with a computer system in a fully automated way, without additional effort for the user. We therefore require a recording that is complete, i.e., no observable information is missed after a session has ended. In addition, the result should be comprehensible, especially for human users, and individually reusable. Both completeness and comprehensibility are crucial for the recordings’ long-term sustainability, as these recordings may be impossible to recreate in the future due to a lack of experienced users.
The second requirement (R2) is that the captured knowledge can be easily found by future users. It is therefore important to describe the activities within a capture in a manner that facilitates easy discovery. Users should be able to search for specific tasks that would be covered either by a capture of an entire session or by its segments. To ensure wider and effective reuse of the captured knowledge, it must be translated into a descriptive, searchable representation, which is more abstract than just describing the individual input events the user triggered during recording.
Finally, once the captured content is in a descriptive format, it can be further transformed and condensed to support various reuse scenarios, including automation. Automated replay of captured events is important for maintaining an ever-growing number of software setups. It needs to be designed so that the captured knowledge can be applied in other (matching) contexts and integrate into user workflows that weren’t necessarily anticipated at the time of recording. Therefore, the third requirement (R3) is an abstract and generalized representation that supports translation into a technical (machine-actionable) representation and ultimately becomes composable.
Related Work
Woods and Brown previously highlighted the importance of contextual knowledge in emulation scenarios more than a decade ago [12]. They developed software to assist users in operating and navigating emulated environments. The primary purpose of this software was to organize and execute scripts, which encapsulated the necessary operational knowledge for installing, preparing, and accessing legacy software from a CD-ROM collection. The intention of the authors was similar—providing tools created by contemporaneous users to aid future users with common tasks, rather than solely relying on written documentation. However, their approach had limitations due to the significant effort required not only to learn the scripting language but also to develop the scripts. Additionally, the scripts were applicable only in very specific settings, thus offering little benefit for reuse in other contexts.
Today, a very popular method for sharing operational knowledge is through videos, more specifically, screencasts, which allow viewers to see the interaction with the system “through the user’s eyes.” Instructional videos and even live streaming, e.g., of gameplay or productivity tools, have become a huge phenomenon with millions of viewers and a new alternative to written documentation [13]. Different from text, videos are not easily searchable. To find targeted information, either time-tagged transcripts or semantically enriched annotations are required. Until recently, these were difficult to produce and typically required manual work by users [14]. As part of the recent developments in computer vision and neural networks, methods for understanding the content of screencasts and automated information extraction are being explored, for instance to derive code editing steps from programming tutorial videos [15],[16]. A study was able to successfully identify user actions on mobile applications [17] to assist with the annotation of instructional videos.
Reliably replaying the activities depicted in a screen recording in an interactive scenario is challenging. Different methods with or without user assistance are typically used for UI testing or debugging (e.g., [18]) or for UI performance evaluation (e.g., [19]). Recently, LLMs have been utilized to control a simple user interface and to translate and replay user-generated bug reports [20]. Deep learning methods were applied to improve the success rate of reliably replaying recordings of user actions [21].
For our goal of assisting future users, we build on these concepts but aim to provide both an abstract and an actionable description, ideally making generic workflows replayable.
Recording the User’s System Interactions
As a first step, we explore the practical aspects of achieving our initial requirement: a complete and comprehensive capture of user interactions with computer systems (R1). The goal is to capture every observable detail during a user session, i.e., every detail that involves a user performing an input action or any legible reaction from the computer system the user is interacting with. The capturing process should not require any additional effort on the part of the user.
A typical emulation usage scenario today builds on a remote desktop session. This setting is ideal for observing a user interacting with a computer system because it restricts the exchange between the user and the computer system to well-defined events specified by the protocol used for the remote desktop session. While the user might not be able to interact in every possible way with the emulated computer system—e.g., swipe gestures are not supported in the protocol—their interaction with the system can be guaranteed to be recordable. The technical format for recording these events needs to be as widely deployed as possible to ensure reusability and long-term sustainability of the recordings, even in future contexts not originally envisioned. We have limited our consideration to standard formats and discovered that no suitable format exists that can encapsulate both the system’s video output and the user’s input events.
We decided to keep input and output information separated and searched for a concept that interrelates both information streams, and finally implemented a system to capture emulator output using a standard video format accompanied by a text track (“subtitles”) file in WebVTT format [22]. While the video file alone would not be enough to completely represent the user’s interaction with the computer system, the text track associates timestamps with input events (key press or release; mouse button press or release; mouse move), providing a description of the input event in JSON format. This includes the type of event and its on-screen coordinates, or the keys pressed. The resulting video and the VTT (text) file are (individually) reusable, without the requirement for any specific tool and are likely preservable without issues.
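To illustrate this concept, the following sketch shows how such a text track could be produced. It is not the code of our prototype; the event schema and file names are assumptions made for the example.

```python
import json

def format_ts(ms: int) -> str:
    """Format a millisecond offset as a WebVTT timestamp (HH:MM:SS.mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def write_event_track(events, path="session.vtt"):
    """Write captured input events as WebVTT cues with JSON payloads.

    `events` is a hypothetical list of dicts such as
    {"t": 4230, "type": "mousedown", "button": "left", "x": 412, "y": 288},
    where "t" is the offset in milliseconds relative to the video start.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for ev in events:
            start = format_ts(ev["t"])
            end = format_ts(ev["t"] + 1)  # cues need a non-zero duration
            payload = {k: v for k, v in ev.items() if k != "t"}
            f.write(f"{start} --> {end}\n{json.dumps(payload)}\n\n")

write_event_track([
    {"t": 4230, "type": "mousedown", "button": "left", "x": 412, "y": 288},
    {"t": 4310, "type": "mouseup", "button": "left", "x": 412, "y": 288},
])
```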
The result of this step is a reusable capture of a user’s emulation session. Provided that the available input devices are similar enough to what was used during the recording session (in the future, legacy input devices can also be provided in virtual form, for example, as an on-screen keyboard or trackpad), it is likely that activities documented in video form can be manually mimicked by future users without detailed knowledge of the actual software stack. This reduces the remaining requirements for manual text descriptions to declaring the intent of the performed and recorded actions. There is little to no long-term risk in storing this session information, as it remains accessible in a format that is ready for both current human and future machine consumption. This result is represented as Stage 2 in Table 1.
Describing User Actions
Once we have recorded screen output along with user input, we need to transform the recording into a searchable representation (requirement R2). This ensures that once the recording is saved, it remains findable. To achieve this, we need to analyze the recording to gain a deeper understanding of what occurs within the video, since the video and associated events are, in their current form, only visually comprehensible. The objective of this step is to create a "transcript" of the actions performed. A simple text-based representation would be beneficial since a lot of formal (e.g., manuals) and informal (e.g., blogs, user forums) documentation is already text-based.
Not every video frame contains valuable information, especially during typical "desktop" sessions, when the screen remains static while the user is considering their next action. Since our goal is to describe the user’s actions, we only analyze frames with active input events. More specifically, for our prototype, we focus solely on mouse events. That is, for a subset of possible action recordings using a desktop setting and containing only mouse events, our working hypothesis is that it should be possible to deduce a user’s planned action or task from a series of mouse clicks. In the next iteration of the prototype, we plan to expand this assumption to include additional events such as keyboard inputs and mouse-drag operations, among others. Based on this initial assumption, our prototype selects only key frames where a mouse click was observed for further analysis. The result is a series of observations, each consisting of a screenshot paired with an associated mouse event. This includes the mouse position as absolute coordinates and the mouse button pressed (e.g., left, right, etc.).
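The key-frame selection could look roughly as follows. The sketch pairs mouse-click cues from the text track with the corresponding video frames, reusing the hypothetical cue schema from the previous sketch and relying on OpenCV merely for illustration.

```python
import json
import re
import cv2  # pip install opencv-python

CUE_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) -->")

def click_cues(vtt_path):
    """Yield (offset_ms, event_dict) for every mouse-click cue in the VTT file."""
    with open(vtt_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    for i, line in enumerate(lines):
        m = CUE_RE.match(line)
        if not m:
            continue
        h, mnt, s, ms = map(int, m.groups())
        offset = ((h * 60 + mnt) * 60 + s) * 1000 + ms
        event = json.loads(lines[i + 1])  # JSON payload follows the timing line
        if event.get("type") == "mousedown":
            yield offset, event

def extract_key_frames(video_path, vtt_path):
    """Return a list of (frame_image, event) pairs, one per mouse click."""
    cap = cv2.VideoCapture(video_path)
    observations = []
    for offset, event in click_cues(vtt_path):
        cap.set(cv2.CAP_PROP_POS_MSEC, offset)  # seek to the click's timestamp
        ok, frame = cap.read()
        if ok:
            observations.append((frame, event))
    cap.release()
    return observations
```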
Given this information, we needed to select a technology capable of converting it into text that is both human-comprehensible and searchable. We aim to use standard, off-the-shelf technologies and methods to ensure that our analysis is not reliant on any specific implementation or vendor, and opted for two different methods: OCR (Optical Character Recognition) and a “vision-enabled” multimodal large language model (LLM) [23]. The ability to translate an image into a textual description through configurable prompts is one of the most significant developments in this field. However, even state-of-the-art models are currently unable to locate specific UI elements precisely, for example, to give the pixel-exact position of a menu entry or an “OK” button. This led us to use OCR to extract layout information of labeled UI elements, which, combined with the stored mouse coordinates, provides the labels of clicked elements.
Both OCR and the multimodal LLM are generic tools suitable for interpreting the visual output of computer systems from different technical eras, and they provide rather accurate descriptions and location information. We chose PaddleOCR [24] due to its strength in identifying labels and describing the screen’s layout (see Figure 2). We used GPT-4 Vision [25] and LLaVA [26] as multimodal LLMs. With the exception of GPT-4 Vision, all vision components were executed on a local machine.
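As an illustration of how OCR output and the stored mouse coordinates can be combined, the following sketch returns the label closest to a click position. It assumes PaddleOCR’s commonly documented result layout (a list of bounding-box/text pairs per image) and is not the exact code of our prototype.

```python
from paddleocr import PaddleOCR  # pip install paddleocr

ocr = PaddleOCR(lang="en")  # models are downloaded on first use

def clicked_label(frame, x, y, max_dist=40):
    """Return the OCR-detected label closest to the click position (x, y).

    Assumes each OCR entry has the shape [bounding_box, (text, confidence)],
    where bounding_box is a list of four corner points.
    """
    result = ocr.ocr(frame)
    best, best_dist = None, max_dist
    for box, (text, conf) in result[0] or []:
        # Use the centre of the bounding box as the element's position.
        cx = sum(p[0] for p in box) / 4
        cy = sum(p[1] for p in box) / 4
        dist = ((cx - x) ** 2 + (cy - y) ** 2) ** 0.5
        if dist < best_dist:
            best, best_dist = text, dist
    return best
```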
Due to our choice of technology and tools, our approach is currently limited to user interfaces with labeled UI elements. For purely symbolic user interfaces consisting only of graphical icons without labels, e.g., most games, methods as suggested in [27] could be utilized.
The user interacts with the "Display Properties" dialog box in a Windows environment, specifically within the "Screen Saver" tab. Initially, no screen saver is selected ("None"), and the system is set to activate the screen saver after 10 minutes of inactivity, without password protection upon resumption. The user has options to customize the screen saver settings via the "Settings" button, preview changes with the "Preview" button, and adjust power management settings through the "Power..." button. Clicking the "OK" button will apply any changes made, closing the dialog box and implementing the settings, such as activating a chosen screen saver after the specified period of inactivity.
Example 1: A summarized description of Figure 1.
The result of this analysis is a description of the user’s session recording for every key frame. Example 1 shows a summary description of the screenshot depicted in Figure 1. Due to a stable and complete representation of the user’s session (R1), the approach also allows for re-analysis and re-evaluation of the key frames or the complete video at a later point in time, taking advantage of any future improvements in LLM “vision” capabilities. Our initial results are encouraging, as it is already possible to perform meaningful (text-based) searches over recorded sessions.
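For illustration, a per-frame description such as the one in Example 1 could be requested from a vision-enabled model roughly as follows. The prompt wording and model name are assumptions, and any comparable multimodal model could be substituted.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def describe_frame(png_path, x, y):
    """Ask a vision-enabled model to describe a key frame and the clicked element."""
    with open(png_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    prompt = (
        "This is a screenshot of a legacy desktop system. Describe which dialog "
        f"or window is shown and what a mouse click at position ({x}, {y}) "
        "would most likely do."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption; any vision-capable model could be used
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```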
Abstract Text-based Workflow Descriptions
In order to achieve reproducibility and composability of recorded actions, both user input and emulator output must be understood on a more abstract, semantic level: instead of adhering to exact screen coordinates and pixel patterns, a reduced action plan is needed in the form of a simple, human-readable description of the activities to perform—such as “Open the control panel and select printer setup” (R3). From our initial requirements, we have already improved comprehensibility and findability by describing each key frame. The next goal is to create an abstract representation of the observed actions.
We apply core features of large language models (LLMs)—text abstraction, summarizing, and translation—to the textual representation of the descriptions created in the previous step. LLMs are "pre-trained" on a vast amount of text data [28], and through this training process a comprehensive probabilistic ontology of real-world concepts has been built. Thus, today’s LLMs are quite capable of working with natural language, particularly its semantics and correlations. We leverage this capability in two ways: First, through its understanding of real-world concepts, an LLM can generalize and summarize the description of the user’s interactions, e.g., by abstracting the language and removing unnecessarily specific or detailed information. Secondly, these condensed and focused descriptions can be translated into a series of precise, descriptive user instructions. These consist of a precondition for the action, a description of the action, and the expected outcome. For both precondition and expected outcome, descriptions of the previous and following video frames are added as additional context to the LLM prompt. Example 2 shows the instructions derived from Figure 1.
Precondition: 'Display Properties' dialog box is open on the 'Screen Saver' tab. Action: Click the 'Power' button. Expected Results: 'Power Options Properties' dialog box is open.
Example 2: Abstract action description from Figure 1
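The derivation of such instructions from the per-frame descriptions could be sketched as follows. The prompt wording is an assumption, and the `client` object is the hypothetical one from the previous sketch.

```python
def derive_instruction(prev_desc, frame_desc, event, next_desc):
    """Condense three consecutive frame descriptions and one input event
    into a Precondition / Action / Expected Results instruction."""
    prompt = (
        "You are summarizing a recorded user session on a legacy desktop system.\n"
        f"Screen before the action: {prev_desc}\n"
        f"Screen during the action: {frame_desc}\n"
        f"Input event: {event}\n"
        f"Screen after the action: {next_desc}\n"
        "Describe this step as one line in the form\n"
        "Precondition: ... Action: ... Expected Results: ...\n"
        "Generalize: omit pixel coordinates and incidental details."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption; any sufficiently capable text model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```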
The result of this step is an automatically annotated video with user instructions that can be found using text-based search. Although the descriptions may not be 100% accurate in all cases, the additional video with original event information remains available and helps to disambiguate any given situation. Furthermore, the instructions are generalized and abstracted, such that this sequence of instructions can, in a potential future evolution of the built prototype, be applied in similar yet different settings. This result reflects Stage 3 in Table 1.
Preliminary Results
For an initial evaluation, we have recorded 23 videos from 8 different operating systems ranging from Apple Mac OS 7.5 to Windows XP. All systems use a desktop metaphor, and the recorded videos focus on typical tasks with these systems, such as installing a printer, printing a document, or configuring a screensaver. The interpretation of videos with a series of simple click events works reasonably well (a formal evaluation is pending). Example 3 shows a list of actions derived from a 38-second recording using a Windows XP operating system.
1. Precondition: User is at the Windows desktop with no applications open. Action: Right-click on the desktop and hover over 'Properties'. Expected Results: A context menu appears with various options including 'Properties'.
2. Precondition: 'Properties' is visible in the context menu. Action: Click on 'Properties'. Expected Results: The 'Display Properties' dialog box opens.
3. Precondition: The 'Display Properties' dialog box is open. Action: Navigate to the 'Themes' tab. Expected Results: Options for managing the operating system's visual theme are displayed.
4. Precondition: The 'Themes' tab is selected in the 'Display Properties' dialog box. Action: Select a theme from the dropdown menu. Expected Results: The preview pane updates to show the selected theme's appearance.
5. Precondition: A theme is selected in the 'Themes' tab. Action: Navigate to the 'Screen Saver' tab. Expected Results: Options to customize, preview, or apply the screen saver settings are displayed.
6. Precondition: The 'Screen Saver' tab is selected. Action: Click the dropdown menu and select the 'Windows XP' logo screen saver. Expected Results: The 'Windows XP' logo screen saver is selected.
7. Precondition: The 'Windows XP' logo screen saver is selected. Action: Navigate to the 'Power Options Properties' dialog by clicking the 'Power...' button. Expected Results: The 'Power Options Properties' dialog box opens with various power management settings.
8. Precondition: The 'Power Options Properties' dialog box is open. Action: Select the 'Home/Office Desk' power scheme. Expected Results: The 'Home/Office Desk' scheme is selected with options for monitor, hard disks, system standby, and hibernation times displayed.
9. Precondition: The 'Home/Office Desk' power scheme is selected. Action: Change the 'Turn off monitor' option to 'After 20 mins'. Expected Results: The monitor is set to turn off after 20 minutes of inactivity.
10. Precondition: The 'Power Options Properties' dialog box is open. Action: Navigate to the 'Hibernate' tab. Expected Results: Options for managing hibernation settings are displayed.
11. Precondition: The 'Hibernate' tab is selected. Action: Select 'Never' from the dropdown menu in the 'Settings for Always On power scheme' section. Expected Results: The system is configured to never automatically enter hibernation mode.
Example 3: Disabling the Screensaver for a Windows operating system
This description is already very useful and accurate; however, in steps 4 through 6, default settings that were already present were misinterpreted as target settings, although they were not manipulated during the recording and do not contribute to the user’s intent of disabling the screensaver. From this short and highly abstracted summary, it would not be difficult to remove the superfluous steps manually in a quick review. For tasks that require more steps, however, even experienced users might overlook such errors.
1. Open the Control Panel from the Start menu.
2. Select the 'Switch to Classic View' option.
3. Double-click on the 'Power Options' icon.
4. Select the 'Power Schemes' tab.
5. Choose the 'Home/Office Desk' scheme and set 'Turn off monitor' to 'After 20 mins'.
6. Click 'Apply' or 'OK' to save the changes.
7. Select 'Never' from the dropdown menu under 'Settings for Home/Office Desk power scheme' for the 'System hibernates:' option.
8. Access the 'Display Properties' dialog box via desktop right-click or Control Panel and select the 'Themes' tab.
9. Select a modified theme and click 'Apply'.
10. Open the 'Display Properties' dialog box and navigate to the 'Screen Saver' tab.
11. Select a screen saver from the dropdown menu and set the wait time.
Example 4: Disabling the Screensaver for a Windows operating system, recorded with a different Windows theme and using a different UI path.
In Example 4 (for brevity, we are only showing the generated list of action items here), we recorded the same action of disabling the screensaver again, in a slightly different way and with a different Windows visual theme configured. While a recording/playback system strictly based on visually matching recorded UI elements would not be able to match these two recordings, our LLM-based approach produces a list of actions very similar to the one in Example 3. While totally effortless for the recording user, such a list of actions could already serve as good documentation of the actions required to, e.g., disable the screensaver in Windows. One particularly interesting item is the second step of “Switching to Classic View.” While this step seems trivial for an experienced user and, thus, would easily have been forgotten when writing such documentation manually, it is a very important step for users not familiar with the two views of the Control Panel that Windows XP offers. If step 2 were left out, the Control Panel would look completely different and users would not be able to complete the following steps.
1. Double-click the 'Harddisk' icon.
2. Navigate to and double-click the 'Apple Extras' folder.
3. Navigate to and double-click the 'Apple LaserWriter Software' folder.
4. Double-click the 'Desktop Printer Utility' icon.
5. Select 'LaserWriter 8' from the 'With' dropdown menu and highlight 'Translator (PostScript)' under 'Create Desktop...'.
6. Click 'OK' to proceed with printer creation.
7. Enter a name for the printer and click 'Save'.
8. Open the utility, select a 'Generic' PostScript Printer Description (PPD) file, choose 'Desktop' as the default destination folder, and click 'Create...'.
Example 5: Installing a printer on Mac OS 9
In Example 5 (again, for brevity, only the generated action items are shown), we recorded the installation procedure for a printer on Mac OS 9. While the actions generally serve as good documentation of the steps carried out, step 7 shows that the LLM is trying to abstract the workflow and does not state the name we actually chose for the printer (“PDF”) but states to “enter a name.” In step 8, it asks to “open the utility,” which, in fact, is the utility already opened in step 4 and which was partly hidden by other windows in the previous steps.
This illustrates that the automatic interpretation of complex input events requires additional work. For instance, a sequence of a mouse button down event, followed by a series of mouse moves and a final mouse button up event, needs to be combined and interpreted as a compound mouse drag event. With the introduction of touchpads and gestures, the concept of compound events is becoming even more important. A similar problem is posed by keyboard inputs, which need to be combined meaningfully into a single string, since typing character by character does not carry much information in between the keypresses. For these cases, usually the frame at the beginning and the frame at the end of a compound event are of interest. On the other hand, if the user is using keyboard shortcuts to control the UI, every key input or key combination matters individually. Further improvements in LLMs’ vision capabilities, as well as their increasing bandwidth and speed, will improve the visual descriptions and thus make it possible to better correlate, interpret, and describe the recorded input events.
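A possible way to aggregate raw input events into compound events before analysis is sketched below; the event names, fields, and thresholds are assumptions rather than the prototype’s actual logic.

```python
def compound_events(events, key_gap_ms=1500):
    """Collapse raw input events into compound events:
    mousedown + moves + mouseup -> a single 'drag' (or 'click' if the pointer
    did not move), consecutive key presses -> a single 'type' event carrying
    the typed string."""
    result, i = [], 0
    while i < len(events):
        ev = events[i]
        if ev["type"] == "mousedown":
            # Find the matching mouseup; intermediate mousemove events are absorbed.
            j = i + 1
            while j < len(events) and events[j]["type"] != "mouseup":
                j += 1
            end = events[j] if j < len(events) else ev
            kind = "drag" if (end["x"], end["y"]) != (ev["x"], ev["y"]) else "click"
            result.append({"type": kind, "from": (ev["x"], ev["y"]),
                           "to": (end["x"], end["y"]), "t": ev["t"]})
            i = j + 1
        elif ev["type"] == "keypress":
            # Merge key presses that follow each other closely into one string.
            text, t0 = ev["key"], ev["t"]
            j = i + 1
            while (j < len(events) and events[j]["type"] == "keypress"
                   and events[j]["t"] - events[j - 1]["t"] < key_gap_ms):
                text += events[j]["key"]
                j += 1
            result.append({"type": "type", "text": text, "t": t0})
            i = j
        else:
            result.append(ev)
            i += 1
    return result
```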
Applications and Next Steps
An effortless recording and the seamless integration of recording capabilities into emulation workflows are just an initial step toward producing a high-quality collection of operational knowledge. Integrating this knowledge back into users’ work environments to assist operation would be a highly desirable next step. By using computer vision to match against recorded knowledge, particularly against an abstract representation of the preconditions of an action, users can receive context-dependent guidance when manually reproducing actions. The system will also be able to recognize when a task has been completed and link that information to an environment’s metadata. This describes Stage 4 in Table 1.
Furthermore, we can use LLMs to translate the instructions into a machine-actionable representation so that the instructions can be carried out automatically. That would be Stage 5 in Table 1.
Another significant advantage of an LLM is its ability to work with different human languages. Collections may include software and other digital objects from various cultural areas and languages. As a result, a user may encounter a user interface in an unfamiliar language. Many of today’s LLMs are trained in multiple languages and can seamlessly switch between them. Therefore, the same recorded instruction for configuring the system’s screensaver can be applied to systems in different languages.
Finally, users could avoid direct interaction with the computer system when they must contend not only with the barrier of a foreign language but also, at some point, a technological one. This means they are required to operate the system through unknown or unfamiliar (emulated) input devices. LLMs can bridge this gap by providing a simple and universal user interface, contemporary natural language, while drawing on high-quality and verified expert-generated knowledge.
Table 1 illustrates a possible pathway to implementing this functionality in an emulation framework that could then be used by preservation professionals. Stage 1 represents the current stage of emulation frameworks. Already at Stage 2, with the introduction of automatic video recording, enough data would be collected that could, at Stage 3, be processed into searchable, illustrated manuals; at Stage 4, be used to provide in-session, context-sensitive assistance; and at Stage 5, be turned into executable instructions for configuring and migrating large numbers of software environments without manual intervention. If the technical component enabling a particular stage becomes unavailable, the system can fall back on the previous one, with Stage 3 being the most productive level to keep active over time and Stage 5 being the most user-friendly. However, even at Stage 2, materials generated at higher levels would remain actionable for manually following instructions in text, image, and video form.
Conclusion
Assisting users in operating obsoleted software and computer systems that are unfamiliar to them is already a significant challenge in digital preservation that will become even more relevant in the future. Reconstructing necessary knowledge and actions for unknown software from written documentation alone is and will be tedious. To support future users, it is beneficial to capture and preserve the expertise of knowledgeable users today. Unlike anecdotal knowledge or software manuals, a recorded demonstration of a given task by an experienced user has much higher practical value and will likely lead to its successful reproduction.
In this paper, we have investigated the requirements for capturing user knowledge by simply observing the user’s input and a computer system’s visual output. We have shown that it is technically feasible to produce self-contained documentation without burdening users with additional work. Furthermore, we have demonstrated that this documentation can be transformed into a findable, generalized, and therefore already useful form, by only using off-the-shelf technologies. Currently, a lot of resources are devoted to the development of LLMs. This fact will probably translate to further improvements for our needs.
Acknowledgements
Part of the activities and insights presented in this paper were made possible through the NFDI consortium DataPLANT, 442077441, supported through the German National Research Data Initiative (NFDI 7/1).
Klaus Rechert (University of Applied Sciences Kehl), Dragan Espenschied (Rhizome), Rafael Gieschke (University of Freiburg), and Wendy Hagenmaier (Yale University Library)