Abstract – Digital preservation refers to the series of managed activities necessary to ensure continued access to digital materials for as long as necessary. It has attracted widespread attention from institutions and individuals and has been researched extensively. This study constructs a literature dataset based on Web of Science and Scopus, and applies the BERTopic model to identify topics and analyze development trends in the literature. The results show that research topics in digital preservation mainly include data preservation, cultural heritage preservation and related technologies, metadata and models, risk assessment and management, preservation of personal information, and website preservation. The peak publication years differ across research topics, indicating that the attention each topic receives fluctuates over time. Digital preservation of personal information and of cultural heritage has seen a steady rise in attention in recent years, marking them as emerging and popular research topics.
Keywords – Digital preservation, long-term preservation, BERTopic, Topic Modeling.
This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Elisa Rodenburg, Dr. Stephen Abrams, William Schlaack and 1 anonymous reviewer. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.
Digital objects are composed of bitstreams, that is, sequences of 0s and 1s, and specific software or even hardware is required to interpret their content for human users. Unlike physical objects, digital objects have many characteristics that make them fragile and prone to loss. For instance, a book from 200 years ago can still be read today, and damage such as torn corners or wormholes is clearly visible. The situation is entirely different for digital objects. A video file from 10 years ago may be inaccessible today, either because of digital deterioration or because no compatible player is available. This scenario occurs frequently; the once-popular FLV format is a vivid example. Digital objects are susceptible to tampering, bit rot, and obsolescence.
The concept of digital preservation can be traced back to the 1990s. The earliest document found on Web of Science to fully use the term "Digital Preservation" dates back to 1991, in a research report from Cornell University [1]. With the rapid development of the Internet and the widespread adoption of personal computers during the 1990s, there was an explosive growth in digital objects, prompting increased attention toward their long-term accessibility. Consequently, research and practical efforts related to digital preservation proliferated.
Digital preservation has yet to have a unified definition. This study adopts the definition provided by the Digital Preservation Coalition (DPC), which states: "Digital Preservation refers to the series of managed activities necessary to ensure continued access to digital materials for as long as necessary ... refers to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological and organizational change" [2]. Any activity aimed at ensuring continued access to digital objects can be considered within the scope of digital preservation.
Initially, digital preservation focused on exploring the issues surrounding scanning, storing, retrieving, and providing access to digital images of brittle books [1]. Subsequently, its scope of concern gradually expanded, and its applications continued to grow.
In 1997, the Consultative Committee for Space Data Systems (CCSDS) released the first draft of the Open Archival Information System (OAIS) Reference Model, which was later approved by the International Organization for Standardization (ISO) as an international standard [3]. To enhance the capture of metadata required for the preservation of digital objects, the Online Computer Library Center (OCLC) and RLG sponsored a working group that developed PREMIS (PREservation Metadata: Implementation Strategies), which eventually became an international standard widely used by practitioners in digital preservation [4]. RLG, in collaboration with the Center for Research Libraries (CRL), published the Trustworthy Repositories Audit and Certification Checklist (TRAC), focusing on the trustworthiness of digital preservation [5]. Establishing a digital preservation repository that meets trustworthiness standards is often challenging for individual institutions. Various collaborative organizations, such as the DPC, have emerged to address the challenges posed by digital preservation. National- and regional-level institutions and initiatives, such as the Library of Congress (LOC), the National Library of Australia, and Digital Preservation Europe (DPE), have gradually intensified their focus on digital preservation. The scope of digital preservation objects has also expanded, encompassing 3D data [6], websites [7], research data [8], and more.
In order to keep track of the development status of digital preservation, many researchers have analyzed research progress and trends from various perspectives.
Gracy et al. reviewed research and professional literature on digital preservation published between 2009 and 2010, concluding that the main research focuses were on the challenges of library resource development, the impact of large-scale digitization on preservation, risk management, digital preservation and curation, and preservation education in the digital age. They observed a rapid increase in literature related to digital preservation and curation, an expanding scope and focus of research in the field, and growing attention to new technologies, tools, and issues. [9]
Burda et al. conducted a systematic review of 122 high-quality papers published between 1996 and 2011. Through coding and thematic analysis, they examined existing research progress in terms of drivers, stakeholders, scope, preservation requirements, and technical tools in digital preservation, concluding that there is a lack of in-depth exploration from an organizational perspective and highlighting the need for more research on cost-benefit analysis or decision-making in digital preservation. [10]
Murillo et al. collected and analyzed the journal articles and conference papers listed in the syllabi of ALA-accredited MLIS programs related to digital preservation. Using LDA for topic modeling, they identified the primary research themes in digital preservation as the contributions of libraries to digital preservation, technical requirements, digital archives, access and use, and research data management, noting that research data management and preservation are increasingly becoming key areas of focus.[11]
Patra et al. employed bibliometric methods to analyze academic papers published between 2001 and 2019, assessing publication patterns, document types, prolific authors, and contributing institutions, and conducted co-author network analysis.[12]
Ahmad et al. performed a meta-analysis of in-house activities and outsourcing phenomena in digital preservation, finding that early international research on digital preservation showed a tendency towards outsourcing. However, later studies indicated a complete shift towards in-house activities, suggesting that digital preservation is increasingly viewed as an intrinsic task of organizations.[13]
In summary, digital preservation has undergone over 20 years of development, with extensive research and practice conducted in areas such as metadata standards, reference models, and trustworthy certification. Researchers have analyzed the progress of digital preservation from various perspectives using methods like bibliometrics, topic modeling, and meta-analysis. However, based on the current literature, there is a lack of comprehensive thematic analysis in the field of digital preservation over the past two years. This study attempts to analyze the research topics in digital preservation using a new topic analysis method, BERTopic, with the aim of revealing certain patterns in the development of digital preservation research and providing thematic insights for future research and practice in this field.
In this study, Web of Science and Scopus were chosen as data sources, with publication years limited to 1991 through 2024. In the Web of Science database, searches were conducted across all databases using the keywords “digital preservation”, “digital curation”, “digital archiving”, “long-term preservation”, and “longterm preservation” for exact matching of subject terms. Similarly, in the Scopus database, the same keywords were used for exact matching in the "title, abstract, keywords" fields. Because terms like long-term preservation are also used in fields such as biology, medicine, and agriculture, searches were restricted to research domains such as computer science, social science, information science and library science, arts, and humanities in order to narrow the search scope and improve the relevance of the results. Additionally, searches were limited to English-language publications and to specific document types such as Article, Meeting, and Dissertation Thesis, excluding patents and other types. Duplicate entries were removed using the reference management software EndNote, followed by manual screening of paper titles and abstracts by the authors to exclude papers that contained relevant keywords but did not pertain to digital preservation. The search was conducted on March 5, 2024. In total, 3066 bibliographic entries were included in the study dataset.
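The deduplication step in this study was carried out in EndNote. Purely as an illustration of what merging records from two sources involves, the following minimal sketch removes duplicates by DOI, falling back to a normalized title plus year. The field names (`title`, `year`, `doi`), the sample records, and the matching rule are all assumptions made for the example, not the authors' procedure.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def record_key(record: dict) -> str:
    """Prefer the DOI as the identity of a record; fall back to normalized title plus year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    return "title:" + normalize_title(record.get("title", "")) + "|" + str(record.get("year", ""))


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical exports from the two databases (made-up records).
wos = [{"title": "Digital Preservation: A Survey", "year": 2010, "doi": "10.1000/abc"}]
scopus = [
    {"title": "Digital preservation - a survey", "year": 2010, "doi": "10.1000/ABC"},
    {"title": "Web Archiving in Practice", "year": 2012, "doi": ""},
]

merged = deduplicate(wos + scopus)
print(len(merged))
```

A rule this simple misses duplicates where only one copy carries a DOI, which is one reason a manual screening pass, as described above, remains useful.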
This study employs a deep learning-based topic mining method to identify themes and analyze developmental trends using the abstracts of all the papers. Common methods for topic modeling include Latent Dirichlet Allocation (LDA) [14] and Non-Negative Matrix Factorization (NMF) [15]. However, BERTopic has shown superior performance compared to these methods [16].
BERTopic is a technique for topic modeling that combines the BERT (Bidirectional Encoder Representations from Transformers) model with traditional clustering methods to automatically extract meaningful topics from textual data.
Initially, BERTopic utilizes a pre-trained BERT model to convert text into numerical vectors. These numerical vectors capture the semantic information of the text, making semantically similar texts closer in the vector space and providing a basis for the subsequent semantic clustering of the text. The semantic representation capabilities of different vector models vary and directly influence the effectiveness of semantic clustering. BERTopic can employ various pre-trained language models based on Transformers to generate text embeddings. After comparison, this study selects the all-MiniLM-L6-v2 model for text vectorization.
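To make "closer in the vector space" concrete, the sketch below computes cosine similarity between embeddings. The three 4-dimensional vectors are made up for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions and would come from the model itself.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy embeddings standing in for three abstracts (assumption, not model output).
emb_web_archiving = [0.9, 0.1, 0.0, 0.2]
emb_web_crawling = [0.8, 0.2, 0.1, 0.3]
emb_3d_scanning = [0.0, 0.9, 0.7, 0.1]

sim_related = cosine_similarity(emb_web_archiving, emb_web_crawling)
sim_unrelated = cosine_similarity(emb_web_archiving, emb_3d_scanning)

# Semantically related texts should score higher than unrelated ones.
print(sim_related > sim_unrelated)
```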
The embedded vectors are usually high-dimensional, which makes processing complex and computationally expensive. To reduce the complexity of subsequent clustering calculations, BERTopic simplifies the data structure through dimensionality reduction techniques while retaining the main semantic features. This study employs the UMAP (Uniform Manifold Approximation and Projection) algorithm for high-dimensional data dimensionality reduction.
After dimensionality reduction, BERTopic clusters the vectors, grouping semantically similar vectors into clusters to form distinct topics, with the corresponding texts comprising all texts for a given topic. The stronger the clustering algorithm's performance, the more accurate the topic representation will be. BERTopic typically uses the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm for clustering. HDBSCAN is a density-based clustering algorithm capable of identifying natural groupings (i.e., topics) in data and handling noise and outliers.
For each clustering result, BERTopic represents the topic by extracting the keywords most relevant to all the texts in that cluster. BERTopic employs c-TF-IDF (Class-based TF-IDF) for keyword extraction. c-TF-IDF is a variant of TF-IDF, adjusted to work at the cluster level instead of the document level. All texts in each cluster are converted into a single document, and by statistically analyzing the frequency of words/phrases across all texts in the cluster and further calculating their importance scores within the cluster, a number of keywords that best represent the cluster's theme are selected. These keywords are typically words that appear frequently and are distinctive in the cluster's texts.
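The c-TF-IDF idea can be sketched in a few lines. The version below follows the weighting described in the BERTopic paper, score(t, c) = tf(t, c) * log(1 + A / tf(t)), where tf(t, c) is the frequency of term t in cluster c, tf(t) is its frequency across all clusters, and A is the average number of words per cluster. It is a simplified illustration: it uses raw counts and whitespace tokenization, whereas the library applies its own tokenization and normalization, and the toy clusters are made up.

```python
import math
from collections import Counter


def c_tf_idf(cluster_docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Score each term per cluster; cluster_docs maps a cluster label to its texts."""
    # Step 1: merge all texts of a cluster into one document and count its terms.
    tf = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in cluster_docs.items()
    }
    # Step 2: frequency of each term across all clusters.
    total_tf = Counter()
    for counts in tf.values():
        total_tf.update(counts)
    # Step 3: average number of words per cluster (A in the formula).
    avg_words = sum(total_tf.values()) / len(tf)
    # Step 4: weight in-cluster frequency by how rare the term is overall.
    return {
        label: {
            term: freq * math.log(1 + avg_words / total_tf[term])
            for term, freq in counts.items()
        }
        for label, counts in tf.items()
    }


# Made-up toy clusters (assumption, not the study's data).
clusters = {
    "web": ["web pages archive", "web resources crawl archive"],
    "data": ["research data management", "research data curation archive"],
}

scores = c_tf_idf(clusters)
top_web = max(scores["web"], key=scores["web"].get)
print(top_web)
```

In this toy input, "archive" occurs in both clusters, so it is down-weighted relative to "web", which is frequent in one cluster only. That is the sense in which the selected keywords are both frequent and distinctive.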
The data processing workflow of this study is illustrated in Figure 1.
This study uses BERTopic for topic modeling of paper abstracts. To ensure that the topics are adequately focused, we set the min_topic_size parameter of BERTopic to 20, meaning that each topic cluster includes at least 20 papers. During the keyword extraction phase, we used a stop word list to filter out words that contribute little to the topics. We took the stop words from the Python NLTK module and added "digital preservation," "digital," "preservation," "long term," and "long-term" to the list. These additional terms were chosen because they appear frequently but do little to distinguish topics, and they may diminish the contribution of other relevant keywords to a topic.
We set the model's ngram_range parameter to (2, 3) so that each extracted key phrase consists of two or three words, which improves the semantic completeness of the extracted topic terms.
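The settings described above could be assembled roughly as follows. This is a sketch under stated assumptions rather than the authors' script: it assumes the `bertopic` and `scikit-learn` packages are installed, passes `ngram_range` through scikit-learn's `CountVectorizer`, and leaves dimensionality reduction and clustering at BERTopic's defaults. The function and constant names are hypothetical.

```python
# Settings taken from the text; the constant names themselves are hypothetical.
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
MIN_TOPIC_SIZE = 20
NGRAM_RANGE = (2, 3)
EXTRA_STOP_WORDS = ["digital preservation", "digital", "preservation", "long term", "long-term"]


def merged_stop_words(base_stop_words: list[str]) -> list[str]:
    """Combine a base stop word list with the domain-specific additions, without duplicates."""
    return sorted(set(base_stop_words) | set(EXTRA_STOP_WORDS))


def build_topic_model(base_stop_words: list[str]):
    """Create a BERTopic model with the settings above (assumes bertopic is installed)."""
    # Imported here so the settings can be inspected without the heavy dependencies.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(
        stop_words=merged_stop_words(base_stop_words),
        ngram_range=NGRAM_RANGE,  # key phrases of two or three words
    )
    return BERTopic(
        embedding_model=EMBEDDING_MODEL,
        vectorizer_model=vectorizer,
        min_topic_size=MIN_TOPIC_SIZE,
    )
```

A typical call would pass the NLTK English list, for example `build_topic_model(stopwords.words("english"))` after `from nltk.corpus import stopwords`, and then run `fit_transform` on the list of abstracts. Note that `CountVectorizer` matches stop words against single tokens before it forms n-grams, so in this particular setup the single-word entries do most of the filtering.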
Using the above settings for topic clustering, we identified 24 research topics in total. Given the large number of topics, this study selects the top 8 topics with the highest number of papers for detailed analysis, each involving over 80 papers. We visualized the top 5 contributing keywords and their contributions for each topic, as shown in Figure 2. From Topic 0 to Topic 7, the number of papers gradually decreases.
Topic 0 is about data preservation, with core keywords including “research data”, “data management”, “data curation”, etc. With the advent of the data era, topics related to data preservation have gradually attracted more researchers. Among all the papers included in the study, there are 200 papers on data preservation, making it the most common research topic. Research in data preservation includes the significance, current situations, challenges, practices, and case studies of data preservation, as well as strategies and pathways for data preservation. Some studies not only focus on data preservation itself but also on the preservation of data analysis software, indicating the complexity of the research topic. Simply preserving data without proper description and corresponding analysis software cannot be considered effective preservation.
Topic 1 focuses on effective digital preservation. General reference models and metadata description frameworks are the foundation for conducting effective digital preservation. Its core keywords include "reference model," "Dublin Core," and "metadata standards" etc. This research topic encompasses the emphasis by digital preservation institutions and researchers on normative and effective content preservation, resulting in extensive research and practice. It also includes the practical exploration by content preservation institutions and researchers using standard frameworks.
Topic 2, with core keywords including “3d models”, “3d laser”, “3d scanning”, etc., focuses on the modeling and long-term preservation of three-dimensional digital heritage objects such as buildings and cultural relics, using technologies such as 3D modeling and 3D laser scanning. This includes preserving many deteriorating three-dimensional objects, as well as screening, evaluating, and preserving data formats such as 3D data. Because cultural heritage comes in a rich variety of types, the need for preservation extends beyond two-dimensional content like ancient books and manuscripts to three-dimensional content like buildings, leading to extensive research in this area.
Topic 3 focuses on issues related to digital preservation in the context of digital publishing and open access. This includes research and discussion on the legal deposit and preservation system for digital publications, as well as studies on preservation strategies, technologies, and management policies. Additionally, it explores the challenges faced by libraries in the long-term preservation of digital publications and the measures taken to address these challenges. The core keywords include “open access”, “legal deposit”, “electronic publications”, etc.
Topic 4 primarily focuses on the digital preservation of personal information, with core keywords including “personal information”, “electronic records”, “personal practices”, etc. Initially, digital preservation targeted public domain content such as library collections or digitized journals and manuscripts. With the rapid development of information technology, personal digital information has witnessed explosive growth, leading to increased attention from researchers on the preservation of personal digital information. The digital preservation of personal information inherently involves complexities. On one hand, personal information lacks standardized descriptions and comes in diverse types, making it challenging to organize it in a reasonable and regulated manner. On the other hand, different countries, regions, and even different areas within the same country have varying regulations regarding personal information and privacy protection. This results in differing technical and security requirements for preserving personal information. Compared to other digital content, preserving personal information faces greater challenges. Related research includes willingness, approaches, and practices for preserving personal digital information.
Topic 5 primarily focuses on risk assessment. The core keywords include “assessment frameworks”, “national stewardship alliance”, etc. Its main areas of concern are the risks faced by digital objects, the evaluation of digital preservation environments, the effectiveness of digital preservation strategies, and the practices promoted by national alliance organizations to address risks and ensure the effective preservation of digital content.
Topic 6 focuses on the preservation of cultural heritage, with core keywords including “cultural heritage”, “intangible cultural heritage”, “world heritage”, etc. Research in this topic mainly includes strategies and practices for digitizing various types of cultural heritage such as ancient books, manuscripts, temples, etc., as well as experiences and cases of digital preservation of cultural heritage from various countries and regions. Relevant research institutions include museums, art galleries, and other cultural heritage preservation institutions. With the continuous development of technology, researchers have begun to pay attention to the long-term preservation of cultural heritage in the digital environment, and the research focus continues to expand.
Topic 7 is about website preservation, with core keywords including “web resources”, “web pages”, etc., and focuses on the preservation of internet data. The preservation of websites is inherently complex, involving issues such as content format diversity, naming problems, and content copyright ownership. The research questions of interest in this topic include how to preserve websites, which strategies to use, what software or tools can be used for website preservation, the cost of preservation, and access strategies for preserved websites.
This study uses a line chart to analyze the trends of the topics. In the chart, colors and markers distinguish the topics, making the data easier to read and understand. The x-axis represents the years, while the y-axis represents the number of publications. The eight topics with the highest number of publications were selected for trend analysis.
From the chart, it can be observed that Topic 0 (data preservation) originated in 2003, which is relatively late, but quickly gained attention from preservation institutions and the academic community, with research interest peaking in 2016. The number of studies has been declining since 2020. Whether this decline is related to the COVID-19 pandemic and the limitations on academic exchange activities requires further analysis.
Topic 1 and Topic 5 exhibit a similar development trend, both starting around 1998-1999, with a rapid increase in research interest post-2003, reaching a peak around 2013, followed by a declining trend. These two topics, concerning risk assessment and effective preservation of digital content, logically align well, and it is reasonable for them to exhibit similar trends.
Topic 2, relating to 3D cultural heritage modeling and preservation, has garnered more attention since 2006, maintaining a consistent level of research interest, and it remains one of the more focused research topics.
Topic 3 started relatively early, reaching a peak in research interest around 2012, after which its attention gradually declined and has remained at a relatively low level. Topic 7 (Web preservation) peaked in 2007 and then declined, with very little attention post-2015. This may be related to the increasing complexity of website data and the rising legal requirements for data and privacy protection.
Topic 4 (digital preservation of personal information) originated in 2000 and has received more attention since 2015, maintaining a high level of interest. This can be considered an emerging topic, likely due to the explosive increase in personal information and the growing concern for personal information protection and privacy from various sectors of society.
Topic 6 concerning cultural heritage preservation began to receive more attention in 2015, with a continuous rise in research interest. This indicates an increasing awareness of the importance and urgency of cultural heritage preservation, benefiting better protection and ongoing transmission of cultural heritage.
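The counts behind such a trend chart amount to tallying publications per topic and year. The following minimal sketch shows that tally and how a peak year can be read from it. The topic labels and years are made-up toy data, and the tie-breaking rule (earliest year wins) is an assumption. BERTopic also provides a `topics_over_time` method for this kind of analysis; whether it was used here is not stated.

```python
from collections import Counter, defaultdict


def yearly_counts(assignments):
    """Tally publications per topic and year from (topic, year) pairs."""
    counts = defaultdict(Counter)
    for topic, year in assignments:
        counts[topic][year] += 1
    return counts


def peak_year(year_counts):
    """Return the year with the most publications; the earliest year wins ties."""
    return min(year_counts, key=lambda year: (-year_counts[year], year))


# Made-up (topic, publication year) pairs, one per paper.
papers = [
    ("A", 2015), ("A", 2016), ("A", 2016), ("A", 2020),
    ("B", 2007), ("B", 2007), ("B", 2010),
]

counts = yearly_counts(papers)
peaks = {topic: peak_year(per_year) for topic, per_year in counts.items()}
print(peaks)
```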
To understand the potential hierarchical structure of topics, hierarchical clustering was performed on the relationships between topics, and the clustering results were visualized, as shown in Figure 4. It intuitively shows the relationships between topics at different levels. Topic 0 and Topic 1 are closely and directly related. Data preservation inherently involves developing effective preservation models and data description frameworks. The preservation of data relies on corresponding metadata description schemes and preservation architectures, which aligns with our understanding. Topic 2 and Topic 6 are also closely related; these two topics explore the long-term preservation of cultural heritage from different perspectives. Topic 2 focuses more on the technical and practical aspects of digitizing and digitally preserving specific types of cultural heritage. Further clustering of Topic 0 and Topic 1 shows a direct link to Topic 5, indicating that whether it is data preservation, metadata schemes, or preservation systems, all involve the preservation of digital objects, necessitating risk assessment and management. The hierarchical clustering results are consistent with our understanding and confirm the effectiveness of BERTopic's topic identification and clustering results.
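The general mechanism of hierarchical (agglomerative) clustering can be sketched as follows: start with each topic as its own cluster and repeatedly merge the two closest clusters. This example uses average linkage over cosine distance on made-up 3-dimensional topic vectors labeled A to E. These choices are assumptions for illustration and may differ from the linkage and topic representations used by BERTopic's `hierarchical_topics` or by the authors.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 minus cosine similarity; small values mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def agglomerate(vectors):
    """Average-linkage agglomerative clustering.

    vectors maps a label to its vector. Returns the merges in the order
    they happen, each as a pair of member tuples.
    """
    clusters = [(label,) for label in vectors]
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            # Average distance over all cross-cluster member pairs.
            left, right = pair
            dists = [cosine_distance(vectors[i], vectors[j]) for i in left for j in right]
            return sum(dists) / len(dists)

        left, right = min(combinations(clusters, 2), key=linkage)
        merges.append((left, right))
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
    return merges


# Made-up topic vectors (assumption): A, B, C lean one way; D, E another.
topic_vectors = {
    "A": [0.9, 0.4, 0.0],
    "B": [0.8, 0.5, 0.1],
    "C": [0.6, 0.6, 0.3],
    "D": [0.0, 0.2, 0.9],
    "E": [0.1, 0.3, 0.8],
}

merges = agglomerate(topic_vectors)
for left, right in merges:
    print(left, "+", right)
```

With these toy vectors, A and B merge with each other, as do D and E, before C joins the A/B group. The resulting nesting of closely related pairs inside broader groups is the kind of structure a topic dendrogram displays.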
Using BERTopic to visualize all topics, an interactive graph was generated as shown in Figure 5, where each circle represents a topic, and its size indicates the frequency of appearance of that topic in all documents. The distance between circles represents the similarity between topics, with closer distances indicating higher similarity between topics. From the graph, all research topics can be divided into 5 topic clusters (this study considers topics with close distances as a topic cluster), with topics within each cluster relatively concentrated, while topics between different clusters are distantly separated, indicating significant semantic differences in research content between different topic clusters and suggesting further exploration of cross-cluster research.
This study conducted a topic analysis of the digital preservation field using the BERTopic model and drew the following conclusions based on the research results:
From the perspective of research topics, data preservation is the most highly regarded topic in the field of digital preservation. This indicates the importance placed on data by various institutions and researchers. Although interest in data preservation has slightly declined in recent years, the surge in large language models highlights the critical importance of data once again. It can be predicted that research on data preservation will experience a resurgence. For institutions and individuals, engaging in data preservation practices and research will serve as an excellent entry point into the field of digital preservation. Models and metadata frameworks related to digital preservation also receive significant attention, underscoring the efforts and determination of institutions and researchers to promote the standardization, normalization, and high-quality development of digital preservation.
Analyzing the development trends of these topics, the peak publication years for different research topics vary, indicating fluctuations in the level of attention each topic receives. This variability may be influenced by the maturity of the current research topics, developments in information technology, changes in legal and policy frameworks, among other factors. Additionally, the possibility that some institutions and researchers are following trending topics cannot be ruled out.
Digital preservation of personal information and cultural heritage has seen a steady rise in attention in recent years, marking them as emerging and popular research topics. This trend is believed to be associated with the rapid advancement of information technology, the growing awareness among relevant institutions regarding the preservation of personal information and cultural heritage, and the ability of information technology to better support these areas.
This study has some limitations. Firstly, regarding data selection, this study only selected literature articles, neglecting other types of documents such as books, patents, reports, etc. In the future, we will conduct further research to include more types of documents to improve the coverage of topic identification. Secondly, when conducting literature retrieval, this study only used commonly used terms in the field of digital preservation. Future-focused projects or repositories related to digital preservation were not included in the analysis scope, which may have affected the analysis of topic development trends. In the future, we will attempt to include more research content for a more comprehensive analysis of topic development trends.
The data for this research is available on GitHub at https://github.com/deeplearninghowto/dp_bibliographic.