Abstract – Working with Macintosh files from the 1980s to the early 2000s can be challenging: they are often stored on difficult-to-access disks, use unique encodings, have resource forks, and often lack extensions. Identification of files from a Macintosh-formatted disk can be accomplished with identification methods such as PRONOM, but what if those standard methods come up empty or are misleading? This paper investigates a method of identification which can provide better results than existing methods. The Macintosh operating system used special extended attribute codes to identify not only the file type but also the original creating application. Called Type / Creator codes, these extended attributes are often lost or hidden from the user during preservation activities but can provide important information for long-term preservation. These codes, along with resource forks, can provide necessary information for emulation and migration activities.
Keywords – MacOS, Forensics, Identification, Tools
This paper was submitted for the iPRES2024 conference on March 17, 2024 and reviewed by Leontien Talboom, Klaus Rechert and 2 anonymous reviewers. The paper was accepted with reviewer suggestions on May 6, 2024 by co-chairs Heather Moulaison-Sandy (University of Missouri), Jean-Yves Le Meur (CERN) and Julie M. Birkholz (Ghent University & KBR) on behalf of the iPRES2024 Program Committee.
The identification of file formats is an essential part of ensuring their long-term preservation. Many tools are available which use varying methods to identify the countless file formats which may exist. These methods often disagree on the format of a file presented to them. The most successful identification usually occurs by having a known specification and inspecting the file’s byte stream for patterns consistent with the specification. Other tools may use a best guess or rely on the extension alone. These methods can work well for common files, but more obscure formats require someone to research these patterns and enter them into a registry such as PRONOM [1]. The byte stream can be easily interpreted if a specification is available; in the case of a proprietary format, assumptions can be made. Even with all the methods available, we often find files which cannot be accurately identified.
For the majority of file systems, a file contains, within a byte stream, all the data needed to open and render the information. More complex formats may require additional files for complete rendering, but in the case of older Macintosh systems, files may need additional hidden resource forks to render properly [2]. Files on Macintosh systems often had hidden attributes in the form of a resource fork and/or additional Finder data, which all contributed to opening and rendering the file properly. Add to this the fact that the majority of files on a Macintosh never used an extension, and these attributes become essential. They are easily lost or ignored, which causes many problems with standard workflows for processing born-digital content. [3]
One of the attributes contained in the Finder information is the Type / Creator code. Early Macintosh systems did not use the common DOS/Windows method of identifying files by extension; they used a four-character code to designate the type of file and another four-character code to indicate the creating application which wrote the file. The Creator code had to be registered with Apple to avoid conflicts. [4] It was then added to every file used or written by the application. The application could then define one or more Type codes to further distinguish the file formats it used. The database of these Creator codes was never released to the public, but there were efforts to document as many as could be discovered. One such database, the TCDB created by Ilan Szekely, contains more than 44,000 combinations of Type and Creator codes, making it the biggest database of Macintosh-related format identification available. [5] By using this database with other modern tools, identification of files from Macintosh HFS-formatted disks can be made quickly and accurately.
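As a minimal sketch of how such a code is represented: each code occupies four bytes, which can be read either as text or as a 32-bit big-endian integer. The helper names below are hypothetical, the text is assumed to be Mac OS Roman, and the sample pair is illustrative only.

```python
import struct


def code_to_int(code: str) -> int:
    """Pack a four-character code into its 32-bit big-endian integer form."""
    raw = code.encode("mac_roman")  # assumption: codes are Mac OS Roman text
    if len(raw) != 4:
        raise ValueError("a Type or Creator code must be exactly four bytes")
    return struct.unpack(">I", raw)[0]


def int_to_code(value: int) -> str:
    """Turn the 32-bit integer form back into the readable four-character code."""
    return struct.pack(">I", value).decode("mac_roman")


# Illustrative pair: plain text ('TEXT') written by SimpleText ('ttxt').
file_type, creator = "TEXT", "ttxt"
print(hex(code_to_int(file_type)))  # 0x54455854
print(int_to_code(0x74747874))      # ttxt
```

Because the codes are fixed-width and case-sensitive, a (Type, Creator) pair works well as a lookup key without any normalization.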
Apple dropped support for HFS, so accessing the disks directly on a modern Macintosh is difficult. Depending on your processing workflow, access can be accomplished using a variety of tools, discussed below. The Type / Creator codes are a hidden attribute, so finding this information is paramount when processing a Macintosh HFS disk. This can be done using tools available for many platforms, and current versions of macOS may still retain these hidden codes. Once the Type and Creator codes are accessed, a lookup in the TCDB can quickly identify the format and the application used to create the file. The following is a list of helpful tools that allow you to gather the Type / Creator codes and access, image, export, or perform other preservation actions on Macintosh files.
When working with older Macintosh disks, especially the 800k DD/DS floppy disks, it is often easier to use a vintage Macintosh to create an image of the disk and transfer the image to a modern computer. Many have also found success with boards such as the KryoFlux, Greaseweazle, or FluxEngine. If you are working on vintage hardware or emulating the classic Mac OS to access a disk image, the most common tool to use is ResEdit. This resource editor allows the viewing and editing of the resource fork of a file. It can also display an information panel about the file which includes the Type and Creator.
Another tool which works well on multiple platforms is HFSExplorer. It is a Java program which can access a physical disk or a disk image, display the contents, extract them, or display information such as the Type / Creator codes. It comes with a supporting tool called unHFS, which can export the contents of a disk very quickly and, when configured properly, can also retain extended attributes such as the resource fork and Finder information in an AppleDouble file.
FTK Imager is a forensic tool used for making disk images. It is made for Windows users, but there are command-line versions of the tool. The tool can access physical disks and process disk images, including HFS disks. It allows the viewing of file properties, which include the Type / Creator codes.
IsoBuster is a Windows-only tool which, when licensed, can properly access, export, or display extended attributes, including the Type / Creator codes. IsoBuster is a great tool, especially for handling hybrid CD-ROM disc formats. It gives you the ability to export the HFS partition or the more common ISO 9660 data.
Hex Fiend is a simple hex editor and viewer, which is indispensable on the Macintosh. More recent versions can open the Finder Info and resource fork attributes of any file, if they exist.
Using the command-line interface (CLI) to access attributes is another way to gather information. These commands can often be used on a Macintosh or Linux system.
To list the attributes of a file:
ls -l@ file
xattr -l file
GetFileInfo file
To get specific info on the Type / Creator codes:
xattr -p com.apple.FinderInfo file
GetFileInfo -t file
GetFileInfo -c file
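The hex output of the xattr command can be decoded with a few lines of Python. The following is a rough sketch, not a definitive implementation: the function names are hypothetical, the sample dump is hand-built rather than captured from a real file, and the `-x` flag is used only to force hexadecimal output. It relies on the first eight bytes of the Finder information holding the Type code followed by the Creator code.

```python
import subprocess


def parse_finder_info_hex(hex_dump: str) -> tuple[str, str]:
    """Extract (type, creator) from the hex text printed by `xattr -p`."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    if len(raw) < 8:
        raise ValueError("FinderInfo is too short to hold Type and Creator codes")
    # Bytes 0-3 hold the Type code, bytes 4-7 the Creator code.
    return raw[0:4].decode("mac_roman"), raw[4:8].decode("mac_roman")


def read_type_creator(path: str) -> tuple[str, str]:
    """macOS only: run xattr on a file and decode its FinderInfo attribute."""
    result = subprocess.run(
        ["xattr", "-p", "-x", "com.apple.FinderInfo", path],
        capture_output=True, text=True, check=True,
    )
    return parse_finder_info_hex(result.stdout)


# Hand-built sample dump (an assumption for illustration): 32 bytes of FinderInfo.
sample = (
    "50 4E 47 66 38 42 49 4D 00 00 00 00 00 00 00 00\n"
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00\n"
)
print(parse_finder_info_hex(sample))  # ('PNGf', '8BIM')
```

Keeping the parsing separate from the subprocess call means the same function can be reused on hex dumps saved from another machine.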
To access the entire contents of a disk image, the HFSUtils suite of tools is very helpful.
hmount diskimage, then hdir -R
to recursively list the contents which includes the Type / Creator and resource fork information.
Many of these tools allow for exporting the disk contents to a data file and an AppleDouble file, or to a MacBinary / AppleSingle file, which will include all the extended attributes. The Type / Creator codes reside in a known location within these formats, so the format they contain can still be identified. If you choose to use the MacBinary / AppleSingle format to preserve files with their resource forks, or come across other archiving formats such as StuffIt, BinHex, Compact Pro, and others, The Unarchiver can be used to look inside and gather format information.
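To illustrate what a known location means in practice, here is a hedged sketch that reads the codes from an AppleSingle / AppleDouble container and from a MacBinary header. The offsets follow the published format layouts as I understand them (AppleDouble entry ID 9 for Finder Info and ID 2 for the resource fork; MacBinary Type at byte 65, Creator at byte 69, resource fork length at byte 87). The function names are hypothetical and the sample data is synthetic.

```python
import struct

APPLESINGLE_MAGIC = 0x00051600
APPLEDOUBLE_MAGIC = 0x00051607
RESOURCE_FORK_ENTRY = 2
FINDER_INFO_ENTRY = 9


def read_apple_container(data: bytes) -> dict:
    """Read Type, Creator and resource fork size from AppleSingle/AppleDouble."""
    if len(data) < 26:
        raise ValueError("too short for an AppleSingle / AppleDouble header")
    magic = struct.unpack(">I", data[:4])[0]
    if magic not in (APPLESINGLE_MAGIC, APPLEDOUBLE_MAGIC):
        raise ValueError("not an AppleSingle / AppleDouble file")
    # Header: magic (4), version (4), filler (16), number of entries (2).
    count = struct.unpack(">H", data[24:26])[0]
    info = {"type": None, "creator": None, "resource_fork_size": 0}
    for i in range(count):
        start = 26 + 12 * i  # each entry descriptor is 12 bytes
        entry_id, offset, length = struct.unpack(">III", data[start:start + 12])
        if entry_id == FINDER_INFO_ENTRY and length >= 8:
            info["type"] = data[offset:offset + 4].decode("mac_roman")
            info["creator"] = data[offset + 4:offset + 8].decode("mac_roman")
        elif entry_id == RESOURCE_FORK_ENTRY:
            info["resource_fork_size"] = length
    return info


def read_macbinary(data: bytes) -> dict:
    """Read the same fields from a 128-byte MacBinary header."""
    if len(data) < 128 or data[0] != 0 or not 1 <= data[1] <= 63:
        raise ValueError("does not look like a MacBinary header")
    rsrc_len = struct.unpack(">I", data[87:91])[0]
    return {
        "type": data[65:69].decode("mac_roman"),
        "creator": data[69:73].decode("mac_roman"),
        "resource_fork_size": rsrc_len,
    }


# Synthetic AppleDouble file: Finder Info entry plus a 10-byte resource fork.
header = struct.pack(">II16xH", APPLEDOUBLE_MAGIC, 0x00020000, 2)
entries = struct.pack(">III", FINDER_INFO_ENTRY, 50, 32)
entries += struct.pack(">III", RESOURCE_FORK_ENTRY, 82, 10)
sample = header + entries + b"PNGf8BIM" + bytes(24) + bytes(10)
print(read_apple_container(sample))
```

The MacBinary check here is deliberately loose; a stricter reader would also validate the header CRC used by later MacBinary versions.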
With modern systems, an extension is used to identify a format and assign it to a supporting application. Changing the supporting application is a trivial task, but the change applies to all files with that extension. In some instances, one file of a given format may render better in one application while another renders better in a different application. Take the PNG format. Portable Network Graphics files are raster images, but the specification allows private chunks to be added to the format. Macromedia/Adobe Fireworks was one of a few applications that used private chunks to enhance the format, giving it the ability to hold more than one page and layer. A PNG created by Fireworks will have the same Type code as a PNG created by Photoshop, but it will have a different Creator code. Each will open in the correct application when accessed.
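The private chunks mentioned above can be spotted by walking the chunk list of a PNG. The sketch below assumes the standard PNG layout (an 8-byte signature, then chunks of length, type, data, and CRC) and the naming convention that a lowercase second letter marks a private chunk. The chunk name `prVt` and the helper names are made up for illustration and are not taken from any real application.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def list_chunks(data: bytes) -> list[str]:
    """Return the chunk type names of a PNG byte stream, in file order."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    names = []
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        names.append(chunk_type.decode("latin-1"))
        pos += 12 + length  # length field + type + data + CRC
    return names


def private_chunks(data: bytes) -> list[str]:
    """Keep only chunks whose second letter is lowercase (the private bit)."""
    return [name for name in list_chunks(data) if name[1].islower()]


def make_chunk(chunk_type: bytes, body: bytes) -> bytes:
    """Build one well-formed chunk; used only to create the synthetic sample."""
    crc = zlib.crc32(chunk_type + body)
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


# Synthetic sample: a 1x1 greyscale header, one hypothetical private chunk, and IEND.
sample = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + make_chunk(b"prVt", b"application-specific data")
    + make_chunk(b"IEND", b"")
)
print(list_chunks(sample))     # ['IHDR', 'prVt', 'IEND']
print(private_chunks(sample))  # ['prVt']
```

A viewer that ignores private chunks will still display such a file, which is why identification by signature alone reports plain PNG.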
I compiled a set of files to test identification in PRONOM and then the Type / Creator database. [7]
The files selected are a mixture of formats which are common and less common, some with an extension and others without. Some have extensions that are associated with the wrong software, while others are simply listed as a document. This test data set represents a possible set of files found on a Macintosh formatted HFS disk. See Fig. 1
The set was analyzed by the DROID v6.7.0 tool which was set to use the PRONOM v116 signature release.
The set was then analyzed using the xattr CLI tool to gather the Type / Creator codes, which were matched to the corresponding database entries. See Fig. 2
The tests show the PRONOM registry was able to identify 7 of the 11 files in the sample set. One was identified only by extension, meaning PRONOM has no binary signature for the format, only a registered extension. Two files show as zero-byte files but have resource forks; these are identified through the Type / Creator code. There are also three PNG files in the sample set. Each was created by a different application, but PRONOM identified all three only as PNG. The Type / Creator method shows that each has a unique creator, thus allowing for better rendering and preservation.
I would like to propose the use of Type / Creator codes as a standard method of identification. I believe current identification methods can be updated or supplemented to make use of these existing sources in an automated way. Using Python and a couple of libraries, I would like to present a simple script [8] for identification using this method. The Python tool takes a file as input, then uses the xattr library to gather the Type / Creator codes. This step also gathers the sizes of the data and resource forks, then performs a lookup in a CSV file containing a list of known codes. It then outputs the associated application and other details contained in the source.
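The lookup step can be sketched as follows. This is not the script cited in [8]; it is a minimal illustration under stated assumptions: the CSV column names (`type`, `creator`, `application`), the sample rows, and the function names are all hypothetical. The Finder information bytes are passed in directly, so the example runs on any platform; reading them from a live file is left out.

```python
import csv
import io

# Hypothetical CSV layout and illustrative rows; a real table would be far larger.
SAMPLE_CSV = """type,creator,application
TEXT,ttxt,SimpleText
PNGf,8BIM,Adobe Photoshop
"""


def load_table(handle) -> dict:
    """Index the CSV rows by their (type, creator) pair."""
    return {(row["type"], row["creator"]): row for row in csv.DictReader(handle)}


def identify(finder_info: bytes, table: dict) -> dict:
    """Decode the codes from FinderInfo bytes and look them up in the table."""
    if len(finder_info) < 8:
        raise ValueError("FinderInfo must contain at least eight bytes")
    file_type = finder_info[0:4].decode("mac_roman")
    creator = finder_info[4:8].decode("mac_roman")
    match = table.get((file_type, creator))
    return {
        "type": file_type,
        "creator": creator,
        # None signals that the pair is not in the table.
        "application": match["application"] if match else None,
    }


table = load_table(io.StringIO(SAMPLE_CSV))
print(identify(b"PNGf8BIM" + bytes(24), table))
print(identify(b"????????" + bytes(24), table))
```

Returning the decoded codes even when no match is found keeps unknown pairs visible, so they can be researched and added to the table later.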
There does not seem to be a perfect solution or single process which can correctly identify every file format needing preservation treatment. All we can do is use best practices to discover as much as we can, using available tools and using every piece of information from the files that we can. [9]
The identification method and script discussed in this paper have the potential to add to the available tools. They fill a gap in Macintosh-specific format identification. With more support from the community, they could be enhanced to identify even more formats.