SAA 2015: Session 308, Just Take Those Old Records Off the Shelf: Reconciling Legacy Digital Content with Current Preservation Practice

In advance of the 2015 Annual Meeting, we invited SNAP members to contribute summaries of panels, roundtable and section meetings, forums, and pop-up sessions. Summaries represent the opinions of their individual authors; they are not necessarily endorsed by SNAP, members of the SNAP Steering Committee, or SAA.

Guest Author: Michael Barera, Archivist at Texas A&M University-Commerce

The speakers for this session were Brian Wilson (from The Henry Ford), Katelynd Bucher (from the National Institute of Standards and Technology), and Kat Hagedorn (from University of Michigan Library). The session chair was Lance Stuchell (also from the University of Michigan Library).

Stuchell began by welcoming the audience and then outlining what he saw as the themes among the speakers: legacy materials vs. new materials (is there a competition for resources, time, etc.?), applying lessons learned (are we making the same mistakes?), and the idea that digital preservation is iterative (to which he asked everyone, “We cool with that? Really?”).

Wilson was the first to present, and he began by displaying a Welcome Back, Kotter! lunchbox from The Henry Ford; the image was captured on color film and then transferred digitally to Kodak PhotoCD. Together, The Henry Ford’s PhotoCDs hold over 40,000 images. Their genesis goes back to a 1980s NEH grant designed to shift the museum’s Laser Disc holdings to PhotoCD. Then, after going through planning stages in 1993 and 1994, the Digital Image Library Project ran from 1994 to 2003; it resulted in 1,229 images (about 43,000 frames), 448 PhotoCDs, and 5 binders of image inventories. More recently, a recovery project for the PhotoCDs has become necessary; they were identified as “at risk” in 2010, the processes and tools for migrating them were developed in 2011 and 2012, and work on the actual migration started in 2013, with the goal of extracting and storing the highest resolution image possible on CD. The Henry Ford’s workflow for this project includes two staff members and three tools (IrfanView, Photoshop, and a file viewer). Wilson then displayed the sequential naming scheme of the PhotoCD data and an IrfanView screenshot before addressing identification, which occurs at two different levels: the disc (with CD ID number, film roll number, object ID number, and object description) and the inventory sheets (which are handwritten and arranged in binders by film roll number). He then noted the state of description, or lack thereof; many records have no title, date, or creator. Wilson then discussed image quality, noting that “in general, the CD images are usable”, although it is lower for CDs, especially for archival objects. So far, there have been a number of recovery results: over 41,000 images extracted (over 80% of the total), 36,000 added to the catalog, and 2,300 now online; in Wilson’s words, “image recovery [is] far more cost effective than re-imaging”.
He concluded by touching on a few other issues, including the preservation environment (where he noted that all extracted files are on spinning disk), the fact that the preservation of these images has been evolving since the 1980s, cautions from the past about the then-unknown long-term quality of digital technologies like PhotoCD (quoting a 1994 PC Magazine article), and The Henry Ford’s focus on continuous improvements.

The second presentation was “Bringing the Past into the Present: the NIST Legacy Publications”, by Katelynd Bucher of NIST. Bucher began by describing NIST, the National Institute of Standards and Technology, which is a non-regulatory federal agency with 3,000 science and technology researchers that promotes U.S. innovation and industrial competitiveness; it also has library and publishing arms, although all of its publications are originally published by the GPO. NIST has numerous legacy collections to which it would like to increase access. The “drivers” for NIST are, according to Bucher, increased demand, the trend toward digitization, and a NIST memo on increasing access. She then outlined the NIST Technical Series Publications, which were first published in 1902, cover 92 years in total, and are made up of approximately 37,000 publications (of which 24,000 are still to be digitized). The selection and planning phase of the process included conducting a survey, creating a complete index (as none existed previously), and addressing dissemination restrictions and time management concerns. Bucher then discussed the Internet Archive, which is the subcontractor for NIST’s digitization process; she described IA as “wonderful” and noted that it is committed to keeping the digital surrogates online “in perpetuity”. She then outlined the Federal Digital System (FDsys), which observes metadata standards, addresses digital preservation, conducts a self-audit using TRAC, is recognized for government publications, and allows for authentication and digital signatures. From here, she discussed FDsys implementation, which has consisted of testing (and more testing), decisions (regarding standard vs. custom collections), and navigation (which has been modified to include landing pages, which constitutes [in Bucher’s words] “a big improvement”).
She concluded by discussing the finished product (by displaying screenshots demonstrating search capabilities and completed records), customer feedback (which she described as being “very positive” and increasing customer demand for digital publications), and what is “up next” for NIST (namely, FDsys NextGen and simply “more digitizing and deposits”).

The third and final presentation was “Migrating First-Generation Digital Texts from Local Collection to HathiTrust”, by Kat Hagedorn of the University of Michigan Library. She began with an overview of the process, which has existed for six years (since 2009), although stricter validation has been in place for only the past three. From here, Hagedorn gave an overview of both DLXS (which has content from 1995 to the present and includes text, images, bibliographies, and finding aids) and HathiTrust (which is a consortium of over 100 research institutions and libraries that is both TRAC certified and part of the Google digitization project). For the two repositories, she noted that some things are similar (such as file formats, structure, staff, and content to be migrated), although HathiTrust has “stretched technical requirements and procedures”. Back in 1995, while long-term preservation was indeed in mind, “they did the best they could with what they had and what existed at the time”; she compared this to the “Weasley House effect” in Harry Potter. From here, she moved on to the “pain point”: while 95% of materials have been easily ingested, 5% have had problems. In Hagedorn’s words, it is “not really broken”, but rather about meeting standards of preservation as-is. For the University of Michigan, there have been a number of specific types of problems, including character errors, mismatched bitonal/contone (“continuous tonal”) images, and sequence skips. In her work, she has found that some types of material are especially difficult to work with and thus should be “touched last”: among these are a batch of materials ingested early on, items with permission/ownership issues, separate contones, and highly-encoded volumes. The “final” result: of 167,000 total volumes, 27,000 have been migrated (16%), while 20 are mostly ingested and 29 waiting.
Hagedorn noted that the whole process has been “hard, but interesting”; it has been intriguing (in terms of the extent of problems), annoying, and surprising. She then tried to answer the question “how did we do it?”, noting that a “small percentage of files can’t be accessed [and thus] may need to be rescanned” and that “most errors happened in digitization or package creation/loading (not file degradation)”. From here, she asked the question “are we better now?”, coming to the emphatic conclusion “yeah, I’d say so”; furthermore, the migration project “offers an opportunity to evaluate how we did, where we have gone, and where we are now”. In conclusion, she argued that “preservation is iterative…not static”, and noted three key facets of the process: documentation (“important, but harder at scale”), evolution (“metadata/supporting elements will continue to evolve”), and decision-making (“balancing the ideals against what we can do; not a lot of time for thinking twice”).

Questions from the audience concerned the following topics:

  • Decisions about keeping originals: According to Hagedorn, “we [the University of Michigan] don’t destroy unless absolutely necessary”, and neither does The Henry Ford; however, NIST is “actually getting rid of copies”.
  • Tension between “legacy stuff” and creating new content: at Michigan the latter has higher priority, at The Henry Ford they piggyback off each other and sometimes compete, and NIST runs new and legacy content in parallel.
  • Quality assurance/quality control and low-resolution TIFFs: Michigan has a formal QA/QC process for locally sourced materials [Google has its own automated process], although it doesn’t catch everything; The Henry Ford has an access-driven digitization program, and the lower-resolution TIFFs it has serve it pretty well; and NIST has really strict [and good, it seems] QA/QC processes in place at both IA and FDsys.
  • Born-digital material processed under a previous standard: Hagedorn hasn’t been involved with born-digital materials, while Stuchell noted that “this kind of work has to happen…institutions have to understand that things change”.
