BitCurator Users Forum (BUF) 2016

The BitCurator Users Forum was held in Wilson Library on the campus of the University of North Carolina at Chapel Hill on Friday, January 15th. This was my first time attending, so I wasn’t sure what happened at a BUF, but I came away smarter about digital issues in archiving – the best possible outcome.

Cal Lee gave an introduction to the forum and how he got involved in using forensics software in archival instruction. Several years back, he went to a presentation given by Simson Garfinkel on his tools for investigating media from donors, and this convinced Cal he needed to learn more – and teach his students these skills as well. Eventually, the BitCurator project was born, and today there are 22 members of the Consortium.

Sam Meister of Educopia then made sure we all knew the hashtag would be #BUF16, and I highly recommend scrolling through the Tweets on the topic if you’re interested in forensics. Further, each session has a Google Doc linked from the schedule.

The first panel looked at ways institutions were ingesting different types of files, including new tools and updated workflows. Porter Olsen from the University of Maryland kicked off the panel by asking how we could use BitCurator and other forensic tools to capture a more holistic picture of the World Wide Web. The web is mainly built on Linux, Apache, and MySQL. What we get when we capture using an Archive-It model is a flattened version, like a snapshot of how the web used to be. The further back you go, the more flattened it is. At UMD, Olsen has been going through two old, retired web servers that he retrieved from a storage closet to see what he could get off of them. Each of the two servers had SCSI drives, RAIDed together to create a logical volume.

The first issue was finding a write-blocker that would work with SCSI drives, though once that was accomplished, the imaging process could begin. By reconstructing these volumes, Olsen has been able to retrieve university departments’ websites that were thought lost. Another compelling reason for examining and capturing web servers is that websites are more than their presentation layers and should be understood in context with other items on the server, including other institutional websites, backups, directories, and development spaces. He’s created a Google Doc describing how to assess RAID images within BitCurator.
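The talk didn’t specify the array’s level or parameters, but as a simplified, hypothetical illustration of what “reconstructing a volume” involves, here is how a two-disk RAID-0 (striped) volume could be reassembled from per-disk image files. Real work would use tools like mdadm against loop devices rather than code like this; the function name and stripe size are my own assumptions.

```python
def reassemble_raid0(disk_images, stripe_size=64 * 1024):
    """Interleave fixed-size stripes from each disk image to rebuild the
    logical volume. Simplified illustration only: real arrays vary in RAID
    level, stripe size, and disk order, all of which must be discovered first.
    """
    volume = bytearray()
    offset = 0
    while True:
        stripes = []
        for img in disk_images:
            stripe = img[offset:offset + stripe_size]
            if stripe:
                stripes.append(stripe)
        if not stripes:  # every disk image is exhausted
            break
        for s in stripes:
            volume.extend(s)
        offset += stripe_size
    return bytes(volume)
```

With the wrong stripe size or disk order the result is garbage, which is why identifying the array’s geometry is the hard part of the job.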

After Olsen, Bertram Lyons of AV Preserve presented a demonstration of Exactly, which was hands down the most exciting part of the day for me because it has now been added to the workflow at my job. Exactly uses BagIt file packaging, creating a local copy of each package and sending email notifications at the completion of a successful transfer, including a manifest of files and hash values. This tool allows repositories to send donors a configured file that loads FTP information, email authentication, and customized metadata sets. That last part is truly beautiful: the repository can get the donor to fill in metadata information before the files are ingested. Also, small bags (under 10 MB) can be sent via email using Exactly. There’s already a Google group for the tool.
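The BagIt layout that tools like Exactly produce is simple enough to sketch. This is a minimal, hypothetical illustration (the function name is mine, and it covers only the payload directory and an MD5 manifest); real workflows would use the bagit-python library, which implements the full specification.

```python
import hashlib
from pathlib import Path

def make_bag(source_dir, bag_dir):
    """Copy files into data/ and write a BagIt-style MD5 manifest.

    Minimal sketch of the BagIt layout (bagit.txt + data/ + manifest-md5.txt);
    not Exactly's implementation.
    """
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for f in sorted(source_dir.rglob("*")):
        if f.is_file():
            rel = f.relative_to(source_dir)
            dest = data_dir / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(f.read_bytes())
            digest = hashlib.md5(dest.read_bytes()).hexdigest()
            manifest_lines.append(f"{digest}  data/{rel.as_posix()}")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag_dir / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")
```

The manifest is what makes the email notification useful: the repository can recompute each hash on receipt and confirm nothing changed in transit.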

Brian Dietz of North Carolina State University rounded out the panel with a discussion about disk imaging. Most of the time, archivists image because it’s been drilled into us to make the bit-level capture of the original item, but Dietz argued for using TAR. Audience members thought that ZIP might be better than TAR because of ZIP’s transparency, but either way, it meant skipping imaging. NCSU images on a case-by-case basis, which frees up time. BitCurator is still used in the workflows, but for virus scanning, FITS, and bulk_extractor.
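To make the tar alternative concrete, here is a hedged sketch (not NCSU’s actual workflow; the function name is mine) of packaging an accession as a tarball while still recording fixity, which is the main thing a bit-level image otherwise gives you for free.

```python
import hashlib
import tarfile
from pathlib import Path

def package_as_tar(source_dir, tar_path):
    """Package a directory of files as a tarball instead of a disk image,
    and record the tarball's SHA-256 in a sidecar file so fixity can still
    be verified later. Illustration only, under the assumptions above.
    """
    with tarfile.open(tar_path, "w") as tar:
        tar.add(source_dir, arcname=Path(source_dir).name)
    digest = hashlib.sha256(Path(tar_path).read_bytes()).hexdigest()
    Path(str(tar_path) + ".sha256").write_text(
        f"{digest}  {Path(tar_path).name}\n"
    )
    return digest
```

The trade-off the panel described holds either way: you lose deleted-file and filesystem-level evidence that an image preserves, but you skip imaging time for collections where that evidence doesn’t matter.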

Forum participants were then invited to give lightning talks, which included Matthew Farrell of Duke University talking about the amount of clicking required in BitCurator and how this had led him to settle on guidelines for electronic records that align with Duke’s processing levels for physical items. Don Mennerich of New York University discussed his institution’s ingest workflow, which uses libewf to validate E01 disk images. Next, Ben Goldman of Pennsylvania State University discussed the issues with forensic imaging when away from the repository. Also, he was part of the OCLC publication on outsourcing recovery agreements, which made him realize he didn’t want to outsource to vendors. Would it be possible for archivists or repositories to insource workflows and share hardware?

Elizabeth Charlton, archivist at the Marist Archives (NZ), discussed the use of BitCurator in a very small archive – with a very small local user community of three people in all of New Zealand. She is trying to build up the user community in her country and discussed the need for local meet-ups, a topic that came up again later in the forum.

Jarrett Drake of Princeton University discussed using BitCurator on the Morrison collection after 32 5 ¼ inch and 119 3 ½ inch “surprise!” floppies were found within the boxes donated to the university. By using forensic tools, he found 13 Social Security numbers, including Toni Morrison’s, within the disks’ contents. Doug White of the National Institute of Standards and Technology discussed using Retrode within the BitCurator environment to discover metadata locations, checksum algorithms, and error/fault descriptions on old video games.

Kam Woods of the University of North Carolina (and one of the PIs of the BitCurator Access Project) explained why there are so many releases of BitCurator: the software tools included within the environment frequently release their own fixes, leading to about one release a month for BitCurator right now. Euan Cochrane of Yale University discussed how Yale is using USB networks to add an additional network to the machine they use for disk imaging, which ensures no complications with the standard network. Kari Smith of the Massachusetts Institute of Technology shared her institution’s workflows for reformatting digital content, showing where forensic steps pop out when needed.

After a break for lunch, the forum held breakout sessions to discuss where access should happen, and I joined the metadata and processing group. Participants raised some interesting questions. With the Rushdie collection at Emory University, users can only access certain things. Does this mean their very expensive emulation project is merely a surrogate for paper? If the information is going to be locked down, might we as well be using microfilm? It doesn’t allow for digital humanities research. Further, special attention is given to certain authors, like Morrison at Princeton. Is this just reinforcing the canon, and should we not advocate for equitable access?

We cannot know the patrons’ research focus, so should we not give them the “bits and bytes?” If the collection is open to the public and researchers discover embarrassing stuff in the collections, what then? Is it the archivist’s job to prevent embarrassing information from being discoverable? As long as it fits within the processing guidelines and doesn’t violate FERPA or HIPAA, then no. Also, are there tools in place to scan the collection for private information? We’re more likely to find that in digital collections than in analog collections. Would a community-agreed-upon system for delivering metadata allow for minimal standard processing guidelines for electronic records? Finally, finding aids are structured like pieces of paper and we need a better structure, but what should that look like?
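On the question of tools for finding private information, bulk_extractor (mentioned earlier in the day) does exactly this kind of pattern scanning over raw bytes. As a heavily simplified, hypothetical illustration of the idea (real scanners validate context and cover many more PII types), a scan for SSN-like strings might look like:

```python
import re

# Simplified illustration of the pattern scanning tools like bulk_extractor
# perform; real scanners also check context and handle many PII formats.
SSN_PATTERN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")

def scan_for_ssns(raw_bytes):
    """Return (byte offset, matched string) pairs for SSN-like patterns."""
    return [(m.start(), m.group().decode()) for m in SSN_PATTERN.finditer(raw_bytes)]
```

Because it operates on raw bytes rather than files, this kind of scan can surface PII in deleted or unallocated space on a disk image, which is part of why the question matters more for digital collections than analog ones.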

After the breakout sessions, Euan Cochrane discussed integrating BitCurator and Preservica for monographs and audiovisual materials at Yale. The question had become what should be done within BitCurator and what should be done within Preservica, because the tools overlap. For Yale, that meant using The Sleuth Kit and fiwalk with BitCurator and doing everything else in Preservica, except creating records (which is done through ArchivesSpace). Other integrations were mentioned, though not discussed in depth.

At the end of the forum, during the wrap-up, discussion returned to local meet-ups, which might enable more BitCurator users to attend forum-like events. Hopefully this idea takes off, as I was able to apply elements I learned at the forum to my job almost immediately.