An Open Source Infrastructure for Preserving Large Collections of Digital Objects
Today’s libraries curate large digital collections, index millions of full-text documents, and preserve terabytes of data for future generations. To do so, they must adopt new methods for processing large amounts of data, and this is exactly where the SCAPE project (www.scape-project.eu) comes into play. SCAPE offers an open source infrastructure, together with a variety of tools and services, for the distributed processing of large data sets with a focus on long-term preservation.
In this context, we present an open source infrastructure for preserving large collections of digital objects, created at the Austrian National Library for quality assurance tasks as part of managing a large digital book collection. We describe the experimental cluster hardware and the software components used to build the infrastructure. More concretely, we show a set of best practices for analysing large document image collections on the basis of Apache Hadoop. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive) serve as basic components, and the Taverna workflow description language and execution engine (www.taverna.org.uk) is used to orchestrate complex data processing tasks.
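To give a flavour of how the Hadoop Streaming API fits such quality assurance work, here is a minimal sketch of a mapper/reducer pair. The task it models (counting page image files per file extension across a collection listing) is a hypothetical illustration, not one of the project's actual tools; the function names are ours.

```python
#!/usr/bin/env python
# Sketch of a Hadoop-Streaming-style mapper/reducer pair. Hypothetical
# QA task: count page image files per extension, given one file path
# per input line (e.g. a listing of a digitised book collection).
import sys
from itertools import groupby

def mapper(lines):
    """Emit (extension, 1) for each non-empty file path line."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        ext = path.rsplit(".", 1)[-1].lower() if "." in path else "none"
        yield ext, 1

def reducer(pairs):
    """Sum counts per key. Expects input sorted by key, which the
    Hadoop shuffle/sort phase guarantees between map and reduce."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

if __name__ == "__main__":
    # Streaming contract: tab-separated key/value lines on stdin/stdout.
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    if role == "map":
        for k, v in mapper(sys.stdin):
            print("%s\t%d" % (k, v))
    else:
        split = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for k, v in reducer((k, int(v)) for k, v in split):
            print("%s\t%d" % (k, v))
```

Such a script would typically be submitted via the Streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -files qa.py -mapper "qa.py map" -reducer "qa.py reduce" -input paths.txt -output counts`, so that the same logic runs distributed across the cluster.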