An Open Source Infrastructure for Preserving Large collections of Digital Objects
Today’s libraries are curating large digital collections, indexing millions of full-text documents, and preserving Terabytes of data for future generations. This means that libraries must adopt new methods for the processing of large amounts of data. And this is exactly where the SCAPE project (www.scape-project-eu) comes into play. The SCAPE project offers an open source infrastructure, as well as a variety of tools and services for the distributed processing of large data sets with a focus on long-term preservation.
In this project context, we are here presenting an open source infrastructure for preserving large collections of digital objects created at the Austrian National Library for quality assurance tasks as part of the management of a large digital book collection. We describe the experimental cluster hardware and the software components used for creating the infrastructure. More concretely, we will show a set of best practices for the data analysis of large document image collections on the basis of Apache Hadoop. Different types of hadoop jobs (Hadoop-Streaming-API, Hadoop MapReduce, and Hive) are used as basic components, and the Taverna workflow description language and execution engine (www.taverna.org.uk) is used for orchestrating complex data processing tasks.