Catmandu: boost your data processing with library oriented ETL

By:

Nicolas Steenlant

Description:

When building any data-oriented application, a recurring task is to import data from various sources, map the fields to a common data model, and put it all into a database or search engine.
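A minimal sketch of that import, map, and store cycle, written in plain Python rather than Catmandu, might look like the following. The sample rows, the column names, the `FIELD_MAP` mapping, and the choice of an in-memory SQLite table are all assumptions made for illustration; none of them come from Catmandu.

```python
import csv
import io
import json
import sqlite3

# Invented sample data; the columns and values are placeholders only.
SOURCE_CSV = """Title,Author,Year
Sample Record One,A. Author,2001
Sample Record Two,B. Author,2002
"""

# Hypothetical mapping from source columns to a common data model.
FIELD_MAP = {"Title": "title", "Author": "creator", "Year": "date"}


def extract(text):
    """Import: yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def transform(records):
    """Map: rename each source field according to FIELD_MAP."""
    for record in records:
        yield {FIELD_MAP[key]: value for key, value in record.items()}


def load(records, connection):
    """Store: save each record as a JSON document in SQLite."""
    connection.execute("CREATE TABLE IF NOT EXISTS records (doc TEXT)")
    connection.executemany(
        "INSERT INTO records (doc) VALUES (?)",
        ((json.dumps(record),) for record in records),
    )
    connection.commit()


def run_pipeline(text=SOURCE_CSV):
    """Run all three steps and return the stored documents."""
    connection = sqlite3.connect(":memory:")
    load(transform(extract(text)), connection)
    rows = connection.execute("SELECT doc FROM records ORDER BY rowid").fetchall()
    return [json.loads(doc) for (doc,) in rows]


if __name__ == "__main__":
    for doc in run_pipeline():
        print(doc)
```

Even for two rows and three fields, most of the code is plumbing around the actual field mapping, and it would need to be rewritten for every new source format or target store.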

Stores such as MongoDB or ElasticSearch provide a developer-friendly API, but you still end up writing a lot of boilerplate or throwaway code. We tried to abstract this problem into a set of Perl tools called Catmandu, which can work with library data formats such as MARC, Dublin Core, and EndNote, protocols such as OAI-PMH and SRU, and repositories such as DSpace and Fedora.

In data warehousing these processes are called ETL (Extract, Transform, Load). Many (often heavyweight) tools exist for ETL processing, but none address typical library data models and services.

In this bootcamp we will provide an introduction to these tools. We will show how easy it is to import data and transform it with the help of a small domain-specific language (DSL). Storing and indexing become one-liners.

Audience:

Developers, sysadmins

Expertise:

Importing metadata from various sources, transforming this data into a JSON model of choice, storing/indexing it in a (search) engine of choice, and providing a REST-based API.

Programming experience:

Scripting language of choice: Perl, Python, Ruby, or PHP.

Required:

Laptop with GNU/Linux or OS X, or a virtual machine.