What is Arc?
Arc is an opinionated framework for defining predictable, repeatable and manageable data transformation pipelines:
- predictable in that data, not code, is used to define transformations.
- repeatable in that if a job is executed multiple times it will produce the same result.
- manageable in that execution considerations and logging have been baked in from the start.
Many of these principles have come from 12factor:
- single responsibility components/stages.
- stateless jobs where possible and use of immutable datasets.
- precise logging to allow management of jobs at scale.
- library dependencies are to be limited or avoided where possible.
Not just for data engineers
The intent of the framework is to provide a simple way of creating Extract-Transform-Load (ETL) pipelines that can be maintained in production and that capture the answers to simple operational questions transparently to the user:
- monitoring: is it working each time it’s run, and how much resource was consumed in creating it?
- devops: Arc is packaged as a Docker image to allow rapid deployment on ephemeral compute.
These concerns are supported at run time to ensure that, as a deployment grows in use and complexity, it does not become opaque and unmanageable.
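To make the monitoring question concrete, here is a minimal sketch of per-stage structured logging: each stage emits one JSON line recording whether it succeeded and how long it took. This is not Arc's logging format. The `run_stage` helper and the field names (`event`, `stage`, `status`, `durationMs`) are assumptions made for illustration.

```python
import json
import time


def run_stage(name, func, log):
    """Run one stage and emit a single JSON log line describing the outcome.

    `name`, `func` and `log` are hypothetical: a label for the stage, a
    zero-argument callable doing the work, and a callable that accepts a string.
    """
    start = time.perf_counter()
    event = {"event": "exit", "stage": name}
    try:
        result = func()
        event["status"] = "success"
        return result
    except Exception as exc:
        # Record the failure, then re-raise so the job still stops.
        event["status"] = "failure"
        event["error"] = str(exc)
        raise
    finally:
        # Runs on both success and failure, so every stage leaves a record.
        event["durationMs"] = round((time.perf_counter() - start) * 1000, 3)
        log(json.dumps(event, sort_keys=True))


# Usage: log to stdout; any callable that accepts a string would work.
total = run_stage("sum fares", lambda: sum([12.5, 7.5]), print)
```

Because every line is a self-describing JSON object, logs from many jobs can be collected and queried together without parsing free-form text.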
Why abstract from code?
In our experience, a very high proportion of data pipelines perform very similar extract, transform and load actions on datasets. Unfortunately, whilst the desired outcomes are largely similar, the implementations vary widely, resulting in higher maintenance costs, lower test coverage and high levels of rework.
The intention of this project is to define and implement an opinionated, standard approach for declaring data pipelines which is open and extensible. Abstracting from the underlying code allows rapid deployment, provides a consistent way of defining transformation tasks (such as data typing) and separates the pipeline definition from the pipeline execution, so that the underlying execution engine can be changed (see declarative programming).
Currently Arc is tightly coupled to Apache Spark because of Spark's fault tolerance, performance and solid API for standard data engineering tasks. However, the definitions are written in HOCON (a JSON derivative), which is both human and machine readable, allowing the transformation definitions to be implemented against future execution engines.
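As a rough illustration of separating a pipeline's definition from its execution, the sketch below declares three stages as data and runs them with a tiny interpreter. It is not Arc's configuration schema: the stage types (`Extract`, `Typing`, `Filter`), their fields and the `run` function are all hypothetical. Plain JSON is used for the definition because any valid JSON document is also valid HOCON.

```python
import json

# Hypothetical pipeline definition: the stage types and field names are
# illustrative only and do not reflect Arc's actual schema.
PIPELINE = json.loads("""
{
  "stages": [
    {"type": "Extract", "name": "read rows",
     "rows": [{"id": "1", "fare": "12.50"}, {"id": "2", "fare": "7.00"}]},
    {"type": "Typing", "name": "apply types",
     "columns": {"id": "int", "fare": "float"}},
    {"type": "Filter", "name": "keep large fares",
     "column": "fare", "min": 10.0}
  ]
}
""")

CASTS = {"int": int, "float": float, "str": str}


def extract(stage, rows):
    # Copy the inline rows so the definition itself is never mutated.
    return [dict(row) for row in stage["rows"]]


def typing(stage, rows):
    # Cast each column to its declared type; undeclared columns stay strings.
    columns = stage["columns"]
    return [{key: CASTS[columns.get(key, "str")](value)
             for key, value in row.items()} for row in rows]


def filter_rows(stage, rows):
    return [row for row in rows if row[stage["column"]] >= stage["min"]]


# The "execution engine": a lookup from stage type to implementation.
HANDLERS = {"Extract": extract, "Typing": typing, "Filter": filter_rows}


def run(pipeline):
    rows = []
    for stage in pipeline["stages"]:
        rows = HANDLERS[stage["type"]](stage, rows)
    return rows


print(run(PIPELINE))  # prints [{'id': 1, 'fare': 12.5}]
```

Swapping the `HANDLERS` table for a different set of implementations would change how the pipeline runs without touching the definition, which is the property the declarative approach is after.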
Why SQL first?
SQL first (based on the Mobile First UX principle) is an approach where, if possible, transformations are done using Structured Query Language (SQL). This is because SQL is a very good way of expressing standard data transformation intent in a declarative way. SQL is so widely known and taught that finding people who understand the business context and can write basic SQL is much easier than finding, for example, a Scala developer who also understands the business context.
Currently the Hive dialect of SQL is supported, because Spark SQL uses the same dialect and provides many of the functions that would be expected from other SQL dialects. This could change in the future.
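A minimal sketch of the SQL-first idea follows: the transformation is a plain SQL statement, and the surrounding code only loads data and runs it. SQLite from the Python standard library stands in for Spark SQL purely so that the example is self-contained, and only widely portable SQL is used. The `trips` table, its columns and the `run_transform` helper are assumptions made for illustration.

```python
import sqlite3

# The transformation is expressed entirely in SQL and kept separate from
# the code that executes it.
TRANSFORM_SQL = """
SELECT vendor_id,
       COUNT(*)  AS trips,
       SUM(fare) AS total_fare
FROM trips
WHERE fare > 0
GROUP BY vendor_id
ORDER BY vendor_id
"""


def run_transform(rows):
    """Load (vendor_id, fare) rows into an in-memory table and run the SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE trips (vendor_id INTEGER, fare REAL)")
        conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)
        return conn.execute(TRANSFORM_SQL).fetchall()
    finally:
        conn.close()


sample = [(1, 12.5), (1, 7.5), (2, 0.0), (2, 30.0)]
print(run_transform(sample))  # prints [(1, 2, 20.0), (2, 1, 30.0)]
```

Someone who knows the business rule (ignore zero fares, summarise by vendor) can read and change `TRANSFORM_SQL` without needing to understand the runner around it.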
If you have suggestions for additional components or find issues that you believe need fixing then please raise an issue. An issue with a test case is even more appreciated.
When you contribute code, you affirm that the contribution is your original work and that you license the work to the project under the project’s open source license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project’s open source license and warrant that you have the legal authority to do so.
For questions about usage that are not answered by the documentation, post a new ‘issue’ in the questions repository. This repository acts as a forum where questions can be posted, discussed and searched.
For commercial support requests please contact us via email.
Thanks to the following projects:
- Apache Spark for the underlying framework that has made this library possible.
- slf4j-json-logger Copyright © 2016 Savoir Technologies released under the Apache 2.0 License. We have slightly altered their library to change the default logging format.
- azure-sqldb-spark for their Microsoft SQL Server bulkload driver. Currently included in /lib but will be pulled from Maven once available.
- nyc-taxi-data for preparing an easy to use set of real-world data for the tutorial.