Config Driven Data Transforms (pt1)


There’s no money in manually manipulating data.

I run into a lot of projects where too much bespoke code is created for the sole purpose of transforming one shape of data to another.

Areas of a system where data is exchanged are typically hotspots for change, which is why we want to move that change from a code-deployment requirement to a configuration concern.

Depending on the language being used in the system, there are several tools available for transforming data with various DSLs (Domain-Specific Languages).

Whether the goal is reducing maintenance costs within the codebase, providing self-serve data transformations for non-technical users, or somewhere in between, several aspects should be considered when choosing or developing a DSL for your transformation needs.

  • Learning Curve
  • Performance
  • Validation
  • Versioning
  • Testing

Example using schleppy transforms

Consider the following source data:

data = {
    '_id': '62008e17420676bfbd0dac95',
    'index': 0,
    'guid': '2a8f8716-36b0-4bf2-b6ec-e697a9b5658c',
    'isActive': True,
    'balance': '$3,128.45',
    'picture': 'http://placehold.it/32x32',
    'age': 22,
    'eyeColor': 'brown',
    'name': 'Stefanie Bryant',
    'gender': 'female',
    'company': 'DATACATOR',
    'email': 'stefaniebryant@datacator.com',
    'phone': '+1 (959) 409-3601',
    'address': '181 Kingsway Place, Lloyd, California, 9294',
    'about': 'Aute nostrud id ipsum sunt excepteur. Nulla est do eiusmod excepteur. Proident esse minim velit labore. Excepteur incididunt aliqua elit labore nostrud sint labore irure ullamco irure ex. Occaecat laborum voluptate enim dolor. Ex eiusmod nisi non in non deserunt laboris. Irure qui aute excepteur pariatur.\r\n',
    'registered': '2018-02-10T04:19:48 +07:00',
    'latitude': 57.13409,
    'longitude': -27.877626,
    'tags': [
        'minim',
        'pariatur',
        'amet',
        'officia',
        'duis',
        'irure',
        'ullamco'
    ],
    'friends': [
        {
            'id': 0,
            'name': 'Jessie Rojas'
        },
        {
            'id': 1,
            'name': 'Kelley Neal'
        },
        {
            'id': 2,
            'name': 'Angeline Sharpe'
        }
    ],
    'greeting': 'Hello, Stefanie Bryant! You have 2 unread messages.',
    'favoriteFruit': 'apple'
}

And suppose you need the following output:

{
    'data': {
        'name': 'Stefanie Bryant',
        'contact': {
            'phone': '+1 (959) 409-3601',
            'email': 'stefaniebryant@datacator.com'
        },
        'location': {
            'address': '181 Kingsway Place, Lloyd, California, 9294',
            'gps': {
                'lat': 57.13409,
                'lon': -27.877626
            }
        }
    }
}

Simple reshaping like this can be achieved using source and sink alone in the transform definition below:

from plot.transform.processor import TransformProcessor

transformer = {
    'name': 'Getting Started Transform',
    'source': {'path': 'data'},
    'sink': {
        'path': {
            'data.name': 'source.name',
            'data.contact.phone': 'source.phone',
            'data.contact.email': 'source.email',
            'data.location.address': 'source.address',
            'data.location.gps.lat': 'source.latitude',
            'data.location.gps.lon': 'source.longitude'
        }
    }
}

# execution_context is assumed to be supplied by the host application
processor = TransformProcessor(execution_context, transformer)

result = processor()

Learning Curve

Simple path traversal to reshape data has a low learning curve, as the DSL is represented by the data itself.

Often, we’ll need to pull in more powerful tools depending on our transform needs, but a little bit can go a long way.
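Under the hood, this kind of reshaping is just dotted-path reads and writes. The helpers below are not schleppy's implementation, only a minimal sketch of the idea to show why the learning curve stays low:

```python
from functools import reduce

def get_path(obj, path):
    """Walk a dotted path ('a.b.c') down through nested dicts."""
    return reduce(lambda acc, key: acc[key], path.split('.'), obj)

def set_path(obj, path, value):
    """Create intermediate dicts as needed, then set the leaf value."""
    *parents, leaf = path.split('.')
    for key in parents:
        obj = obj.setdefault(key, {})
    obj[leaf] = value

def reshape(source, sink_paths):
    """Apply a {sink_path: source_path} mapping to build a new dict."""
    result = {}
    for sink, src in sink_paths.items():
        set_path(result, sink, get_path(source, src))
    return result

mapping = {
    'data.name': 'name',
    'data.contact.phone': 'phone',
}
reshaped = reshape({'name': 'Stefanie Bryant', 'phone': '+1 (959) 409-3601'}, mapping)
# reshaped == {'data': {'name': 'Stefanie Bryant',
#                       'contact': {'phone': '+1 (959) 409-3601'}}}
```

The mapping document is plain data, so anyone who can read the output shape can read the transform.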


Validation

Validation via standard registries is key to ensuring data integrity, and schema checks can be used during several stages of the transform cycle:

  • Pre-transform (make sure the incoming data is in the right shape)
  • Pre-step (when there is a multi-step DAG for the transformation, we can make sure the section being transformed is in the right shape)
  • Post-step (check that the transform produced the desired shape)
  • Post-transform (final guarantee that the transform was successful)

The most important schema validation is usually the Post-transform, but having several checks (especially with complex transformations) can help pin down potential quality issues.
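As an illustration of the post-transform stage, here is a toy structural validator. In practice you would reach for JSON Schema or whatever your schema registry speaks; the schema below simply mirrors the example output, with types at the leaves:

```python
def validate_shape(value, schema):
    """Minimal structural check: dicts must contain the schema's keys,
    and leaves must match the schema's types."""
    if isinstance(schema, dict):
        if not isinstance(value, dict):
            return False
        return all(key in value and validate_shape(value[key], sub)
                   for key, sub in schema.items())
    return isinstance(value, schema)

# Post-transform schema for the example output above.
POST_TRANSFORM_SCHEMA = {
    'data': {
        'name': str,
        'contact': {'phone': str, 'email': str},
        'location': {'address': str, 'gps': {'lat': float, 'lon': float}},
    }
}
```

The same check can run pre-transform, pre-step, and post-step; only the schema document changes.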


Versioning

By moving transformation into a DSL document, the same code can process multiple versions of a transformation, making version tracking and evolution much easier.
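One way to sketch this: keep the transform documents in a registry keyed by version, and let the payload declare which one applies. The registry and the transformVersion field below are hypothetical illustrations, not a schleppy convention:

```python
# Hypothetical registry of transform documents keyed by version; the same
# processing code can then serve old and new payloads side by side.
TRANSFORMS = {
    1: {'name': 'Getting Started Transform',
        'sink': {'path': {'data.name': 'source.name'}}},
    2: {'name': 'Getting Started Transform',
        'sink': {'path': {'data.profile.name': 'source.name'}}},
}

def transform_for(payload, default_version=2):
    """Pick the transform document matching the payload's declared version."""
    return TRANSFORMS[payload.get('transformVersion', default_version)]
```

Rolling out a new output shape becomes publishing version 3 of a document, not deploying new code.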


Testing

An ideal test framework is the one consuming teams use on their existing services. This means we should make our transformation process as agnostic as possible.

JSON input files -> Transformation(s) -> Output JSON files

Providing test and result data in a standard format makes it easy for teams to test and iterate on their transforms, even before deployment.

This also enables the team creating the transform to use the test framework of their choosing for making assertions.
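A minimal harness for that pipeline might look like the following; the input.json / expected.json fixture layout and the transform_fn callable are assumptions here, not a prescribed interface:

```python
import json
from pathlib import Path

def run_case(case_dir, transform_fn):
    """Run one file-based test case:
    input.json -> transform_fn -> compare against expected.json."""
    case = Path(case_dir)
    source = json.loads((case / 'input.json').read_text())
    expected = json.loads((case / 'expected.json').read_text())
    return transform_fn(source) == expected
```

Because cases are just pairs of JSON files, teams can drive them from pytest, JUnit, or whatever assertion framework they already use.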