February 21, 2021

[Trends] Analytic Engineering with Dbt or DataForm

Analytic Engineers? Or Analytic Engineering?

‍

Another new term in the data world or is it an example of clever marketing? Analytic Engineering is a term pushed by the team behind Dbt, Data Build Tool.
Dbt, and the recently acquired DataForm, are tools that simplify data engineering and data governance for data pipelines expressed in SQL.

The main drivers behind this technology are the rise of cloud-based data warehouses and the SQL knowledge available on the market.

Reference: https://mattturck.com/data2020/ — Data & AI Landscape 2020 by Matt Turck

‍

The shift towards cloud data warehouses

In the data world, we’ve seen the age of challenging to scale relational databases and data warehouse appliances, often on expensive on-premise hardware.
Followed by the introduction of document or key-value databases with flexible schemas but less powerful or more complex query capabilities.
Data processing shifted from traditional ETL tools to large scale distributed data processing, for structured and unstructured data, on massive but difficult to manage Hadoop/Spark clusters. These days cloud platforms and independent vendors are simplifying cluster management in the cloud and offering more flexibility to scale-out clusters based on demand.

On the other hand, modern cloud data warehouses, with an architecture split into powerful but affordable column-based storage paired with flexible compute are a game-changer.
It’s feasible to store large amounts of data and transform or query the data, structured and semi-structured (JSON/XML), using SQL.
The cloud data warehouse automatically translates the SQL statement into highly optimized distributed processing without having to write a single line of cloud.

The power of SQL

SQL, because of the declarative nature and knowledge available in the market, is a continuous theme in the data and AI world.
Initially, data processing and machine learning were only possible by developing programs in Java, Scala, Python etc. The trend these days is to add SQL support to avoid or minimize programming. Great examples are Apache Kafka with KSQL, Apache SparkSQL, FlinkSQL, BigQuery ML, BlazingSQL, BeamSQL.

Why? SQL skills are easier to find in the market than developers with big data or ML experience. People that work on data visualisations know, or learn SQL, faster than Java or Python. Analysts which turn into data scientists often prefer to use SQL instead of programming or relying on a separate data engineering team.

The problem with SQL

SQL can however be difficult to manage.
SQL statements end up in the database as views, scripts, stored procedures or scheduled queries. Dynamic or reusable SQL is hard to write. Dependencies between tables/views often need to be managed manually in cloud data warehouses.
SQL statements mixed together with code can be messy.
Using a modern job orchestrator solves some of these problems because a containerized step or operator can point to a versioned, parameterized, SQL statement or script.

So we need to introduce a bit more software engineering best practices.
And that’s what Dbt and Dataform do.

Dbt — engineering and data governance

Dbt combines configuration files, that describe the available source tables and output artefacts (reports/notebooks/dashboards/…), with data transformations, called models, written using templated SQL.
Everything is stored in files so it’s a no-brainer to version the files in Git as any other code.

Using config files and variables, it’s easy to switch between different environments.
Using materializations it’s a small code change to turn a view into a, truncate/load or incremental/append-only, table.
By combining a bit of code, templates and SQL it’s not that complicated to generate complex SQL based on configuration, variables or query results. Dbtvault, to store data according to the data vault 2.0 modelling principles, is one of the best examples.

Instead of referring to the physical database and table name, it’s recommended to use the “source” and “ref” syntax. Dbt automatically generates a DAG that takes all the dependencies into account and executes all the steps in the correct order.
Additionally, this provides exactly the kind of data lineage end-users and data/ML engineers are looking for.
In the config files, it’s straightforward to add table and column descriptions and data tests.
Data tests are essential because cloud data warehouses often don’t enforce unique or foreign key constraints. And everybody knows that fighting with data quality is, unfortunately, a reality in most data pipelines.
Do you need to document your data pipeline? Dbt uses the config files and models to generate an easy to use and customize the documentation website.

Reference: https://docs.getdbt.com/docs/building-a-dbt-project/exposures

‍

Dbt in the ML6 tech stack

At ML6 we are using BigQuery in a lot of projects.
Dbt has been part of our tech stack for about 1,5 years. The impact has been tremendous.
SQL-based steps in data pipelines are easier to write, complex dynamic logic is written only once and reused multiple times.

Teams that rely on data enjoy the automatically generated docs, data lineage and data tests.

From a technical perspective, we like the flexibility of Dbt.
It’s easy to integrate into our coding and DataOps/MLOps best practices.
The DAG can be executed fully serverless using Cloud Build, containerized using Kubeflow or as part of an Apache Airflow/Cloud Composer DAG.

2021 might be the year of “Analytic Engineering”. We are looking forward to the quickly expanding Dbt ecosystem. We are closely following up how Dataform will be integrated into Google Cloud Platform because the functionality is very similar, but the open-source community is smaller.

Keep an eye on our blog because we’ll post a review of Dbt Cloud, Dataform and a more technical Dbt blog soon.

‍