February 16, 2021

Data Mess? Or Data Mesh?

Data is the “fuel” for ML. Yet, in many projects, it is still challenging to get the data right due to a lack of documentation, data quality issues, missing historical data, scalability issues with data platforms and an overall lack of ownership.

At ML6, we recommend that our customers look into “Data Mesh”.
You might have noticed multiple people, blog posts and white papers mention the concept in the last few months. It was even picked up by McKinsey in “How to build a data architecture to drive innovation — today and tomorrow” and was one of the architecture trends in 2020.


Data Mesh is all about the roles and responsibilities related to data, and the technical and functional requirements for a future-proof data platform for analytics and AI.

We are convinced it will improve the way an organisation works with data. However, because we noticed that the concept is not easily understood, this blog post explains data mesh in three easy-to-understand principles.

The origins of “Data Mesh”

Zhamak Dehghani introduced Data Mesh in the original post “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh” on the Martin Fowler website.
A follow-up blog post, “Data Mesh Principles and Logical Architecture”, has since been published, and Zhamak’s view on the available technology is expected in 2021.

Our interpretation of Data Mesh

When we shared the original blog post internally and with several customers, we noticed that the concepts are not easily understood.

People with a technical background immediately start looking for technical solutions and ignore the required organisational changes. Managers responsible for data engineering or traditional business intelligence teams tend to ignore the concepts that impact their role and responsibilities. C-level executives don’t have time to read the in-depth blog post and ask for a concise summary and next steps.

I’ve summarised the main ideas behind a data mesh in three easy-to-understand principles.

Copyright — https://martinfowler.com/articles/data-monolith-to-mesh.html


1. Each domain is responsible and accountable for making the data it generates available as a data product.

A data product is:

Easy to find and consume using the data tools and applications used within the organisation
Complete, offering all the relevant current and historical data
Documented
Trustworthy: the data is available on time and meets an agreed level of data quality
Secure
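
To make this checklist concrete, the five properties can be written down as a small descriptor that a domain fills in for each data product. This is only a minimal sketch: the `DataProduct` class, its field names and the example URL are hypothetical and not part of the data mesh articles or of any tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DataProduct:
    """Hypothetical descriptor for a data product owned by one domain."""
    name: str
    owner_domain: str
    catalog_url: str = ""                       # where consumers can find it
    history_start: Optional[date] = None        # earliest date with data available
    documentation_url: str = ""
    freshness_sla_hours: Optional[int] = None   # agreed maximum delay
    quality_checks: list = field(default_factory=list)
    access_policy: str = ""                     # e.g. name of a group or role

    def missing_properties(self) -> list:
        """Return the data product properties that are not yet covered."""
        missing = []
        if not self.catalog_url:
            missing.append("findable")
        if self.history_start is None:
            missing.append("historical data")
        if not self.documentation_url:
            missing.append("documented")
        if self.freshness_sla_hours is None or not self.quality_checks:
            missing.append("trustworthy")
        if not self.access_policy:
            missing.append("secure")
        return missing


# Example: a product that is registered in a catalog but nothing else yet.
orders = DataProduct(
    name="orders",
    owner_domain="sales",
    catalog_url="https://catalog.example.com/orders",
)
print(orders.missing_properties())
```

A domain could run a check like this before announcing a data product, so that the gaps are visible to the domain itself and not only to its consumers.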


2. You need a centrally managed Data Platform Team with a focus on:

Creating “self-service” building blocks for domains to easily create highly scalable data products
Supporting domains by hosting the data products
Making sure everything is running smoothly in a secure and fully compliant environment
Defining Data/MLOps principles and tools
Monitoring the data/ML pipelines

Besides the technical components, you also need to take care of governance.

3. Centralised governance is required.

Use tools that follow open standards to ensure interoperability and easy but secure data access and integration
On the functional side, data governance is essential to ensure that the data products follow the same data definitions and that data quality issues are acknowledged, tracked and addressed in the roadmap of each domain.
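
As an illustration of what "following the same data definitions" could look like in practice, the sketch below compares the schema of a data product with a central glossary. The glossary content, the field names and the type names are assumptions made up for this example.

```python
# Hypothetical central glossary maintained by the governance team:
# field name -> agreed type.
GLOSSARY = {
    "customer_id": "STRING",
    "order_date": "DATE",
    "revenue_eur": "NUMERIC",
}


def definition_conflicts(schema, glossary=GLOSSARY):
    """List the fields whose type deviates from the central definition.

    Fields that are not in the glossary are ignored here; a real
    governance process would probably ask for them to be registered.
    """
    return sorted(
        name
        for name, field_type in schema.items()
        if name in glossary and glossary[name] != field_type
    )


# Example: the sales domain exposes customer_id as an integer.
sales_schema = {"customer_id": "INTEGER", "order_date": "DATE", "note": "STRING"}
print(definition_conflicts(sales_schema))
```

A conflict found this way is exactly the kind of issue that should end up on the roadmap of the domain that owns the data product.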

What happens if you apply this in an organisation?

First of all, the functional responsibility for making data available is decentralised.

By making the domain that generates the data responsible and accountable for fully exposing their data as a data product, you avoid several issues:

The domain knows the business logic behind the data, so you avoid data engineers, data analysts and data scientists having to reverse engineer what happens
The domain has a view on the roadmap and on (data-)related change requests, so the data product follows the domain’s release cycle. This is often forgotten when the responsibility for data sits with another team.
The domain can’t hide from data quality issues that affect other domains: no more “it works for us in the application” discussions.

Data mesh challenges

The role of the data platform team and the type of tooling required create a lot of confusion.

For example, what’s self-service?!?
Is it a click on a few buttons in a SaaS application to magically sync all the data? Can it be enough to explain the operators available in Apache Airflow to move data from an operational application to distributed storage and load it into a cloud data warehouse?

First of all, look into the data volumes and the types of use cases on top of the data products. If the tech stack of the domain can fully support the data product according to the data mesh rules, you don’t need to move the data. Don’t move it, keep it decentralised.

Secondly, you need to analyse the type of data stores and SaaS services used by the domains. Based on this, it’s feasible to define a company-specific technology stack and offer enough guidance for domains to get started, with support from a central pool of data, analytics and ML engineers.
If domains need a lot of support, it is recommended to add these skills to the domain’s team permanently, but make sure these team members regularly align with people with similar roles and responsibilities and the data platform team.

We do advise fully centralising job orchestration.
It’s essential to have an in-depth view of what’s happening in the production environment.
The data platform team acts as a gatekeeper to ensure that everything meets the quality standards and the required support processes are in place.
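
The gatekeeper role can be sketched as a central registry that only accepts a pipeline once the required support information is present. This is a simplified illustration in plain Python, not Airflow code; the `PipelineRequest` and `CentralOrchestrator` names and the two checks are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRequest:
    """Hypothetical request from a domain to run a pipeline in production."""
    pipeline_id: str
    domain: str
    schedule: str            # e.g. a cron expression
    on_call_contact: str     # who is alerted when the pipeline fails
    has_quality_checks: bool


class CentralOrchestrator:
    """Single place where all production pipelines are registered."""

    def __init__(self):
        self._pipelines = {}

    def register(self, request):
        # Gatekeeping: refuse pipelines without a support process
        # or without data quality checks.
        if not request.on_call_contact:
            raise ValueError(f"{request.pipeline_id}: no on-call contact defined")
        if not request.has_quality_checks:
            raise ValueError(f"{request.pipeline_id}: no data quality checks defined")
        if request.pipeline_id in self._pipelines:
            raise ValueError(f"{request.pipeline_id}: already registered")
        self._pipelines[request.pipeline_id] = request

    def overview(self):
        """Pipelines in production, grouped by owning domain."""
        result = {}
        for request in self._pipelines.values():
            result.setdefault(request.domain, []).append(request.pipeline_id)
        return result


orchestrator = CentralOrchestrator()
orchestrator.register(
    PipelineRequest("orders_sync", "sales", "0 * * * *", "sales-data@example.com", True)
)
print(orchestrator.overview())
```

Because every pipeline passes through one registry, the platform team keeps a single overview of production while the domains still own the pipelines themselves.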

Depending on the type of organisation and business processes, the maturity of data management varies enormously. For certain entities, such as customers, the entire data lifecycle is often fully regulated and documented, and data is stored with unique identifiers across all applications. This is not the case for other business processes, where significant problems appear as soon as you start combining data.

The data governance team, preferably centralised, is responsible for keeping track of metadata, data definitions including KPIs, and data quality, and for working towards maximum interoperability.

Conclusion

Some parts of the data mesh methodology and how they map onto technology are not fully clear yet and highly depend on your company’s culture, budget, and tech stack.

In the market, we see numerous competitors and technology companies using data mesh to sell advanced data management, DataOps/MLOps software and data virtualisation solutions.
Depending on your organisation, these solutions can bring value, but we have seen fantastic results by keeping it simple and using technologies that are readily available and affordable today.

We’ve seen great results at our customers with domains that, with a bit of data engineering support, continuously sync data to BigQuery using fully serverless data pipelines that are scheduled and monitored with Cloud Composer/Apache Airflow.
This provides enough observability and data lineage for technical teams.
Google Data Catalog enables data discovery with some additional tags and a link to a wiki for more in-depth info.
Each domain publishes several dashboards with the latest view of the performance/timely arrival of the data pipelines, data quality KPIs and examples of operational/management dashboards that showcase the data in the domain.
Since the data in BigQuery is easy to access and fully available, the time the domain and data engineering teams spend on point-to-point data syncs decreased.
This inspired other domains to move more data into BigQuery.
Data quality issues surface a lot faster and are easier to track and address.
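
The dashboard figures mentioned above, timely arrival and data quality KPIs, can be computed with very little code. The sketch below is an assumption about what such KPIs might look like, using plain Python rather than BigQuery or Airflow; the function names and the choice of a completeness metric are ours, not taken from a customer setup.

```python
from datetime import datetime, timedelta


def arrived_on_time(expected_by, loaded_at, tolerance=timedelta(0)):
    """True when a load finished before its deadline (plus tolerance)."""
    return loaded_at <= expected_by + tolerance


def on_time_rate(runs):
    """Share of runs, given as (expected_by, loaded_at) pairs, that arrived on time."""
    if not runs:
        return 0.0
    on_time = sum(arrived_on_time(expected, loaded) for expected, loaded in runs)
    return on_time / len(runs)


def completeness(rows, required_field):
    """Share of rows with a non-empty value for a required field."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(required_field) not in (None, ""))
    return filled / len(rows)


# Example: two daily loads, one of them an hour late.
deadline = datetime(2021, 2, 16, 6, 0)
runs = [
    (deadline, deadline - timedelta(minutes=10)),
    (deadline + timedelta(days=1), deadline + timedelta(days=1, hours=1)),
]
print(on_time_rate(runs))
print(completeness([{"customer_id": "a1"}, {"customer_id": ""}], "customer_id"))
```

Publishing a few numbers like these per data product is often enough to start the conversation about data quality between domains.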

Well, this is our interpretation of Data Mesh.
Get in touch if you have questions or comments!

In the future, we’ll publish more blog posts that highlight modern technologies that are useful in the context of data mesh, and that describe how we see modern data teams speeding up the adoption of advanced analytics and AI.
