[Opinion] How to build a data architecture to drive innovation: today and tomorrow
January 28, 2021
Koen Verschaeren


In June 2020, McKinsey published a report called “How to build a data architecture to drive innovation — today and tomorrow”. The report is an excellent overview of a modern, future-proof data architecture, highlighting recent advances in data technology along with recommended building blocks and methodologies.

What do we like?

As an AI company, we follow the report’s recommendation to use serverless cloud data lake/data warehouse services. These enable affordable scalability and flexibility with little or no infrastructure maintenance. Schema-on-read is a game-changer for semi-structured data. The focus on a domain-driven approach highlights that end-to-end responsibility for the data belongs within the domain that generates it.

Organizing a data lake into multiple layers with a raw layer, preferably immutable, and curated layers managed by the domain is an excellent approach.
Starting from the curated layer, it’s easy to create new fit-for-purpose “data products”, in data mesh terms (or data marts, for the Kimball generation). A good example: a data set for an analytics tool often has a different data model, or even file format, than the same data used for ML model training.
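As a sketch of that idea, the same curated data can feed two differently shaped data products. Everything below is hypothetical (an invented “orders” domain, with an in-memory pandas DataFrame standing in for curated Parquet tables in the lake):

```python
import pandas as pd

# Hypothetical curated-layer table for an "orders" domain.
curated = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["a", "b", "a", "c"],
    "amount": [10.0, 25.0, 5.0, 40.0],
    "country": ["BE", "NL", "BE", "BE"],
})

# Data product 1: an aggregated mart shaped for an analytics tool.
analytics_mart = (
    curated.groupby("country", as_index=False)["amount"].sum()
)

# Data product 2: a per-customer feature table shaped for ML training,
# with a different data model than the analytics mart.
ml_features = (
    curated.groupby("customer", as_index=False)
    .agg(order_count=("order_id", "count"), total_spent=("amount", "sum"))
)
```

In practice each data product would also be materialized in the format its consumer prefers, for example Parquet for model training and a warehouse table for BI.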

The move away from pre-integrated commercial solutions in favour of a well-chosen tech stack is exactly what we do for customers. A mix of SaaS, commercial software and plenty of open source, backed by large communities, is more efficient and cost-effective than hitting the limits of traditional ETL software or (over-)engineering your own data processing frameworks.

We are glad we’ve seen this data architecture in several requests for proposals.
It is a great blueprint, but in our opinion two of the shifts need a bit more thought.

From batch to real-time processing

It makes absolute sense to ingest the data in real-time if:

  • your data is streaming in nature, for example the Internet of Things or web analytics;
  • your applications are based on a microservices design with CQRS on top of Apache Kafka or a similar service such as Apache Pulsar;
  • you have business requirements that combine data from multiple domains in real time and require real-time decision making.

In the majority of cases, we have seen that the business value for real-time processing in the data warehouse context is limited.

  • A lot of data-driven decision making is not real-time. In some cases we’ve seen daily usage, but insights are often consumed on a weekly, monthly or quarterly basis.
  • From a technical point of view, the tech stack behind many applications is not yet event-based, so plenty of refactoring is needed before every action emits an event.
  • The costs of real-time processing are considerably higher, and streaming data pipelines are more complex to develop than a (micro-)batch approach.
  • In front-end applications, users often request controls to minimize or suppress real-time alerts, because the alerts interrupt the flow of their work.
  • Real-time management dashboards are underused, because only a limited number of employees have the time to analyse and act frequently during the day.

There are also alternatives to get to real-time insights:

  • In operational applications, real-time reporting is often integrated on top of the application database to offer full transactional consistency and input capabilities.
  • Inference with ML models tends to be tightly integrated into the operational application, with trained models exposed as REST APIs, or inference moves even closer to the device using edge processing.
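A minimal sketch of that serving pattern, using only the Python standard library: a WSGI app wraps a stand-in linear scorer. Everything here is hypothetical; a real service would load a serialized trained model (for example with joblib) and run behind a production server and an API gateway:

```python
import json

def predict(features: dict) -> float:
    """Stand-in for a trained model; a real service would load one from disk."""
    return 0.5 * features["x1"] + 2.0 * features["x2"]

def app(environ, start_response):
    """WSGI app: POST a JSON feature payload, receive a JSON score back."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(size))
    body = json.dumps({"score": predict(payload)}).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]
```

Because the model sits behind a plain HTTP interface, the operational application stays decoupled from the ML tooling used to train it.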

In our data pipeline designs, we aim to process the data at the right time.
As data volumes increase, we increase the ingestion frequency to avoid long-running processes. In a modern job orchestrator, it’s easy to change the refresh rate, and data processing frameworks such as Apache Beam/Cloud Dataflow require minimal changes to switch to streaming or micro-batches if needed.
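One way to keep that switch cheap is to keep the transform logic independent of the execution mode. Below is a plain-Python sketch of the idea (Beam expresses the same separation with PCollections and runners; the record fields here are invented for illustration):

```python
from typing import Iterable, Iterator, List

def transform(record: dict) -> dict:
    """Domain logic, written once, independent of batch vs streaming."""
    return {**record, "amount_eur": record["amount_usd"] * 0.9}

def run_batch(records: Iterable[dict]) -> List[dict]:
    """Batch mode: process the whole input in one go."""
    return [transform(r) for r in records]

def run_micro_batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Micro-batch mode: same transform, emitted in small chunks."""
    batch: List[dict] = []
    for r in records:
        batch.append(transform(r))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Switching the refresh rate then becomes an orchestration decision, not a rewrite of the pipeline logic.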
And don’t forget that AI offers plenty of opportunities to trigger only relevant alerts.

From point to point to decoupled data access

McKinsey recommends decoupling using APIs managed by an API gateway.
This is a proven approach to decoupling microservices, or to providing data, with a well-defined application interface and schema, to one or more internal or external applications.
A modern API gateway offers an additional security layer, observability and a documented user-friendly interactive developer portal.

APIs are, however, not recommended for large-scale data exchange.

Large-scale data processing frameworks and data science tooling are more efficient with a direct connection to a cloud data warehouse or to data files in distributed storage.
Modern drivers and file formats, for example Apache Parquet or the BigQuery Storage API, support high-speed, zero-copy data access using Apache Arrow. This is more efficient than using a REST API with limited filter capabilities, a limited number of rows per request and slow JSON deserialization.
Furthermore, the majority of data visualization tools prefer, or only support, SQL-based drivers.

We advise spending enough time to analyse the required integrations and take pragmatic design decisions.
It’s perfectly acceptable to process data in the data lake and export a subset of it to Elasticsearch or an in-memory data store. Alerts can be published on a message or task queue. Integration with internal/external REST APIs can be tricky if the API is not scalable enough to handle the number of requests modern data processing frameworks generate.
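A sketch of that export pattern, assuming hypothetical records and a hypothetical filter: the subset is serialized in the Elasticsearch bulk NDJSON format (an action line followed by the document). No cluster is contacted here; a real export would POST this payload to the `_bulk` endpoint:

```python
import json

# Hypothetical records processed in the data lake.
records = [
    {"id": 1, "status": "ok", "latency_ms": 120},
    {"id": 2, "status": "error", "latency_ms": 900},
    {"id": 3, "status": "ok", "latency_ms": 80},
]

def to_bulk_lines(docs, index_name: str) -> str:
    """Serialize documents as Elasticsearch bulk-index NDJSON lines."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Export only the relevant subset, not the whole lake.
slow_or_failed = [r for r in records if r["status"] == "error" or r["latency_ms"] > 500]
bulk_payload = to_bulk_lines(slow_or_failed, "ops-alerts")
```

The same filtering step could just as well publish each matching record to a message or task queue instead of building a bulk payload.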

In the report, we would suggest moving the data catalog, mentioned in shift 5, to this section.
The data catalog should be the main entry point for business users, data scientists and application developers to discover the available data, check data lineage and see the options to access it. If you’d like to learn more about the pitfalls of microservices, check this point of view.

Conclusion

We are looking forward to an update of the report this year, because the landscape for data and ML keeps evolving at a fast pace.
Get in touch with your point of view, or in case you have any questions or suggestions.
