In June 2020 McKinsey published a report called “How to build a data architecture to drive innovation — today and tomorrow”. The report is an excellent view of a modern, future-proof data architecture, highlighting advances in data technology and recommending building blocks and methodologies.
As an AI company we follow the report’s recommendation and prefer serverless cloud data lake and data warehouse services. These enable affordable scalability and flexibility with little or no infrastructure maintenance. Schema-on-read is a game-changer for semi-structured data. The focus on a domain-driven approach highlights that end-to-end responsibility for the data belongs within the domain that generates it.
Organizing a data lake into multiple layers, with a raw layer (preferably immutable) and curated layers managed by the domain, is an excellent approach.
Starting from the curated layer, it’s easy to create new fit-for-purpose “data products”, in data mesh terms (or data marts for the Kimball generation). A good example: a data set for an analytics tool often has a different data model, or even file format, than the same data used for ML model training.
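As a minimal sketch of that idea in pure Python (all names and values are hypothetical), the same curated records can be reshaped into two fit-for-purpose data products: one aggregated per customer for an analytics tool, and one as flat numeric feature vectors for model training.

```python
# Same curated records, two fit-for-purpose shapes (hypothetical data).
curated = [
    {"customer": "c1", "product": "p1", "amount": 10.0},
    {"customer": "c1", "product": "p2", "amount": 5.0},
    {"customer": "c2", "product": "p1", "amount": 7.5},
]

# Analytics "data product": total spend per customer for a BI tool.
analytics = {}
for row in curated:
    analytics[row["customer"]] = analytics.get(row["customer"], 0.0) + row["amount"]

# ML "data product": one flat feature vector per customer for model training.
products = sorted({r["product"] for r in curated})
features = {}
for row in curated:
    vec = features.setdefault(row["customer"], [0.0] * len(products))
    vec[products.index(row["product"])] += row["amount"]
```

Both products are derived from the same curated layer, so the domain keeps a single source of truth while each consumer gets the shape it needs.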
The move away from pre-integrated commercial solutions in favour of a well-picked tech stack is exactly what we do for customers. A mix of SaaS, commercial software and plenty of open source, backed by a large community, is more efficient and cost-effective than hitting the limits of traditional ETL software or (over)engineering your own data processing frameworks.
We are glad we’ve seen this data architecture in several requests for proposals.
It is a great blueprint, but in our opinion two shifts need a bit more thought.
It makes absolute sense to ingest the data in real time if:
In the majority of cases, we have seen that the business value for real-time processing in the data warehouse context is limited.
There are also alternatives to get to real-time insights:
In our data pipeline designs, we aim to process the data at the right time.
As data volumes increase, we increase the data ingestion rate to avoid long-running processes. In a modern job orchestrator it’s easy to change the refresh rate, and data processing frameworks such as Apache Beam/Cloud Dataflow require minimal changes to switch to streaming or micro-batches if needed.
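To illustrate the design goal in plain Python (Beam itself is not used here; the `transform` function and records are made up), the core business logic can stay identical while only the slicing of work changes between a daily batch and smaller, more frequent micro-batches:

```python
def transform(record):
    """Core business logic, unchanged whether we batch or micro-batch."""
    return {**record, "amount_eur": record["amount_cents"] / 100}

def run_batch(records):
    """Daily batch: process the whole extract in one go."""
    return [transform(r) for r in records]

def run_micro_batches(records, batch_size=2):
    """Micro-batches: same transform, smaller and more frequent slices."""
    out = []
    for i in range(0, len(records), batch_size):
        out.extend(transform(r) for r in records[i:i + batch_size])
    return out

rows = [{"id": 1, "amount_cents": 250}, {"id": 2, "amount_cents": 999},
        {"id": 3, "amount_cents": 100}]
# Both schedules produce the same result; only the refresh rate differs.
assert run_batch(rows) == run_micro_batches(rows)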
And don’t forget that AI offers plenty of opportunities to trigger only relevant alerts.
McKinsey recommends decoupling using APIs managed by an API gateway.
This is a proven approach to decouple micro-services or provide data, with a well-defined application interface and schema, to one or more internal or external applications.
A modern API gateway offers an additional security layer, observability and a documented user-friendly interactive developer portal.
Large scale data processing frameworks and data science tooling are more efficient with a direct connection to a cloud data warehouse or data files in distributed storage.
Modern drivers and file formats, for example Apache Parquet or the BigQuery Storage API, support high-speed, zero-copy data access using Apache Arrow. This is more efficient than using a REST API with limited filter capabilities, a limited number of rows per request and slow JSON deserialization.
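A simplified pure-Python stand-in for the Parquet/Arrow idea (the columns and values are made up) shows the difference: a row-oriented JSON payload forces the client to deserialize every field of every row, while a columnar layout lets it read only the columns the query needs and filter them as vectors.

```python
import json

# Row-oriented REST-style payload: the client parses all fields of all rows.
rows = [{"user_id": i, "country": "BE" if i % 2 else "NL", "spend": i * 1.5}
        for i in range(6)]
rest_payload = json.dumps(rows)
decoded = json.loads(rest_payload)
filtered = [r["spend"] for r in decoded if r["country"] == "BE"]

# Columnar layout: touch only the two columns the query needs.
columns = {
    "country": [r["country"] for r in rows],
    "spend":   [r["spend"] for r in rows],
}
mask = [c == "BE" for c in columns["country"]]
col_filtered = [s for s, keep in zip(columns["spend"], mask) if keep]

assert filtered == col_filtered
```

Formats like Parquet take this further with compression, predicate pushdown and zero-copy transfer via Arrow, which is where the real speed-up comes from at scale.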
Furthermore, the majority of data visualization tools prefer, or only support, SQL-capable drivers.
We advise spending enough time to analyse the required integrations and to make pragmatic design decisions.
It’s perfectly acceptable to process data in the data lake and export a subset of it to Elasticsearch or an in-memory data store. Alerts can be published on a message or task queue. Integration with internal or external REST APIs can be tricky if the API is not scalable enough to handle the number of requests modern data processing frameworks generate.
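A minimal sketch of the alert pattern, using Python’s standard-library `queue` as a stand-in for a real message or task queue such as Pub/Sub (the threshold and records are hypothetical):

```python
import queue

# Stand-in for a real message/task queue (e.g. Pub/Sub, Cloud Tasks).
alert_queue = queue.Queue()

def process(records, threshold=100):
    """Process records in the lake; publish only those that need action."""
    for rec in records:
        if rec["error_rate"] > threshold:
            alert_queue.put({"id": rec["id"], "error_rate": rec["error_rate"]})

process([{"id": "a", "error_rate": 250}, {"id": "b", "error_rate": 3}])
```

Because only the relevant alerts leave the lake, downstream consumers never face the full request volume of the data processing framework.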
We suggest moving the data catalog, mentioned in shift 5 of the report, to this section.
The data catalog should be the main entry point for business users, data scientists and application developers to discover the available data, check data lineage and see the options to access the data. If you would like to learn more about the pitfalls of microservices, check out this point of view.
We look forward to an update of the report this year, because the landscape for data and ML keeps evolving at a fast pace.
Get in touch with your point of view, or if you have any questions or suggestions.