Last week, the Apache Beam YouTube channel uploaded all the recorded sessions from the Beam Summit that took place in July 2022, which our ML engineer Konstantin also attended to give a talk. There is a lot of interesting content out there, maybe even too much! So, we’ll try to make your lives easier by sharing our overall thoughts and takeaways from the summit and why we’re excited about Beam’s future. In addition, we’ll highlight some sessions we think are definitely worth checking out in more detail.
Although Beam is part of the open source community, Google is one of the driving forces behind Beam, as it’s used in multiple places within the company (with the Dataflow runner). The keynote talk outlined how Google’s efforts for Beam are focused on three axes: growing Beam, learning Beam, and contributing to Beam. We’re excited to see Beam grow even more and gain wider adoption.
Among all the case studies, deep dives, and general trend talks featured at the summit, our takeaways can be divided into three main themes:
Of course, this list is not exhaustive, but the bulk of the sessions can be placed under one of these themes. In the next section, we’ll go over our selected talks from each category that are worth checking out in more detail.
Starting from Beam 2.40, Machine Learning has gotten a big upgrade in the Apache Beam landscape. In general, there are two ways of doing inference in your pipelines: either you expose your model as an external service via REST or RPC and call it from your pipeline, or you load your model on your worker and do local inference.
Previously, if you wanted to do the latter, you had to write your own custom DoFn, or use the RunInference utility from the tfx_bsl package (which only supports TensorFlow models). Starting from Beam 2.40, you can use the built-in RunInference PTransform. This means there is now a clean and simple unified interface across all internal implementations, which takes care of a lot of boilerplate such as dynamic batching, key forwarding, and exposing standard metrics. Yay!
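To make that benefit concrete, here is a stdlib-only sketch of the pattern RunInference standardizes: one small, framework-specific handler, with the batching boilerplate written once in shared code. All names here (ModelHandler, run_batched_inference) are our own for illustration; they are not Beam’s actual classes.

```python
# Toy sketch of the unified-inference pattern (not Beam's real internals).
from typing import Any, Callable, Iterable, List, Sequence


class ModelHandler:
    """One small adapter per framework: how to load and how to predict."""

    def __init__(self, load: Callable[[], Any],
                 predict: Callable[[Any, Sequence[Any]], List[Any]]):
        self._load = load
        self._predict = predict
        self._model = None

    def model(self) -> Any:
        if self._model is None:  # load once per worker, not once per element
            self._model = self._load()
        return self._model

    def predict(self, batch: Sequence[Any]) -> List[Any]:
        return self._predict(self.model(), batch)


def run_batched_inference(elements: Iterable[Any], handler: ModelHandler,
                          batch_size: int = 4) -> List[Any]:
    """Shared boilerplate: batching is implemented once, for every framework."""
    out: List[Any] = []
    batch: List[Any] = []
    for element in elements:
        batch.append(element)
        if len(batch) == batch_size:
            out.extend(handler.predict(batch))
            batch = []
    if batch:  # flush the final partial batch
        out.extend(handler.predict(batch))
    return out


# Usage with a trivial "model" that doubles its input:
handler = ModelHandler(load=lambda: (lambda xs: [2 * x for x in xs]),
                       predict=lambda model, batch: model(batch))
print(run_batched_inference(range(6), handler))  # [0, 2, 4, 6, 8, 10]
```

The point of the real RunInference PTransform is the same: you supply a model handler for your framework, and Beam owns the generic machinery around it.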
As Machine Learning specialists, the pandas library is one of our go-to tools to inspect, analyze, and transform structured data. The Beam community also recognized the power of pandas: it has an efficient implementation (thanks to its columnar memory layout and C implementation), a declarative and concise API, and it’s a framework almost all (if not all) Python developers know. However, ML also involves large amounts of data, which can become a problem for pandas since it works in-memory only. As a result, the community came up with a way to combine the best of both worlds by building a pandas-compatible API in Beam. (Note: there are other libraries that aim to address this problem as well, such as Dask or Polars.)
After a glimpse of the Beam DataFrame API and how to use it, you’ll get a better understanding of how it works behind the scenes such as the idea of the proxy object and the expression metadata. Additionally, you’ll gain a deeper understanding on both Beam and pandas. For example, you might be surprised to learn that all pandas dataframes have an index and that many operations actually implicitly use it!
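As a quick illustration of that index point (plain pandas here, independent of Beam): arithmetic between two Series aligns on index labels, not on position, which often surprises people who think of a Series as a simple array.

```python
import pandas as pd

# Same values, but the index labels are ordered differently.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Addition aligns on the index, not on position:
# x -> 1 + 30 = 31, y -> 2 + 20 = 22, z -> 3 + 10 = 13
print((a + b).to_dict())
```

Implicit behavior like this is exactly what the Beam DataFrame implementation has to reproduce faithfully, which is part of what makes the talk interesting.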
Beam is incredibly powerful, but its value is only unlocked when you start using it in production. If you are eager to learn how you can reliably update your pipelines and deploy them across environments, this talk is for you. While it will probably not answer all the questions you might have, it is a good starting point for digging deeper depending on your exact needs for running CI/CD for Beam pipelines.
The first question you might have is how you can test your Beam pipelines. Luckily, Beam provides the TestPipeline class and utility functions to make your life easier. These enable you to test whether the output matches the expected outcome in the Beam paradigm. In addition, you’ll also see how you can test streaming pipelines!
The second step is to deploy your pipelines in a reliable manner. Flex Templates are a great way to achieve this, as staging and execution are separate steps. This also allows non-technical users to launch pipelines (from the Cloud Console, for example).
A great talk by Kenn Knowles takes a step back to look at the bigger picture and the landscape surrounding Beam. This talk offers some great insights into where Beam came from, which paradigm shifts it enabled, and in which directions data processing could evolve in the future. A must-watch to learn more about the data processing concepts surrounding Beam.
Want to know how we have used Beam in one of our projects? This session is for you! Here, our very own Konstantin shares insights into using Beam efficiently as an orchestrator for Machine Learning services to solve real-world problems, such as real-time semantic enrichment and online clustering of text content. We love this talk, but we might be a bit biased here of course ;)
If you want to have a look at any other recordings that could be interesting to you, you can find the full playlist here.
Interested in more Beam content? There is the Ray Summit, at which the Ray Beam Runner project for unified batch, streaming, and ML is presented in more detail. Alternatively, if you’re more interested in stream processing, there’s the Pulsar Summit with a talk on “Beam + Pulsar: Powerful Streaming Processing at Scale”.
Thanks to Matthias Feys