30 | Apache Beam实战冲刺：Beam如何run everywhere?

蔡元楠



该思维导图由 AI 生成，仅供参考

你好，我是蔡元楠。
今天我要与你分享的主题是“Apache Beam 实战冲刺：Beam 如何 run everywhere”。
你可能已经注意到，自第 26 讲到第 29 讲，从 Pipeline 的输入输出，到 Pipeline 的设计，再到 Pipeline 的测试，Beam Pipeline 的概念一直贯穿着文章脉络。那么这一讲，我们一起来看看一个完整的 Beam Pipeline 究竟是如何编写的。
Beam Pipeline一个 Pipeline，或者说是一个数据处理任务，基本上都会包含以下三个步骤：
读取输入数据到 PCollection。
对读进来的 PCollection 做某些操作（也就是 Transform），得到另一个 PCollection。
输出你的结果 PCollection。
这么说，看起来很简单，但你可能会有些迷惑：这些步骤具体该怎么做呢？其实这些步骤具体到 Pipeline 的实际编程中，就会包含以下这些代码模块：
Java
// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();
// Then create the pipeline.
Pipeline pipeline = Pipeline.create(options);
PCollection<String> lines = pipeline.apply(
  "ReadLines", TextIO.read().from("gs://some/inputData.txt"));
PCollection<String> filteredLines = lines.apply(new FilterLines());
filteredLines.apply("WriteMyFile", TextIO.write().to("gs://some/outputData.txt"));
pipeline.run().waitUntilFinish();
从上面的代码例子中你可以看到，第一行和第二行代码是创建 Pipeline 实例。任何一个 Beam 程序都需要先创建一个 Pipeline 的实例。Pipeline 实例就是用来表达 Pipeline 类型的对象。这里你需要注意，一个二进制程序可以动态包含多个 Pipeline 实例。

公开

同步至部落

取消

完成

0/2000

荧光笔

直线

曲线

笔记

复制

AI

深入了解
翻译
英语
中文简体
中文繁体
法语
德语
日语
韩语
俄语
西班牙语
阿拉伯语
解释
总结

Apache Beam实战冲刺：Beam如何run everywhere? 本文深入探讨了Apache Beam的实际应用，重点关注了如何在不同环境中运行Beam Pipeline。首先介绍了Beam Pipeline的基本步骤，包括数据读取、操作和输出结果。随后详细讲解了编写完整Beam Pipeline的方法，包括Pipeline实例创建、数据读取、操作和输出结果。此外，还介绍了Beam的延迟运行特性以及在单元测试环境中运行Beam Pipeline的方法。重点介绍了Beam在不同环境中的运行方式，包括直接运行模式、Spark运行模式和Flink运行模式。在直接运行模式下，Beam使用多线程模拟分布式并行处理；而在Spark和Flink运行模式下，Beam提供了相应的Runner来在这些平台上运行Beam Pipeline，并且可以通过命令行参数指定不同的Runner。此外，文章还介绍了在Spark和Flink上运行Beam程序的具体步骤和依赖关系。通过实例代码和详细讲解，本文帮助读者了解了Beam Pipeline的编写和运行方式，以及如何在不同环境中运行Beam程序，展现了Beam的灵活性和通用性。文章还提到了Google Cloud Dataflow作为完全托管的Beam Runner，以及如何在Google Cloud上运行Beam Pipeline。总的来说，本文为读者提供了全面的Beam Pipeline应用指南，使其能够快速了解Beam的灵活性和通用性，以及如何在不同环境中运行Beam程序。思考题：Beam的设计模式是对计算引擎动态选择，它为什么要这么设计？文章内容涵盖了Beam Pipeline的基本步骤、不同环境下的运行方式以及Google Cloud Dataflow的应用，为读者提供了全面的指南。

仅可试看部分内容，如需阅读全部内容，请付费购买文章所属专栏
《大规模数据处理实战》，新⼈⾸单¥59

立即购买

登录后留言

全部留言(5)

最新
精选

suncar
请问一下老师，可不可提供几个获取大量测试数据的网止。谢谢
作者回复: 谢谢留言！我比较推荐kaggle的datasets。
2019-07-01

11
明翼
想问下读者中多少人用beam在生产环境…
2019-07-02
6
5
hugo
runner是如何在多平台，多语言间实现兼容的？像flink，go runner会在本地调用java runner吗
2020-10-23

1
David
请教一下，GCP上同时有Composer/Airflow和Dataflow/Beam两种可以用来完成ETL工作的产品。是否可以讲一下两者的比较，和在技术上如何进行选型？谢谢！
2020-03-04

1
ditiki
请教两个production遇到的问题. In a beam pipeline (dataflow), one step is to send http request to schema registry to validate event schema. A groupby event type before this step and static cache are used to reduce calls to schema registry. How does beam (or the underline runner) optimise IO ? Is it a good practice to use a thread pool for asynchronous http calls ? The event object has a Json (json4s library) payload, each time we try to update the Dataflow pipeline, we get the error says that the Kryo coder generated for the JSON has changed, such that the current pipeline can’t be updated in place. We did a work a round by serialise the Json payload to string in a custom coder, which should be very inefficient. Have you ever seen this before ? Does Kryo generate a different coder at each compile time ? 多谢啦！
2019-07-03



收起评论