Apache Beam is an open-source, unified programming model for defining and executing parallel data processing pipelines. Its strength lies in its ability to run both batch and streaming pipelines, with execution carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.

In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.

By the end of this training, participants will be able to:
- Install and configure Apache Beam.
- Use a single programming model to carry out both batch and stream processing from their Java or Python application.
- Execute pipelines across multiple environments.

Audience
- Developers

Format of the Course
- Part lecture, part discussion, exercises and heavy hands-on practice

Note
- This course is planned to be available for Scala in the future. Please contact us to arrange.
Introduction
- Apache Beam vs MapReduce, Spark Streaming, Kafka Streams, Storm and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
- Beam Model, SDKs, Beam Pipeline Runners
- Distributed processing back-ends
Understanding the Apache Beam Programming Model
- How a pipeline is executed
Running a sample pipeline
- Preparing a WordCount pipeline
- Executing the Pipeline locally
Designing a Pipeline
- Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
- Writing the driver program and defining the pipeline
- Using Apache Beam classes
- Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
- Executing the pipeline locally, on remote machines, and on a public cloud
- Choosing a runner
- Runner-specific configurations
Testing and Debugging Apache Beam
- Using type hints to emulate static typing
- Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Making Your Pipelines Reusable and Maintainable
Creating New Data Sources and Sinks
- Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
- Apache Hadoop, Apache Spark, Apache Kafka
Troubleshooting
Summary and Conclusion