Hadoop is a popular Big Data processing framework. Python is a high-level programming language known for its clear syntax and code readability.
In this instructor-led, live training, participants will learn how to work with Hadoop, MapReduce, Pig, and Spark using Python as they step through multiple examples and use cases.
By the end of this training, participants will be able to:
- Understand the basic concepts behind Hadoop, MapReduce, Pig, and Spark
- Use Python with the Hadoop Distributed File System (HDFS), MapReduce, Pig, and Spark
- Use Snakebite to programmatically access HDFS from within Python
- Write MapReduce jobs in Python using mrjob
- Write Spark programs with Python
- Extend the functionality of Pig using Python UDFs
- Use Luigi to manage MapReduce jobs and Pig scripts
Audience
Format of the Course
Introduction
Understanding Hadoop's Architecture and Key Concepts
Understanding the Hadoop Distributed File System (HDFS)
- Overview of HDFS and its Architectural Design
- Interacting with HDFS
- Performing Basic File Operations on HDFS
- Overview of HDFS Command Reference
- Overview of Snakebite
- Installing Snakebite
- Using the Snakebite Client Library
- Using the CLI Client
Learning the MapReduce Programming Model with Python
- Overview of the MapReduce Programming Model
- Understanding Data Flow in the MapReduce Framework
- Map
- Shuffle and Sort
- Reduce
- Using the Hadoop Streaming Utility
- Understanding How the Hadoop Streaming Utility Works
- Demo: Implementing the WordCount Application in Python
- Using the mrjob Library
- Overview of mrjob
- Installing mrjob
- Demo: Implementing the WordCount Algorithm Using mrjob
- Understanding How a MapReduce Job Written with the mrjob Library Works
- Executing a MapReduce Application with mrjob
- Hands-on: Computing Top Salaries Using mrjob
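The map, shuffle-and-sort, and reduce phases above can be sketched with the classic WordCount example. This is a minimal pure-Python sketch in the Hadoop Streaming style (under Streaming, a mapper and reducer would instead read `sys.stdin` and print tab-separated key/value lines); here they are written as functions over iterables so the data flow can be run and inspected locally:

```python
# Minimal WordCount sketch in the Hadoop Streaming style: the mapper emits
# (word, 1) pairs and the reducer sums the counts for each word, assuming
# its input arrives sorted by key (which the shuffle-and-sort phase does).

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum counts per word; input must be sorted by key."""
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    if current_word is not None:
        yield current_word, current_count

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog"]
    shuffled = sorted(mapper(lines))      # simulate the shuffle-and-sort phase
    print(dict(reducer(shuffled)))        # e.g. {'brown': 1, 'dog': 1, 'the': 2, ...}
```

The same algorithm is what the mrjob demos implement: mrjob wraps this mapper/reducer pattern in a `MRJob` subclass and handles the Streaming plumbing for you.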
Learning Pig with Python
- Overview of Pig
- Demo: Implementing the WordCount Algorithm in Pig
- Configuring and Running Pig Scripts and Pig Statements
- Using the Pig Execution Modes
- Using the Pig Interactive Mode
- Using the Pig Batch Mode
- Understanding the Basic Concepts of the Pig Latin Language
- Using Statements
- Loading Data
- Transforming Data
- Storing Data
- Extending Pig's Functionality with Python UDFs
- Registering a Python UDF File
- Demo: A Simple Python UDF
- Demo: String Manipulation Using Python UDF
- Hands-on: Calculating the 10 Most Recent Movies Using Python UDF
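A Python UDF for Pig is a plain function in a file that Pig registers and runs via Jython. The sketch below shows a hypothetical string-manipulation UDF (`normalize_title` is an illustrative name, not from the course); the `pig_util` import and `@outputSchema` decorator are shown commented out so the file also runs as standalone Python:

```python
# Sketch of a Python UDF for Pig (executed via Jython). In a real UDF file
# the pig_util import and decorator below would be uncommented so Pig knows
# the output schema; the function body itself is ordinary Python.

# from pig_util import outputSchema

# @outputSchema("title:chararray")
def normalize_title(raw):
    """Hypothetical UDF: collapse whitespace and title-case a movie title."""
    if raw is None:
        return None
    return " ".join(raw.split()).title()

# In the Pig script, the UDF file would be registered and called like:
#   REGISTER 'udfs.py' USING jython AS myudfs;
#   titles = FOREACH movies GENERATE myudfs.normalize_title(title);

if __name__ == "__main__":
    print(normalize_title("  the  godfather "))   # The Godfather
```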
Using Spark and PySpark
- Overview of Spark
- Demo: Implementing the WordCount Algorithm in PySpark
- Overview of PySpark
- Using an Interactive Shell
- Implementing Self-Contained Applications
- Working with Resilient Distributed Datasets (RDDs)
- Creating RDDs from a Python Collection
- Creating RDDs from Files
- Implementing RDD Transformations
- Implementing RDD Actions
- Hands-on: Implementing a Text Search Program for Movie Titles with PySpark
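The RDD programming model above distinguishes lazy transformations (`map`, `filter`) from actions (`collect`, `count`) that actually trigger computation. The following is a pure-Python teaching stand-in for that pattern, not the real pyspark API; with PySpark you would create RDDs from `SparkContext.parallelize()` or `SparkContext.textFile()` instead:

```python
# Teaching stand-in for the RDD pattern: transformations build a lazy
# pipeline of generators, while actions evaluate it. Real PySpark RDDs
# are partitioned across a cluster; this sketch runs everything locally.

class LocalRDD:
    def __init__(self, data):
        self._data = data               # in Spark this would be partitioned

    # --- transformations: return a new, still-lazy "RDD" ---
    def map(self, f):
        return LocalRDD(f(x) for x in self._data)

    def filter(self, pred):
        return LocalRDD(x for x in self._data if pred(x))

    # --- actions: actually evaluate the pipeline ---
    def collect(self):
        return list(self._data)

    def count(self):
        return sum(1 for _ in self._data)

if __name__ == "__main__":
    # Tiny version of the text-search exercise: find titles containing "the".
    titles = LocalRDD(["The Matrix", "Toy Story", "The Godfather"])
    hits = titles.filter(lambda t: "the" in t.lower()).map(str.upper)
    print(hits.collect())   # ['THE MATRIX', 'THE GODFATHER']
```

Note that, as in Spark, nothing is computed until an action is called; unlike Spark, each generator-backed pipeline here can only be consumed once.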
Managing Workflow with Python
- Overview of Apache Oozie and Luigi
- Installing Luigi
- Understanding Luigi Workflow Concepts
- Demo: Examining a Workflow that Implements the WordCount Algorithm
- Working with Hadoop Workflows that Control MapReduce and Pig Jobs
- Using Luigi's Configuration Files
- Working with MapReduce in Luigi
- Working with Pig in Luigi
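Luigi's core workflow concepts — tasks that declare dependencies via `requires()`, do work in `run()`, and are skipped once complete — can be illustrated with a simplified in-memory stand-in. Real Luigi tasks subclass `luigi.Task` and mark completion with file-based Targets; the scheduler and task names below are illustrative only:

```python
# Simplified stand-in for Luigi's workflow model: a task declares its
# upstream dependencies in requires() and its work in run(); the scheduler
# runs dependencies first and skips anything already complete. Real Luigi
# uses luigi.Task subclasses with Target-based completion checks.

class Task:
    done = False
    def requires(self):
        return []
    def run(self):
        pass

def build(task, log):
    """Depth-first scheduler: run all upstream tasks, then the task itself."""
    if task.done:
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    task.done = True
    log.append(type(task).__name__)

class ExtractText(Task):
    def run(self):
        self.lines = ["hello world", "hello luigi"]

class CountWords(Task):
    def __init__(self, upstream):
        self.upstream = upstream
    def requires(self):
        return [self.upstream]
    def run(self):
        self.counts = {}
        for line in self.upstream.lines:
            for w in line.split():
                self.counts[w] = self.counts.get(w, 0) + 1

if __name__ == "__main__":
    extract = ExtractText()
    count = CountWords(extract)
    order = []
    build(count, order)
    print(order)   # ['ExtractText', 'CountWords']
```

In a real Hadoop workflow, the `run()` bodies would launch MapReduce jobs or Pig scripts, and Luigi's configuration files would supply cluster settings.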
Summary and Conclusion