Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Overview of Python and Scala

Core Concepts (Theory):

  • Architecture
  • RDD
  • Transformations and Actions
  • Stages, Tasks, and Dependencies

Hands-on Workshop: Understanding Basics via the Databricks Environment

  • Exercises utilizing the RDD API
      • Basic action and transformation functions
      • PairRDD
      • Joins
      • Caching strategies
  • Exercises utilizing the DataFrame API
      • Spark SQL
      • DataFrame operations: select, filter, group, sort
      • UDF (User Defined Functions)
  • Exploration of the Dataset API
  • Streaming

Hands-on Workshop: Deployment in the AWS Environment

  • Fundamentals of AWS Glue
  • Distinguishing between AWS EMR and AWS Glue
  • Example jobs in both environments
  • Analysis of pros and cons

Additional Content:

  • Introduction to Apache Airflow orchestration

Requirements

  • Programming skills (preferably Python or Scala)
  • Basic knowledge of SQL

Duration: 21 hours
