Cloudera Developer Training for Apache Spark Course Content
Why Spark?
- Problems with Traditional Large-Scale Systems
- Introducing Spark
Spark Basics
- What is Apache Spark?
- Using the Spark Shell
- Resilient Distributed Datasets (RDDs)
- Functional Programming with Spark
Working with RDDs
- RDD Operations
- Key-Value Pair RDDs
- MapReduce & Pair RDD Operations
Hadoop Distributed File System
- Why HDFS?
- HDFS Architecture
- Using HDFS
Running Spark on a Cluster
- A Spark Standalone Cluster
- Spark Standalone Web UI
Parallel Programming with Spark
- RDD Partitions & HDFS Data Locality
- Working with Partitions
- Executing Parallel Operations
Caching & Persistence
- RDD Lineage
- Caching Overview
- Distributed Persistence
Writing Spark Applications
- Spark Applications vs. Spark Shell
- Creating the SparkContext
- Configuring Spark Properties
- Building & Running a Spark Application
- Logging
Spark, Hadoop, & the Enterprise Data Center
- Spark & the Hadoop Ecosystem
- Spark & MapReduce
Spark Streaming
- Example: Streaming Word Count
- Other Streaming Operations
- Sliding Window Operations
- Developing Spark Streaming Applications
Common Spark Algorithms
- Iterative Algorithms
- Graph Analysis
- Machine Learning
Improving Spark Performance
- Shared Variables: Broadcast Variables & Accumulators
- Common Performance Issues