1. Introduction to Hadoop
Enterprise Data Trends @ Scale
What is Big Data?
A Market for Big Data
Characteristics of Big Data: the 3 Vs, 5 Vs, and 7 Vs of Big Data
Most Common New Types of Data
Moving from Causation to Correlation
What is Hadoop? And Why Hadoop?
Traditional Systems vs. Hadoop
What is Hadoop 2.0?
Overview of a Hadoop Cluster and Core components of Hadoop
Different distributions of Hadoop
Hadoop Use Case
Lab exercise: - Log in to Your Cluster
2. Hadoop Architecture
Characteristics of Hadoop
a. Fault tolerance
b. Replication
c. Block size
d. Robustness
What is a Node, Rack, Cluster, Datacenter, and Data Hub?
MapReduce Architecture
HDFS Architecture
Understanding Block Storage
Demonstration: Understanding Block Storage
The NameNode
The Data Nodes
HDFS Clients
3. Installing Hadoop Cluster using Cloudera Manager
Minimum Hardware Requirements
Minimum Software Requirements
A Formidable Starter Cluster
Lab exercise: - Setting up the Environment
Lab exercise: - Installing Cloudera Manager and CDH
Lab exercise: - Adding Services to Cluster
4. Configuring Hadoop
Hadoop configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, bigtop_utils, masters and slaves files)
Configuration Considerations
Deployment Layout
Configuring Hadoop Ports
Configuring HDFS
What Does the File System Check Look For?
Replication Factor
Understanding Hadoop Logs
What is Cloudera Manager
Configuration via Cloudera Manager
Management Monitoring
REST API and Thrift Server Overview
Lab exercise: - Commissioning and Decommissioning of Nodes
Lab exercise: - Stopping and Starting CDH Services
Lab exercise: - Using HDFS Commands, hadoop fsck Syntax, and the hadoop dfsadmin Command
5. Ensuring Data Integrity
Replication Placement
Data Integrity - Writing Data
Data Integrity - Reading Data
Data Integrity - Block Scanning
Running a File System Check
What Does the File System Check Look for?
hadoop fsck Syntax
Data Integrity - File System Check: Commands & Output
Hadoop dfsadmin Command
NameNode Information
Changing the Replication Factor
Lab exercise: - Verify Data with Block Scanner and fsck
7. MapReduce and YARN
MapReduce
Understanding MapReduce
What is YARN?
YARN Architecture (RM, NM, AM, Container)
Lifecycle of a YARN Application
Configuring YARN
Configuring MapReduce tools
YARN application logs
YARN CLI
Lab exercise: - Troubleshooting a MapReduce Job
8. Job Schedulers
Overview of Job Scheduling
The Built-in Schedulers
Overview of the Capacity Scheduler
Configuring the Capacity Scheduler
Defining Queues
Configuring Capacity Limits
Configuring User Limits
Configuring Permissions
Overview of the Fair Scheduler
Multi-Tenancy Limits
Lab exercise: Configuring the Capacity Scheduler
9. Enterprise Data Movement Backup and Recovery
What should you back up?
HDFS Snapshots
HDFS Data - Backups
HDFS Data - Automate & Restore
Overview of BDR (Backup and Disaster Recovery)
Lab exercise: - Using HDFS Snapshots
Managing Resources
- Configuring groups with Static Service Pools
- The Fair Scheduler
- Configuring Dynamic Resource Pools
- YARN Memory and CPU Settings
10. Hive Administration
Introduction and architecture of Hive
Comparing Hive with RDBMS
Hive Components: Hive Metastore, HiveServer2, HCatalog
Hive Clients: Beeline
11. Sqoop
Overview of Sqoop
The Sqoop Import Tool
Importing a Table
Importing Specific Columns
The Sqoop Export Tool
Lab exercise: - Using Sqoop
12. Flume
Flume Introduction
Installing Flume
Flume Configuration
Monitoring Flume
Lab exercise: - Install and Test Flume
13. Oozie
Oozie Overview
Oozie Components
Jobs, Workflows, Coordinators, Bundles
Workflow Actions and Decisions
Oozie Job Submission
Oozie Console
The Oozie CLI
Using the Oozie CLI
Oozie Actions
Lab exercise: Running an Oozie Workflow
15. HBase
Overview
Why HBase?
Architecture
HBase Components and Daemons
HBase Administration and Cluster Management
Cluster Monitoring and Troubleshooting
Cloudera Manager Monitoring Features
Configuring Events and Alerts
Monitoring Hadoop Clusters
Troubleshooting Hadoop services
Common Misconfigurations
Monitoring Cluster services using Charts
Using the Trigger Option
Monitoring JVM Processes
Understanding JVM Memory
Eclipse Memory Analyzer
JVM Memory Heap Dump
Java Management Extensions (JMX)
Garbage Collection Tuning
16. Commissioning and De-commissioning of Cluster Nodes
Decommissioning and Commissioning Nodes
Decommissioning Nodes
Steps for Decommissioning a Node
Decommissioning Node States
Steps for Commissioning a Node
Balancer
Balancer Threshold Setting
Configuring Balancer Bandwidth
Lab exercise: - Commissioning & Decommissioning Nodes
17. Backup and Recovery
What should you back up?
HDFS Snapshots
HDFS Data - Backups
HDFS Data - Automate & Restore
Hive & Backup
BDR (Backup and Disaster Recovery)
Lab exercise: - Using HDFS Snapshots
18. Rack Awareness
Rack Awareness
YARN Rack Awareness
Replica Placement
Rack Topology
Rack Topology Script
Configuring the Rack Topology Script
Lab exercise: Configuring Rack Awareness
19. NameNode High Availability
NameNode Architecture Cloudera
NameNode High Availability
HDFS HA Components
Understanding NameNode HA
NameNodes in HA
Failover Modes
NameNode Architectures
hdfs haadmin Command
Protecting Metadata Repositories
Lab exercise: - Configure NameNode High Availability using Cloudera Manager
20. Security in Hadoop
Security Concepts - Why Hadoop Security is required?
Kerberos Synopsis - How it works?
Enabling Kerberos via Cloudera Manager
Lab exercise: - Installing and Configuring Kerberos
Miscellaneous:
Overview & Architecture of the following:
Kafka
Solr