1. Introduction to Hadoop
Enterprise Data Trends @ Scale
What is Big Data?
A Market for Big Data
Characteristics of Big Data: 3 Vs and 5 Vs
The 7 Vs of Big Data
Most Common New Types of Data
Moving from Causation to Correlation
What is Hadoop? And Why Hadoop?
Traditional Systems vs. Hadoop
What is Hadoop 2.0?
Overview of a Hadoop Cluster and Core Components of Hadoop
Different distributions of Hadoop
Hadoop Use Case
Lab exercise: Log In to Your Cluster
2. Hadoop Architecture
Characteristics of Hadoop
a. Fault tolerance
b. Replication
c. Block size
d. Robustness
What Are a Node, Rack, Cluster, Data Center, and Data Hub?
MapReduce Architecture
HDFS Architecture
Understanding Block Storage
Demonstration: Understanding Block Storage
The NameNode
The DataNodes
HDFS Clients
3. Installing a Hadoop Cluster Using Cloudera Manager
Minimum Hardware Requirements
Minimum Software Requirements
A Formidable Starter Cluster
Lab exercise: Setting up the Environment
Lab exercise: Installing Cloudera Manager and CDH
Lab exercise: Adding Services to Cluster
4. Configuring Hadoop
Hadoop configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, bigtop_utils, master and slave files)
Configuration Considerations
Deployment Layout
Configuring Hadoop Ports
Configuring HDFS
What Does the File System Check Look For?
Replication Factor
Understanding Hadoop Logs
What is Cloudera Manager
Configuration via Cloudera Manager
Management and Monitoring
REST API and Thrift Server Overview
Lab exercise: Commissioning and Decommissioning of Nodes
Lab exercise: Stopping and Starting CDH Services
Lab exercise: Using HDFS Commands, hadoop fsck Syntax, and the hadoop dfsadmin Command
5. Ensuring Data Integrity
Replication Placement
Data Integrity - Writing Data
Data Integrity - Reading Data
Data Integrity - Block Scanning
Running a File System Check
What Does the File System Check Look For?
hadoop fsck Syntax
Data Integrity - File System Check: Commands & Output
Hadoop dfsadmin Command
NameNode Information
Changing the Replication Factor
Lab exercise: Verify Data with Block Scanner and fsck
7. MapReduce and YARN
MapReduce
Understanding MapReduce
What is YARN?
YARN Architecture (RM, NM, AM, Container)
Lifecycle of a YARN Application
Configuring YARN
Configuring MapReduce
tools
YARN application logs
YARN CLI
Lab exercise: Troubleshooting a MapReduce Job
8. Job Schedulers
Overview of Job Scheduling
The Built-in Schedulers
Overview of the Capacity Scheduler
Configuring the Capacity Scheduler
Defining Queues
Configuring Capacity Limits
Configuring User Limits
Configuring Permissions
Overview of the Fair Scheduler
Multi-Tenancy Limits
Lab exercise: Configuring the Capacity Scheduler
9. Enterprise Data Movement Backup and Recovery
What should you backup?
HDFS Snapshots
HDFS Data - Backups
HDFS Data - Automate & Restore
Overview of BDR (Backup Disaster Recovery)
Lab exercise: Using HDFS Snapshots
Managing Resources
- Configuring groups with Static Service Pools
- The Fair Scheduler
- Configuring Dynamic Resource Pools
- YARN Memory and CPU Settings
10. Hive Administration
Introduction and architecture of Hive
Comparing Hive with RDBMS
Hive Components: Hive MetaStore, HiveServer2, HCatalog
Hive Clients: beeline
11. Sqoop
Overview of Sqoop
The Sqoop Import Tool
Importing a Table
Importing Specific Columns
The Sqoop Export Tool
Lab exercise: Using Sqoop
12. Flume
Flume Introduction
Installing Flume
Flume Configuration
Monitoring Flume
Lab exercise: Install and Test Flume
13. Oozie
Oozie Overview
Oozie Components
Jobs, Workflows, Coordinators, Bundles
Workflow Actions and Decisions
Oozie Job Submission
Oozie Console
The Oozie CLI
Using the Oozie CLI
Oozie Actions
Lab exercise: Running an Oozie Workflow
15. HBase
Overview
Why HBase?
Architecture
HBase Components and Daemons
HBase Administration and Cluster Management
Cluster Monitoring and Troubleshooting
Cloudera Manager Monitoring Features
Configuring Events and Alerts
Monitoring Hadoop Clusters
Troubleshooting Hadoop services
Common Misconfigurations
Monitoring Cluster services using Charts
Using Trigger option
Monitoring JVM Processes
Understanding JVM Memory
Eclipse Memory Analyzer
JVM Memory Heap Dump
Java Management Extensions (JMX)
Garbage Collection Tuning
16. Commissioning and Decommissioning of Cluster Nodes
Decommissioning and Commissioning Nodes
Decommissioning Nodes
Steps for Decommissioning a Node
Decommissioning Node States
Steps for Commissioning a Node
Balancer
Balancer Threshold Setting
Configuring Balancer Bandwidth
Lab exercise: Commissioning & Decommissioning Nodes
17. Backup and Recovery
What should you backup?
HDFS Snapshots
HDFS Data - Backups
HDFS Data - Automate & Restore
Hive & Backup
BDR (Backup Disaster Recovery)
Lab exercise: Using HDFS Snapshots
18. Rack Awareness
Rack Awareness
YARN Rack Awareness
Replica Placement
Rack Topology
Rack Topology Script
Configuring the Rack Topology Script
Lab exercise: Configuring Rack Awareness
19. NameNode High Availability
NameNode Architecture Cloudera
NameNode High Availability
HDFS HA Components
Understanding NameNode HA
NameNodes in HA
Failover Modes
NameNode Architectures
hdfs haadmin Command
Protecting Metadata Repositories
Lab exercise: Configure NameNode High Availability using Cloudera Manager
20. Security in Hadoop
Security Concepts - Why Is Hadoop Security Required?
Kerberos Synopsis - How Does It Work?
Enabling Kerberos via Cloudera Manager
Lab exercise: Installing and Configuring Kerberos
Miscellaneous:
Overview & Architecture of the following:
Kafka
Solr