Course Price: $299.00
- Hands-on exercises will be provided in Scala and Python
- The ability to program in one of these languages is required
- Basic knowledge of SQL is required
- The Linux command line will be used
1. Introduction to Apache Hadoop and the Hadoop Ecosystem
- Apache Hadoop Overview
- Data Ingestion and Storage
- Data Processing
- Data Analysis and Exploration
- Other Ecosystem Tools
- Introduction to the Hands-On Exercises
2. Apache Hadoop File Storage
- Apache Hadoop Cluster Components
- HDFS Architecture
- Using HDFS
3. Distributed Processing on an Apache Hadoop Cluster
- YARN Architecture
- Working With YARN
4. Apache Spark Basics
- What is Apache Spark?
- Starting the Spark Shell
- Using the Spark Shell
- Getting Started with Datasets and DataFrames
- DataFrame Operations
5. Working with DataFrames and Schemas
- Creating DataFrames from Data Sources
- Saving DataFrames to Data Sources
- DataFrame Schemas
- Eager and Lazy Execution
6. Analyzing Data with DataFrame Queries
- Querying DataFrames Using Column Expressions
- Grouping and Aggregation Queries
- Joining DataFrames
7. RDD Overview
- RDD Overview
- RDD Data Sources
- Creating and Saving RDDs
- RDD Operations
8. Transforming Data with RDDs
- Writing and Passing Transformation Functions
- Transformation Execution
- Converting Between RDDs and DataFrames
9. Aggregating Data with Pair RDDs
- Key-Value Pair RDDs
- Other Pair RDD Operations
10. Querying Tables and Views with Apache Spark SQL
- Querying Tables in Spark Using SQL
- Querying Files and Views
- The Catalog API
- Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
11. Working with Datasets in Scala
- Datasets and DataFrames
- Creating Datasets
- Loading and Saving Datasets
- Dataset Operations
12. Writing, Configuring, and Running Apache Spark Applications
- Writing a Spark Application
- Building and Running an Application
- Application Deployment Mode
- The Spark Application Web UI
- Configuring Application Properties
13. Distributed Processing
- Review: Apache Spark on a Cluster
- RDD Partitions
- Example: Partitioning in Queries
- Stages and Tasks
- Job Execution Planning
- Example: Catalyst Execution Plan
- Example: RDD Execution Plan
14. Distributed Data Persistence
- DataFrame and Dataset Persistence
- Persistence Storage Levels
- Viewing Persisted RDDs
15. Common Patterns in Apache Spark Data Processing
- Common Apache Spark Use Cases
- Iterative Algorithms in Apache Spark
- Machine Learning
- Example: k-means
16. Apache Spark Streaming: Introduction to DStreams
- Apache Spark Streaming Overview
- Example: Streaming Request Count
- Developing Streaming Applications
17. Apache Spark Streaming: Processing Multiple Batches
- Multi-Batch Operations
- Time Slicing
- State Operations
- Sliding Window Operations
- Preview: Structured Streaming
18. Apache Spark Streaming: Data Sources
- Streaming Data Source Overview
- Apache Flume and Apache Kafka Data Sources
- Example: Using a Kafka Direct Data Source
Pankaj Kumar Pathak
- 15+ years of experience in Big Data and Hadoop
- Successfully implemented and migrated on-demand data from traditional RDBMS systems (Oracle, SQL Server) to NoSQL stores (Cassandra, MongoDB, etc.) on Hadoop clusters, and delivered extensive training in India and overseas. Major corporate clients served over the last four and a half years include:
- Times Internet: implemented Apache Spark with Cassandra on a Hadoop cluster for collecting multiple log files.
- Amar Ujala: Hadoop cluster planning and sizing, with data migration from SQL Server to Cassandra.
- TCS: three corporate batches on Hadoop administration and data warehousing with Cassandra and MongoDB (Cloudera, Hortonworks).
- HCL Infosystems: Hadoop cluster implementation and migration from DB2.
- HCL Technologies: Hadoop, Spark with Scala, Flume, and Cassandra NoSQL.
- IBM: two corporate batches on Hadoop clustering, Cloudera Manager, and related tools.
- Dish TV: implemented data warehousing on a Hadoop cluster.
- UHG: implemented a 20-node Hadoop cluster for warehousing using Hive/Impala and MapReduce.
- Genpact: Hadoop, Spark with Scala, Flume, and R.
- Nucleus Software: Hadoop cluster planning and sizing for data warehouses using Cassandra.
- Tech Mahindra: implemented Spark with Cassandra on a cluster for collecting multiple log files; migrated DB2 data.
- BARC Mumbai: Hadoop clustering with Spark and Cassandra.
- Providing consultancy to two UK-based clients on Data Science implementation
Interview Questions & Answers
1) What do you know about the term “Big Data”?
Big Data is a term associated with complex and large datasets. A relational database cannot handle big data, which is why special tools and methods are used to perform operations on such vast collections of data. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured and raw data collected on a regular basis. It also allows companies to make better business decisions backed by data.
2) What are the five V’s of Big Data?
The five V’s of Big data are as follows:
- Volume – Volume represents the amount of data, which is growing at a high rate, i.e. data volumes in petabytes
- Velocity – Velocity is the rate at which data grows. Social media plays a major role in the velocity of growing data.
- Variety – Variety refers to the different data types, i.e. various data formats like text, audio, video, etc.
- Veracity – Veracity refers to the uncertainty of available data. Veracity arises because the high volume of data brings incompleteness and inconsistency.
- Value – Value refers to turning data into value. By turning accessed big data into value, businesses may generate revenue.
3) Tell us how big data and Hadoop are related to each other.
Big data and Hadoop are near-synonymous terms: Hadoop is a solution to big data. With the rise of big data, Hadoop, a framework specialized for big data operations, also became popular. Professionals can use the framework to analyze big data and help businesses make decisions.
4) Explain the steps to be followed to deploy a Big Data solution.
The three steps that are followed to deploy a Big Data Solution are –
i. Data Ingestion
The first step in deploying a big data solution is data ingestion, i.e. the extraction of data from various sources. The data source may be a CRM like Salesforce, an Enterprise Resource Planning system like SAP, an RDBMS like MySQL, or other log files, documents, social media feeds, etc. The data can be ingested either through batch jobs or through real-time streaming. The extracted data is then stored in HDFS.
ii. Data Storage
After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (e.g. HBase). HDFS storage works well for sequential access, whereas HBase suits random read/write access.
iii. Data Processing
The final step in deploying a big data solution is data processing. The data is processed through one of the processing frameworks such as Spark, MapReduce, or Pig.
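The three steps above can be sketched as a toy, single-process pipeline. This is illustrative only: the source data and function names are invented for the example, and a real deployment would use ingestion tools (Sqoop, Kafka, Flume), HDFS or HBase for storage, and Spark or MapReduce for processing.

```python
# Toy local sketch of ingest -> store -> process (names are illustrative).
from collections import Counter

def ingest(sources):
    """Step i: pull raw records from several sources (CRM, logs, ...)."""
    for source in sources:
        yield from source

def store(records):
    """Step ii: persist the records (here, just an in-memory list)."""
    return list(records)

def process(stored):
    """Step iii: derive a result, e.g. count events per user."""
    return Counter(user for user, _event in stored)

# Two hypothetical sources emitting (user, event) records.
crm = [("alice", "login"), ("bob", "purchase")]
logs = [("alice", "click"), ("alice", "logout")]

stored = store(ingest([crm, logs]))
result = process(stored)
print(result["alice"])  # alice appears in 3 records
```

The same three-phase shape holds at cluster scale; only the technologies behind each function change.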
5) What are the two main parts of the Hadoop framework?
The two main parts of Hadoop Framework are:
- Hadoop Distributed File System (HDFS), a distributed file system with high throughput
- Hadoop MapReduce, a software framework for processing large data sets
6) What is HDFS?
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
7) What is YARN?
YARN (Yet Another Resource Negotiator) is the next-generation MapReduce, also called MapReduce 2 or MRv2. It was introduced in the Hadoop 0.23 release to overcome the scalability issues of the classic MapReduce framework by splitting the JobTracker's responsibilities, moving cluster resource management into a dedicated Resource Manager.
8) Define respective components of HDFS and YARN?
The two main components of HDFS are-
- Name Node – the master node, which maintains and processes the metadata for the data blocks stored in HDFS
- Data Node/Slave Node – the node that acts as a slave and stores the data, for processing and use by the Name Node
In addition to the Name Node serving client requests, HDFS can run either of the following auxiliary nodes –
- Checkpoint Node – runs on a different host from the Name Node
- Backup Node – a read-only Name Node that holds the file system metadata, excluding the block locations
The two main components of YARN are–
- Resource Manager – receives processing requests and allocates them to the appropriate Node Managers according to processing needs.
- Node Manager – executes tasks on each individual Data Node.
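To make the Resource Manager/Node Manager split concrete, here is a deliberately simplified placement sketch: the Resource Manager's job is to pick a node with enough spare capacity, and the chosen Node Manager then runs the container locally. The node names, memory figures, and the "most free memory" policy are all invented for illustration; real YARN scheduling involves queues, data locality, and fairness policies.

```python
# Toy placement decision, roughly what a ResourceManager does conceptually.
def allocate(node_free_mem, request_mb):
    """Return the node that should host a container of `request_mb` MB,
    choosing the node with the most free memory, or None if none fits."""
    candidates = {node: free for node, free in node_free_mem.items()
                  if free >= request_mb}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Free memory (MB) reported by three hypothetical NodeManagers.
nodes = {"nm-1": 2048, "nm-2": 4096, "nm-3": 1024}

print(allocate(nodes, 3072))  # 'nm-2' is the only node with enough room
print(allocate(nodes, 8192))  # None: no node can satisfy the request
```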
9) What is a heartbeat in HDFS?
A heartbeat is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode, and a TaskTracker sends heartbeats to the JobTracker. If the NameNode or JobTracker stops receiving heartbeats, it concludes that there is a problem with the DataNode, or that the TaskTracker is unable to perform its assigned task.
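The failure-detection idea behind heartbeats can be sketched in a few lines. This is a conceptual illustration only: the node names, timestamps, and the 30-second timeout are invented for the example, not HDFS's actual configuration values.

```python
# Minimal sketch of heartbeat-based failure detection: a master records the
# time of each node's last heartbeat and flags nodes that have gone silent.
HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before a node is considered dead

def dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the nodes whose most recent heartbeat is older than `timeout`."""
    return sorted(node for node, ts in last_heartbeat.items()
                  if now - ts > timeout)

# Timestamps (seconds) of the last heartbeat received from each DataNode.
last_seen = {"datanode-1": 100.0, "datanode-2": 125.0, "datanode-3": 90.0}

print(dead_nodes(last_seen, now=130.0))  # ['datanode-3'] has gone silent
```

In real HDFS the NameNode reacts to a dead DataNode by re-replicating the blocks it held; the check above only shows the detection half.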
10) What is Apache Hive?
Apache Hive is data warehouse software. It is used to facilitate managing and querying large data sets stored in distributed storage. Hive also permits traditional MapReduce programs to supply custom mappers and reducers when it is inefficient to express the logic in HiveQL.
11) What are the key components of Job flow in YARN architecture?
A MapReduce job flow in the YARN architecture has the following components:
- A Client node, which submits the MapReduce job.
- YARN Node Managers, which launch and monitor the tasks of jobs.
- MapReduce Application Master, which coordinates the tasks running in the MapReduce job.
- YARN Resource Manager, which allocates the cluster resources to jobs.
- The HDFS file system, which is used for sharing job files between the above entities.
12) What is the importance of Application Master in YARN architecture?
The Application Master negotiates resources from the Resource Manager and works with the Node Manager(s) to run and monitor the tasks. It requests containers for all map and reduce tasks; as containers are assigned to tasks, it starts them by contacting the corresponding Node Manager. It collects progress information from all the tasks and propagates the values to the user or client node.
13) What do you mean by MapReduce in Hadoop?
MapReduce is a framework for processing huge raw data sets using a large number of computers. It processes the raw data in two phases, the Map phase and the Reduce phase. The MapReduce programming model scales easily to large data sets and is integrated with HDFS, so processing is distributed across the data nodes of the cluster.
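The two phases can be sketched with the classic word count, run here as a single-process toy (real MapReduce distributes each step across cluster nodes, and the shuffle is performed by the framework, not by user code):

```python
# Single-process sketch of MapReduce word count: map, shuffle, reduce.
from collections import defaultdict

def map_phase(line):
    """Map: emit an intermediate (key, value) pair for each word."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into the final result."""
    return (key, sum(values))

lines = ["big data big ideas", "big data tools"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["big"])  # 3
```

Note that the mapper and reducer contain all the problem-specific logic; partitioning, shuffling, and fault tolerance are the framework's job.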
14) What are the key/value pairs in the MapReduce framework?
The MapReduce framework implements a data model in which data is represented as key/value pairs. Both the input to and the output of the MapReduce framework must be key/value pairs.