Course Price: $299.00
- Acquire, store, and analyze data using features in Pig, Hive, and Impala
- Perform fundamental ETL (extract, transform, and load) tasks with Hadoop tools
- Use Pig, Hive, and Impala to improve productivity for typical analysis tasks
- Join diverse datasets to gain valuable business insight
- Perform interactive, complex queries on datasets
1. Introduction to Apache Hadoop Fundamentals
- The Motivation for Hadoop
- Hadoop Overview
- Data Storage: HDFS
- Distributed Data Processing: YARN, MapReduce, and Spark
- Data Processing and Analysis: Pig, Hive, and Impala
- Database Integration: Sqoop
- Other Hadoop Data Tools
- Exercise Scenarios
2. Introduction to Apache Pig
- What is Pig?
- Pig’s Features
- Pig Use Cases
- Interacting with Pig
3. Basic Data Analysis with Apache Pig
- Pig Latin Syntax
- Loading Data
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly Used Functions
4. Processing Complex Data with Apache Pig
- Storage Formats
- Complex/Nested Data Types
- Built-In Functions for Complex Data
- Iterating Grouped Data
5. Multi-Dataset Operations with Apache Pig
- Techniques for Combining Datasets
- Joining Datasets in Pig
- Set Operations
- Splitting Datasets
6. Apache Pig Troubleshooting and Optimization
- Troubleshooting Pig
- Using Hadoop’s Web UI
- Data Sampling and Debugging
- Performance Overview
- Understanding the Execution Plan
- Tips for Improving the Performance of Pig Jobs
7. Introduction to Apache Hive and Impala
- What is Hive?
- What is Impala?
- Why Use Hive and Impala?
- Schema and Data Storage
- Comparing Hive and Impala to Traditional Databases
- Use Cases
8. Querying with Apache Hive and Impala
- Databases and Tables
- Basic Hive and Impala Query Language Syntax
- Data Types
- Using Hue to Execute Queries
- Using Beeline (Hive’s Shell)
- Using the Impala Shell
9. Apache Hive and Impala Data Management
- Data Storage
- Creating Databases and Tables
- Loading Data
- Altering Databases and Tables
- Simplifying Queries with Views
- Storing Query Results
10. Data Storage and Performance
- Partitioning Tables
- Loading Data into Partitioned Tables
- When to Use Partitioning
- Choosing a File Format
- Using Avro and Parquet File Formats
11. Relational Data Analysis with Apache Hive and Impala
- Joining Datasets
- Common Built-In Functions
- Aggregation and Windowing
12. Complex Data with Apache Hive and Impala
- Complex Data with Hive
- Complex Data with Impala
13. Analyzing Text with Apache Hive and Impala
- Using Regular Expressions with Hive and Impala
- Processing Text Data with SerDes in Hive
- Sentiment Analysis and n-grams in Hive
14. Apache Hive Optimization
- Understanding Query Performance
- Indexing Data
- Hive on Spark
15. Apache Impala Optimization
- How Impala Executes Queries
- Improving Impala Performance
16. Extending Apache Hive and Impala
- Custom SerDes and File Formats in Hive
- Data Transformation with Custom Scripts in Hive
- User-Defined Functions
- Parameterized Queries
17. Choosing the Best Tool for the Job
- Comparing Pig, Hive, Impala and Relational Databases
- Which to Choose?
Pankaj Kumar Pathak
- 15+ years of experience in Big Data Hadoop
- Successfully implemented and migrated data on demand from traditional RDBMS (Oracle, SQL Server) to NoSQL stores (Cassandra, MongoDB, etc.) on Hadoop clusters, and provided extensive training in India and overseas. The major corporate clients I have delivered this work for over the last four and a half years are:
- Times Internet: Implemented Apache Spark with Cassandra on a Hadoop RAC server for collecting multiple log files.
- Amar Ujala: Hadoop cluster planning and sizing, with data migration from SQL Server to Cassandra.
- TCS: 3 corporate batches for Hadoop administration and data warehousing with Cassandra and MongoDB (Cloudera, Hortonworks).
- HCL Infosystems: Hadoop cluster implementation and migration from DB2.
- HCL Technologies: Hadoop, Spark-Scala, Flume, Cassandra NoSQL.
- IBM: 2 corporate batches for Hadoop clustering, Cloudera Manager, and others.
- Dish TV: Implemented data warehousing on a Hadoop cluster.
- UHG: Implemented a 20-node Hadoop cluster for warehousing using Hive/Impala and MapReduce.
- Genpact: Hadoop, Spark-Scala, Flume, and R.
- Nucleus Software: Hadoop cluster planning and sizing for data warehouses through Cassandra.
- Tech Mahindra: Implemented Spark with Cassandra on a RAC server for collecting multiple log files. Migrated DB2 data.
- BARC Mumbai: Hadoop clustering with Spark and Cassandra.
- Providing consultancy to 2 UK-based clients for Data Science implementation.
Interview Questions & Answers
1) What is Hadoop and list its components?
Hadoop is an open-source framework for storing large data sets and running applications across clusters of commodity hardware.
It offers extensive storage for any type of data and can run a very large number of tasks in parallel.
Core components of Hadoop:
Storage unit – HDFS (DataNode, NameNode)
Processing framework – YARN (NodeManager, ResourceManager)
2) What is YARN and explain its components?
Yet Another Resource Negotiator (YARN) is one of the core components of Hadoop. It is responsible for managing resources for the various applications operating in a Hadoop cluster, and also schedules tasks on different cluster nodes.
Resource Manager - It runs as the master daemon and controls resource allocation in the cluster.
Node Manager - It runs as a slave daemon on each worker node and is responsible for executing tasks on that node.
Application Master - It manages the job lifecycle and resource requirements of an individual application. It works with the Node Manager to execute and monitor tasks.
Container - It is a bundle of resources such as network, disk, RAM, and CPU on a single node.
3) Explain HDFS and its components?
HDFS (Hadoop Distributed File System) is the primary data storage unit of Hadoop.
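It stores data of various types as blocks in a distributed environment and follows a master/slave topology: the NameNode is the master that maintains the file system metadata, and the DataNodes are the slaves that store the actual blocks.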
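As a rough illustration of block storage, the Python sketch below splits a file of a given size into fixed-size blocks and assigns each block's replicas to DataNodes. The 128 MB block size and replication factor of 3 are common Hadoop defaults, and both are configurable. The round-robin placement, the function name plan_blocks, and the node names are assumptions made for illustration. They do not reproduce the NameNode's actual placement policy.

```python
import math

BLOCK_SIZE_MB = 128   # common HDFS default block size (configurable)
REPLICATION = 3       # common HDFS default replication factor (configurable)


def plan_blocks(file_size_mb, datanodes):
    """Return a list of (block_index, block_size_mb, replica_nodes).

    Placement is plain round-robin, a stand-in for the NameNode's
    real placement policy (assumption for illustration only).
    """
    if len(datanodes) < REPLICATION:
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    plan = []
    for i in range(n_blocks):
        # The last block holds only the remaining data.
        size = min(BLOCK_SIZE_MB, file_size_mb - i * BLOCK_SIZE_MB)
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(REPLICATION)]
        plan.append((i, size, replicas))
    return plan


if __name__ == "__main__":
    # Hypothetical 300 MB file on a hypothetical four-node cluster.
    for block in plan_blocks(300, ["dn1", "dn2", "dn3", "dn4"]):
        print(block)
```

A 300 MB file yields two full 128 MB blocks and one 44 MB block, each held by three different nodes.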
4) What is MapReduce and list its features?
MapReduce is a programming model for processing and generating large datasets on a cluster using parallel, distributed algorithms.
The general syntax for running a MapReduce program is: hadoop jar <jar_file> [main_class] <input_path> <output_path>
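To illustrate the programming model itself (not the command-line syntax), here is a minimal single-process sketch of a word count in plain Python. It only mimics the map, shuffle, and reduce phases. The function names are hypothetical, and a real Hadoop job would distribute these phases across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can run in parallel on different nodes.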
5) What is Apache Pig?
Apache Pig is a platform for analyzing huge datasets, built around a high-level scripting language for writing programs that run on Apache Hadoop. It is used to work with large amounts of structured and semi-structured data.
The language used on this platform is called Pig Latin.
Pig scripts are executed as Hadoop jobs on engines such as MapReduce or Apache Spark.
6) What is Pig Latin?
Pig Latin is the scripting language used in Apache Pig to define data flows for analyzing data.
7) What are the benefits of Apache Pig over MapReduce?
Pig Latin is a high-level scripting language, while MapReduce is a low-level data processing paradigm.
Programmers can achieve in Pig Latin what would otherwise require complex Java implementations in MapReduce.
Apache Pig reduces code length by approximately 20 times (according to Yahoo), which in turn reduces development time by almost 16 times.
Pig offers built-in operators for data operations such as filtering, joining, sorting, and ordering, whereas implementing the same operations in MapReduce takes considerable effort.
8) List the various relational operators used in “Pig Latin”?
- ORDER BY
9) What are the different data types in Pig Latin?
Pig Latin supports both atomic (scalar) data types and complex data types.
Atomic data types: the basic scalar types, such as int, long, float, double, chararray, and bytearray.
Complex data types: tuple, bag, and map.
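For readers more familiar with general-purpose languages, the complex types can be pictured with rough Python analogues. This is only an analogy and all sample values are hypothetical. A tuple is an ordered set of fields, a bag is a collection of tuples, and a map is a set of key/value pairs with chararray keys. The helper group_by_city is hypothetical. It imitates how grouping in Pig produces one bag per key.

```python
# Pig tuple: an ordered set of fields (Python tuple as an analogue).
student = ("Asha", 101, "Delhi")  # hypothetical sample record

# Pig bag: a collection of tuples, duplicates allowed (list of tuples here).
students = [student, ("Ravi", 102, "Mumbai"), student]

# Pig map: key/value pairs with chararray keys (dict with str keys here).
contact = {"email": "asha@example.com", "city": "Delhi"}


def group_by_city(bag):
    """Imitate grouping: return {city: bag of the tuples for that city}."""
    groups = {}
    for row in bag:
        groups.setdefault(row[2], []).append(row)
    return groups


if __name__ == "__main__":
    print(group_by_city(students))
```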
10) How do you load data in Pig?
A = LOAD '/home/training/simple.txt' USING PigStorage('|') AS (sname:chararray, sid:int, address:chararray);
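What such a load does to each line can be sketched in Python: split the line on the '|' delimiter and cast each field to its declared type. The function name parse_line and the sample rows are hypothetical. Turning a missing or uncastable field into None is meant to resemble Pig yielding null in those cases. It is a simplification and not Pig's actual implementation.

```python
def parse_line(line, delimiter="|"):
    """Split one text line into (sname, sid, address), casting sid to int.

    A missing field, or a sid that cannot be cast, becomes None.
    """
    fields = line.rstrip("\n").split(delimiter)
    fields += [None] * (3 - len(fields))  # pad short rows with None
    sname, sid, address = fields[:3]
    try:
        sid = int(sid)
    except (TypeError, ValueError):
        sid = None
    return (sname, sid, address)


if __name__ == "__main__":
    # Hypothetical sample rows in the same layout as simple.txt.
    for raw in ["Asha|101|Delhi", "Ravi|abc|Mumbai", "Meena|103"]:
        print(parse_line(raw))
```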
11) What is Apache Hive?
Apache Hive provides a database query interface to Apache Hadoop. It reads, writes, and manages large datasets residing in distributed storage and queries them using SQL syntax. In other words, Hive is data warehouse software that runs on top of Hadoop. It is a tool for querying and processing data, and it mostly stores structured data.
12) What is the Use of Hive?
Hive provides a table-based layer over structured data stored in Hadoop. It is a useful and convenient tool for SQL users because Hive uses HQL (Hive Query Language), which resembles SQL.
13) How do you create a managed table in Hive?
hive> CREATE TABLE student (sname STRING, sid INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
14) What is Sqoop and what is the use of Sqoop?
Sqoop is short for "SQL to Hadoop". It is a command-line (CLI) tool for transferring data between relational databases (RDBMS) and Hadoop, in either direction.
15) List some features of sqoop?
Full Load: Sqoop can load a single table or all the tables in a database with a sqoop command.
Incremental Load: Sqoop can perform incremental loads, retrieving only rows newer than a previously imported set of rows.
Parallel import/export: Sqoop uses the YARN framework to import and export data. YARN provides parallelism, since data is read and written on multiple nodes in parallel, and fault tolerance, since data is replicated by default.
Import results of SQL query: Sqoop can import the result of a SQL query into HDFS.
Compression: Sqoop can compress the data it imports from a database. If you specify --compress while importing, Sqoop compresses the output files in gzip format by default and gives them a .gz extension. If you also provide --compression-codec, Sqoop compresses the output with the codec you name, for example bzip2.
Connectors for all major RDBMS databases: Sqoop provides connectors for almost all major relational databases.
Kerberos Security Integration: Sqoop supports Kerberos authentication, a protocol based on tickets or keytabs that authenticates users and services before they connect to services such as HDFS and Hive.
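The idea behind the incremental load feature listed above can be sketched in Python. This is not Sqoop itself: an in-memory SQLite database stands in for the source RDBMS, and the table, column, and function names are hypothetical. The logic mirrors an append-style incremental import, where a check column and the last imported value decide which rows are new.

```python
import sqlite3


def incremental_import(conn, last_value):
    """Return (new_rows, new_last_value) for rows whose id exceeds last_value."""
    rows = conn.execute(
        "SELECT id, name FROM students WHERE id > ? ORDER BY id", (last_value,)
    ).fetchall()
    new_last = rows[-1][0] if rows else last_value
    return rows, new_last


def demo():
    """Run an initial load, add one row, then run an incremental load."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO students VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")]
    )
    first, last = incremental_import(conn, 0)      # initial load: both rows
    conn.execute("INSERT INTO students VALUES (3, 'Meena')")
    second, last = incremental_import(conn, last)  # later run: only the new row
    conn.close()
    return first, second, last


if __name__ == "__main__":
    print(demo())
```

In Sqoop, the corresponding import options are --incremental append, --check-column, and --last-value.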
16) What is Apache Spark?
Apache Spark is a framework for real-time data analytics in a distributed computing environment. It executes in-memory computations to increase the speed of data processing.
It can be up to 100x faster than MapReduce for large-scale data processing, because it exploits in-memory computation and other optimizations.