The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

Why is Hadoop important?

  • Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.

  • Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.

  • Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.

  • Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.

  • Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.

  • Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.

CURRICULUM


Section 1: Learn all the buzzwords! And install the Hortonworks Data Platform Sandbox

  • Introduction, and install Hadoop on your desktop
  • Hadoop Overview and History
  • Overview of the Hadoop Ecosystem

Section 2: Using Hadoop's Core: HDFS and MapReduce

  • HDFS: What it is, and how it works
  • Install the MovieLens dataset into HDFS using the Ambari UI
  • Install the MovieLens dataset into HDFS using the command line
  • MapReduce: What it is, and how it works
  • How MapReduce distributes processing
  • MapReduce example: Break down movie ratings by rating score
  • Installing Python, MRJob, and nano
  • Code up the ratings histogram MapReduce job and run it
  • Rank movies by their popularity
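The ratings histogram job above follows the classic map/shuffle/reduce flow. As a rough illustration of the model (not the course's actual MRJob code), here is a minimal plain-Python sketch; the sample rows mimic the MovieLens tab-separated format (userID, movieID, rating, timestamp) with made-up values:

```python
from collections import defaultdict

# Hypothetical MovieLens-style rows: userID \t movieID \t rating \t timestamp
LINES = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
]

def mapper(line):
    """Emit (rating, 1) for each input row, as a MapReduce mapper would."""
    fields = line.split("\t")
    yield fields[2], 1

def shuffle(pairs):
    """Group mapper output by key -- the framework does this between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Sum the occurrence counts for one rating value."""
    return key, sum(values)

def ratings_histogram(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())

print(ratings_histogram(LINES))  # -> {'3': 2, '1': 2, '2': 1}
```

In the course itself, MRJob packages the mapper and reducer into a job class and Hadoop handles the shuffle and distribution across nodes.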

Section 3: Programming Hadoop with Pig

  • Introducing Ambari
  • Introducing Pig
  • Pig example: Find the oldest movie with a 5-star rating
  • Find old 5-star movies with Pig
  • More Pig Latin
  • Find the most-rated one-star movie

Section 4: Programming Hadoop with Spark

  • Why Spark?
  • The Resilient Distributed Dataset (RDD)
  • Find the movie with the lowest average rating - with RDD's
  • Datasets and Spark 2.0
  • Find the movie with the lowest average rating - with DataFrames
  • Movie recommendations with MLLib
  • Filter the lowest-rated movies by number of ratings
  • Check your results against mine!
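The "lowest average rating" exercise boils down to a map step that pairs each movie with a (total, count) tuple, a reduceByKey step that merges those tuples, and a final average. As a plain-Python sketch of that RDD-style flow (the movie IDs and ratings below are made up, not the course's data):

```python
# Hypothetical (movieID, rating) pairs standing in for parsed MovieLens rows.
RATINGS = [(50, 5.0), (50, 4.0), (1203, 1.0), (1203, 2.0), (2000, 1.0)]

def reduce_by_key(pairs, fn):
    """Plain-Python stand-in for RDD.reduceByKey: combine values per key."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc

def lowest_average_rating(ratings):
    # Map each rating to (movieID, (ratingTotal, count)), as an RDD map would.
    pairs = [(movie, (rating, 1)) for movie, rating in ratings]
    totals = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))
    averages = {movie: total / count for movie, (total, count) in totals.items()}
    return min(averages.items(), key=lambda kv: kv[1])

print(lowest_average_rating(RATINGS))  # -> (2000, 1.0)
```

In Spark the same logic runs in parallel across the cluster; the DataFrame version in the later lecture expresses it as a groupBy and aggregation instead.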

Section 5: Using relational data stores with Hadoop

  • What is Hive?
  • Use Hive to find the most popular movie
  • How Hive works
  • Use Hive to find the movie with the highest average rating
  • Integrating MySQL with Hadoop
  • Install MySQL and import our movie data
  • Use Sqoop to import data from MySQL to HDFS/Hive
  • Use Sqoop to export data from Hadoop to MySQL
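The "most popular movie" Hive exercise is essentially a GROUP BY with a count. The sketch below uses Python's built-in sqlite3 only to make the query runnable here; the HiveQL would read almost identically, and the table/column names (ratings, movie_id, rating) are hypothetical stand-ins for the imported MovieLens data:

```python
import sqlite3

# SQLite stands in for Hive here purely so the query can run locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INT, movie_id INT, rating INT)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 50, 5), (2, 50, 4), (3, 50, 3), (1, 172, 5), (2, 172, 4), (3, 181, 2)],
)

# "Most popular" = the movie with the most ratings.
row = conn.execute(
    """SELECT movie_id, COUNT(*) AS num_ratings
       FROM ratings
       GROUP BY movie_id
       ORDER BY num_ratings DESC
       LIMIT 1"""
).fetchone()
print(row)  # -> (50, 3)
```

The point of Hive is that this same SQL is translated into distributed jobs over data sitting in HDFS, rather than running against a single-node database.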

Section 6: Using non-relational data stores with Hadoop

  • Why NoSQL?
  • What is HBase
  • Import movie ratings into HBase
  • Use HBase with Pig to import data at scale
  • Cassandra overview
  • Installing Cassandra
  • Write Spark output into Cassandra
  • MongoDB overview
  • Install MongoDB, and integrate Spark with MongoDB
  • Using the MongoDB shell
  • Choosing a database technology
  • Choose a database for a given problem

Section 7: Querying your Data Interactively

  • Overview of Drill
  • Setting up Drill
  • Querying across multiple databases with Drill
  • Overview of Phoenix
  • Install Phoenix and query HBase with it
  • Integrate Phoenix with Pig
  • Overview of Presto
  • Install Presto, and query Hive with it
  • Query both Cassandra and Hive using Presto

Section 8: Managing your Cluster

  • YARN explanation
  • Tez explanation
  • Use Hive on Tez and measure the performance benefit
  • Mesos explanation
  • ZooKeeper explanation
  • Simulating a failing master with ZooKeeper
  • Oozie explanation
  • Set up a simple Oozie workflow
  • Zeppelin overview
  • Use Zeppelin to analyze movie ratings, part 1
  • Use Zeppelin to analyze movie ratings, part 2
  • Hue overview
  • Other technologies worth mentioning

Section 9: Feeding Data to your Cluster

  • Kafka explanation
  • Setting up Kafka, and publishing some data
  • Publishing web logs with Kafka
  • Flume explained
  • Set up Flume and publish logs with it
  • Set up Flume to monitor a directory and store its data in HDFS

Section 10: Analyzing Streams of Data

  • Spark Streaming: Introduction
  • Analyze web logs published with Flume using Spark Streaming
  • Monitor Flume-published logs for errors in real time
  • Exercise solution: Aggregating HTTP access codes with Spark Streaming
  • Apache Storm: Introduction
  • Count words with Storm
  • Flink: An Overview
  • Counting words with Flink
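Streaming word counts, whether in Spark Streaming, Storm, or Flink, share one idea: data arrives in pieces and the job maintains running state across them. A minimal plain-Python sketch of that stateful pattern (the batches below are invented sample lines, not course data):

```python
from collections import Counter

# Simulated micro-batches of log lines, standing in for a live stream.
BATCHES = [
    ["error timeout", "ok"],
    ["error", "ok ok"],
]

def streaming_word_count(batches):
    """Keep a running count across batches, the way stateful streaming
    operators (e.g. Spark's updateStateByKey) accumulate per-key state."""
    state = Counter()
    for batch in batches:
        for line in batch:
            state.update(line.split())
        # A real streaming job would emit the updated state here after
        # each batch (print it, or write it to a store).
    return dict(state)

print(streaming_word_count(BATCHES))  # -> {'error': 2, 'timeout': 1, 'ok': 3}
```

The frameworks differ mainly in how they slice the stream (micro-batches vs. per-event processing) and how they checkpoint this state for fault tolerance.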

Section 11: Designing Real-World Systems

  • The Best of the Rest
  • Review: How the pieces fit together
  • Understanding your requirements
  • Sample application: consume webserver logs and keep track of top-sellers
  • Sample application: serving movie recommendations to a website
  • Design a system to report web sessions per day
  • Exercise solution: Design a system to count daily sessions

Hadoop Certification: Pass Guaranteed!

  • Complete Your Course

  • Become Certified

  • Impress Your Employer

FAQ

What are the prerequisites for the course?

Basic programming knowledge (optional)

Basic mathematics knowledge

What is the duration of the course?

40+ hours

Who can take this course?

B.Tech/M.Tech students and M.Sc. students in Mathematics, Statistics, or Physics

Working professionals

Where will you be after this course?

You will understand the Hadoop architecture

You will be able to work with all of the components of the Hadoop ecosystem

Who will provide the training for this course?

Trainers who are, or have been, working in industry and have a practical understanding of the subjects.

What is special about Excelvisor Technologies?

Excelvisor is an Industry 4.0 product company. We are highly professional, we provide an industry-like environment to our students, and we have strong links and tie-ups with corporate partners, which helps our students with their placements.

CONTACT US

EXCELVISOR TECHNOLOGIES

2nd Floor,No. 4, BTM 6th Stage,2nd Phase,2nd Block, BDA 80 Feet Rd, Muthuraya Swamy Layout, Hanuman Nagar, Hulimavu, Bengaluru, Karnataka 560076

   Email: info@excelvisor.com

SUBMIT YOUR RESUME

Are you looking for the right career
opportunity? Submit your resume to:
resume@excelvisor.com

SUBMIT

GET IN TOUCH