Big Data in Practice using Hadoop


Nowadays everybody seems to be working with "big data". Do you also want to interrogate your various data sources (click streams, social media, relational data, sensor data, ...), and are you running into the shortcomings of traditional data tools? Then you may need a distributed data store such as HDFS and a MapReduce infrastructure such as Hadoop's.

This course builds on the foundations laid in the course 'Big Data Concepts'. It includes hands-on practical sessions on Linux with Apache Hadoop, Pig and Hive. Attendees will learn how to implement robust data processing with an SQL-style interface that generates MapReduce jobs. They will also learn to work with the graphical tools that allow for easy follow-up of the jobs and the workflows on the distributed Hadoop cluster.

On successfully completing this course, you will have gained sufficient basic expertise to set up a Hadoop cluster, to import data into HDFS, and to interrogate it cleverly using MapReduce.

This course is also available for one-company, on-site presentations and for live presentation over the Internet, via the Virtual Classroom Environment service.

What you will learn

On successful completion of this course you will be able to:

  • describe the concepts and principles of Hadoop
  • explain the use of MapReduce
  • write MapReduce programs
  • understand and use MapReduce components
  • use the Pig interface
  • use the Hive interface
  • describe HBase and Cassandra.

Who Should Attend

Whoever wants to start practising "big data": developers, data architects, and anyone who needs to work with big data technology.

Prerequisites

Familiarity with the concepts of data stores and more specifically of "big data" is essential (see the course Big Data Concepts). Additionally, some basic knowledge of SQL, UNIX and Java will be useful. Experience with a programming language (Java, PHP, Python, Perl, C++ or C#) is a must.

Duration

2 days

Fee (per attendee)

P.O.A.

This includes free online 24/7 access to course notes.

Hard copy course notes are available on request from rsmshop@rsm.co.uk at £50.00 plus carriage per set.

Course Code

BDHA

Contents

Motivation for Hadoop & Base Concepts

The Apache Hadoop project and the components of Hadoop; HDFS: the Hadoop Distributed File System; MapReduce: what and how; The workings of a Hadoop cluster.

Writing a MapReduce Program

Implementing MapReduce drivers, mappers, and reducers in Java; Writing mappers and reducers in another programming or scripting language (e.g. Perl); Unit testing; Writing partitioners to optimize load balancing; Debugging a MapReduce program.
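To give a flavour of the partitioner topic, here is a minimal sketch in plain Python (not the Hadoop API) of how a partitioner assigns keys to reducers and how the resulting load can be inspected. The function names and the first-letter rule are hypothetical and chosen only for illustration; Hadoop's own default partitioner similarly takes a hash of the key modulo the number of reduce tasks.

```python
import zlib
from collections import Counter


def hash_partition(key, num_reducers):
    """Assign a key to a reducer using a stable hash of its bytes."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def first_letter_partition(key, num_reducers):
    """Hypothetical custom rule: route keys by their first letter."""
    if not key or not key[0].isalpha():
        return 0
    return (ord(key[0].lower()) - ord("a")) % num_reducers


def reducer_load(keys, partitioner, num_reducers):
    """Count how many keys each reducer would receive."""
    return Counter(partitioner(key, num_reducers) for key in keys)


if __name__ == "__main__":
    sample = ["apple", "banana", "cherry", "date"]
    print(reducer_load(sample, hash_partition, 3))
    print(reducer_load(sample, first_letter_partition, 3))
```

Comparing the two load counts on realistic keys is one simple way to judge whether a custom rule balances the reducers better than plain hashing.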

Data Input / Output

Reading and writing sequential data from a MapReduce program; The use of binary data; Data compression.

Frequently used MapReduce Components

Sorting, searching, and indexing of data; Word counts and counting pairs of words.
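As an illustration of the word-count and word-pair patterns, the following sketch simulates the map, shuffle, and reduce phases in plain Python. It does not use Hadoop itself; the helper names (`map_words`, `map_pairs`, `reduce_counts`, `run_job`) are assumptions made for this example, and "pairs" is taken here to mean adjacent words on the same line.

```python
from itertools import groupby


def map_words(line):
    """Mapper: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def map_pairs(line):
    """Mapper: emit ((word, next_word), 1) for every adjacent pair."""
    words = line.lower().split()
    for left, right in zip(words, words[1:]):
        yield (left, right), 1


def reduce_counts(key, values):
    """Reducer: sum all the counts emitted for one key."""
    return key, sum(values)


def run_job(lines, mapper, reducer):
    """Simulate one MapReduce job in memory."""
    # Map phase: apply the mapper to every input line.
    emitted = [pair for line in lines for pair in mapper(line)]
    # Shuffle phase: sort by key so that equal keys become adjacent.
    emitted.sort(key=lambda pair: pair[0])
    # Reduce phase: call the reducer once per distinct key.
    grouped = groupby(emitted, key=lambda pair: pair[0])
    return dict(reducer(key, [value for _, value in group]) for key, group in grouped)


if __name__ == "__main__":
    text = ["big data big", "data big"]
    print(run_job(text, map_words, reduce_counts))
    print(run_job(text, map_pairs, reduce_counts))
```

The same reducer serves both jobs; only the mapper changes, which is why these patterns are usually taught together.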

Working with Hive and Pig

Pig as a basic high-level interface for generating a sequence of MapReduce jobs; Hive as a high-level SQL-style interface for generating a sequence of MapReduce jobs.

Introduction to HBase and Cassandra

HBase and Cassandra as alternative data stores.


© RSM Technology 2022