Introduction
This Proof of Concept (POC) performs data analysis on a book-ratings dataset using the Apache Hive and Apache Pig components. Once the data has been analyzed with Hive queries and Pig commands, the summarized data is stored in HDFS and mapped to Hive tables so it can be visualized in Tableau.
Dataset used: the Book-Crossing (BX) dataset. Download From Here
Step 1: Download the data and format it for loading
Type head BX-Books.csv to see the first few lines of the raw data. You will notice that it is not really comma-delimited; the delimiter is ';'. There are also some escaped HTML entities (such as &amp;) we can clean up, and the quotes around all of the values can be removed.
The first line in the file looks like this:
"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
This line defines the format of the fields in the file.
Open a terminal and enter:
sed 's/&amp;/\&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' | sed 's/"//g' > BX-BooksCorrected.txt
This will:
1. Replace all &amp; instances with &
2. Remove the first (header) line
3. Change all semicolons to $$$
4. Change all "$$$" instances back into ";"
5. Remove all " characters
Steps 3 and 4 may look strange, but some of the field content may contain semicolons. Those embedded semicolons are converted to $$$ along with everything else; however, because they are not surrounded by quotes, they do not match the "$$$" pattern, so they are not converted back into semicolons and cannot mess up the import process.
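For example, a hypothetical raw line with a semicolon embedded in the book title (illustrative values only, with fewer columns than the real file) moves through the pipeline like this:
"1234567890";"Myths; Legends";"Jane Doe";"2001";"Acme Press" (raw, after steps 1 and 2)
"1234567890"$$$"Myths$$$ Legends"$$$"Jane Doe"$$$"2001"$$$"Acme Press" (after step 3)
"1234567890";"Myths$$$ Legends";"Jane Doe";"2001";"Acme Press" (after step 4)
1234567890;Myths$$$ Legends;Jane Doe;2001;Acme Press (after step 5)
The field delimiters survive intact, while the embedded semicolon remains as $$$ in the cleaned file.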
The sed command finds and replaces text inside a file (or stream) using regular expressions.
More details about the "sed" command: Click Here
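As a quick sanity check of the first substitution (the replacement & has to be escaped as \&, because an unescaped & in a sed replacement stands for the entire matched text):
echo 'AT&amp;T' | sed 's/&amp;/\&/g'
AT&T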
Repeat Step 1 for the other files (the ratings file, for example, becomes BX-Book-RatingsCorrected.txt).
Step 2: Loading data into HDFS
hdfs dfs -mkdir -p /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-BooksCorrected.txt /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-BooksCorrected.csv /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-Book-Ratings.csv /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-Book-RatingsCorrected.csv /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-Book-RatingsCorrected.txt /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-Users.csv /POCs/POC2/input_data/
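To confirm that all the files landed in HDFS, list the directory:
hdfs dfs -ls /POCs/POC2/input_data/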
Step 3: Data analysis using Pig commands
A relation is a bag (more specifically, an outer bag).
A bag is a collection of tuples. Example: {(19,2), (18,1)}
A tuple is an ordered set of fields. Example: (19,2)
A field is a piece of data.
A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table.
Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Relations are referred to by name (or alias). Names are assigned by you as part of the Pig Latin statement. In the example below, the name (alias) of the relation is A.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Load the books data into a Pig relation:
books = LOAD '/POCs/POC2/input_data/BX-BooksCorrected.txt' USING PigStorage(';') AS (ISBN:chararray, BookTitle:chararray, BookAuthor:chararray, YearOfPublication:int, Publisher:chararray);
DESCRIBE books;
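DESCRIBE prints the schema of the relation; you should see something like:
books: {ISBN: chararray,BookTitle: chararray,BookAuthor: chararray,YearOfPublication: int,Publisher: chararray}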
Finding books by year
Group the collection by year of publication:
groupByYear = GROUP books BY YearOfPublication;
DESCRIBE groupByYear;
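The grouped relation nests the original tuples inside a bag keyed by the group field, so DESCRIBE should report roughly:
groupByYear: {group: int,books: {(ISBN: chararray,BookTitle: chararray,BookAuthor: chararray,YearOfPublication: int,Publisher: chararray)}}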
Generate the book count per year. Each output row is a single 'year,count' string, which keeps the stored file in a simple CSV-like format:
countByYear = FOREACH groupByYear GENERATE CONCAT((chararray)$0,CONCAT(',',(chararray)COUNT($1)));
DESCRIBE countByYear;
DUMP countByYear;
STORE countByYear INTO '/POCs/POC2/output_data/countByYear';
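You can verify the stored results directly in HDFS (the part-file names vary with the Pig and Hadoop versions, so a glob is safest):
hdfs dfs -cat /POCs/POC2/output_data/countByYear/part-*
The raw data also contains rows with a publication year of 0, which the Hive step below filters out. If you prefer to exclude them in Pig as well, a minimal sketch (validBooks is a new alias introduced here):
validBooks = FILTER books BY YearOfPublication > 0;
groupValidByYear = GROUP validBooks BY YearOfPublication;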
Step 4: Data analysis using Hive queries
[Figure: Hive architecture]
Load the data into a Hive table:
CREATE TABLE IF NOT EXISTS BookData (ISBN STRING, BookTitle STRING, BookAuthor STRING, YearOfPublication INT, Publisher STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' STORED AS TEXTFILE;
(The semicolon delimiter is escaped as \; so that the Hive CLI does not mistake it for the end of the statement.)
LOAD DATA INPATH '/POCs/POC2/input_data/BX-BooksCorrected.txt' OVERWRITE INTO TABLE BookData;
(Note that LOAD DATA INPATH moves the file from /POCs/POC2/input_data/ into the Hive warehouse directory, so run the Pig analysis first if you want to use the same copy.)
DESCRIBE BookData;
Count the number of books per year of publication:
SELECT YearOfPublication, COUNT(BookTitle) FROM BookData GROUP BY YearOfPublication;
CREATE TABLE BookData_Sum AS SELECT YearOfPublication, COUNT(BookTitle) AS BookCount FROM BookData WHERE YearOfPublication > 0 GROUP BY YearOfPublication;
Remove the rows with an invalid (zero) publication year from the base table:
INSERT OVERWRITE TABLE BookData SELECT BookData.* FROM BookData WHERE YearOfPublication > 0;
Generating the final results is again a single query:
SELECT Publisher, BookAuthor, YearOfPublication, COUNT(BookTitle) FROM BookData GROUP BY Publisher, BookAuthor, YearOfPublication;
CREATE TABLE BookData_Sum_PubAuYear AS SELECT Publisher, BookAuthor, YearOfPublication, COUNT(BookTitle) AS BookCount FROM BookData GROUP BY Publisher, BookAuthor, YearOfPublication;
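Before moving on to Tableau, it is worth spot-checking the summary table:
SELECT * FROM BookData_Sum_PubAuYear LIMIT 10;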
Step 5: Visualize the summarized data in Tableau
a. To connect to Hadoop Hive, download the Hive ODBC connector:
http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive/odbc/hive-odbc-v2-5-14.html
b. Create a DSN (Data Source Name) for the Hive connection
c. Make sure HiveServer2 is running on your Hadoop cluster; if not, start it with the command below:
hive --service hiveserver2 &
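To verify that HiveServer2 is accepting connections (assuming the default port 10000 on localhost), you can try connecting with Beeline:
beeline -u jdbc:hive2://localhost:10000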
d. Follow the link below to connect to the Hive tables and visualize the summarized data as charts in Tableau:
https://onlinehelp.tableau.com/current/pro/desktop/en-us/examples_hortonworkshadoop.html
Happy Learning!!!