Introduction
This Proof of Concept (POC) performs data analysis on a book-ratings dataset using the Apache Hive and Apache Pig components. Once the data has been analyzed with Hive queries and Pig commands, the summarized data is stored in HDFS and mapped to Hive tables so it can be visualized in Tableau.
Dataset used: the Book-Crossing (BX) dataset. Download From Here
Step 1: Download the data and format it for loading
Type head BX-Books.csv to see the first few lines of the raw data. You will notice that it is not really comma-delimited; the delimiter is ';'. There are also some escaped HTML entities (such as &amp;) we can clean up, and the quotes around all of the values can be removed.
The first line in the file looks like this:
"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
This line defines the format of the fields in the file.
Open a terminal and enter:
sed 's/&amp;/\&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' | sed 's/"//g' > BX-BooksCorrected.txt
This will:
1. Replace all &amp; instances with &
2. Remove the first (header) line
3. Change all semicolons to $$$
4. Change all "$$$" instances back into ";"
5. Remove all " characters
Steps 3 and 4 may look strange, but some of the field content may contain semicolons. Those embedded semicolons are converted to $$$ along with everything else; however, because they are not surrounded by quotes, they do not match the "$$$" pattern, so they are not converted back into semicolons and cannot mess up the import process.
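For example, a hypothetical raw line with a semicolon embedded in the book title (illustrative values only, with fewer columns than the real file) moves through the pipeline like this:
"1234567890";"Myths; Legends";"Jane Doe";"2001";"Acme Press" (raw, after steps 1 and 2)
"1234567890"$$$"Myths$$$ Legends"$$$"Jane Doe"$$$"2001"$$$"Acme Press" (after step 3)
"1234567890";"Myths$$$ Legends";"Jane Doe";"2001";"Acme Press" (after step 4)
1234567890;Myths$$$ Legends;Jane Doe;2001;Acme Press (after step 5)
The field delimiters survive intact, while the embedded semicolon remains as $$$ in the cleaned file.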
The sed command finds and replaces text inside a file (or stream) using regular expressions.
More details about the "sed" command: Click Here
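As a quick sanity check of the first substitution (the replacement & has to be escaped as \&, because an unescaped & in a sed replacement stands for the entire matched text):
echo 'AT&amp;T' | sed 's/&amp;/\&/g'
AT&T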
Repeat Step 1 for the other files (the ratings file, for example, becomes BX-Book-RatingsCorrected.txt).
Step 2: Loading data into HDFS
hdfs dfs -mkdir -p /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-BooksCorrected.txt /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-BooksCorrected.csv /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-Book-Ratings.csv /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-Book-RatingsCorrected.csv /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-Book-RatingsCorrected.txt /POCs/POC2/input_data/
hdfs dfs -copyFromLocal BX-Users.csv /POCs/POC2/input_data/
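To confirm that all the files landed in HDFS, list the directory:
hdfs dfs -ls /POCs/POC2/input_data/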
Step 3: Data analysis using Pig commands
A relation is a bag (more specifically, an outer bag).
A bag is a collection of tuples. Example: {(19,2), (18,1)}
A tuple is an ordered set of fields. Example: (19,2)
A field is a piece of data.
A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table.
Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Relations are referred to by name (or alias). Names are assigned by you as part of the Pig Latin statement. In the example below, the name (alias) of the relation is A.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Load the books data into a Pig relation:
books = LOAD '/POCs/POC2/input_data/BX-BooksCorrected.txt' USING PigStorage(';') AS (ISBN:chararray, BookTitle:chararray, BookAuthor:chararray, YearOfPublication:int, Publisher:chararray);
DESCRIBE books;
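DESCRIBE prints the schema of the relation; you should see something like:
books: {ISBN: chararray,BookTitle: chararray,BookAuthor: chararray,YearOfPublication: int,Publisher: chararray}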
Finding books by year
Group the collection by year of publication:
groupByYear = GROUP books BY YearOfPublication;
DESCRIBE groupByYear;
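The grouped relation nests the original tuples inside a bag keyed by the group field, so DESCRIBE should report roughly:
groupByYear: {group: int,books: {(ISBN: chararray,BookTitle: chararray,BookAuthor: chararray,YearOfPublication: int,Publisher: chararray)}}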
Generate the book count per year. Each output row is a single 'year,count' string, which keeps the stored file in a simple CSV-like format:
countByYear = FOREACH groupByYear GENERATE CONCAT((chararray)$0,CONCAT(',',(chararray)COUNT($1)));
DESCRIBE countByYear;
DUMP countByYear;
STORE countByYear INTO '/POCs/POC2/output_data/countByYear';
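You can verify the stored results directly in HDFS (the part-file names vary with the Pig and Hadoop versions, so a glob is safest):
hdfs dfs -cat /POCs/POC2/output_data/countByYear/part-*
The raw data also contains rows with a publication year of 0, which the Hive step below filters out. If you prefer to exclude them in Pig as well, a minimal sketch (validBooks is a new alias introduced here):
validBooks = FILTER books BY YearOfPublication > 0;
groupValidByYear = GROUP validBooks BY YearOfPublication;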
Step 4: Data analysis using Hive queries
[Figure: Hive architecture]
Load the data into a Hive table:
CREATE TABLE IF NOT EXISTS BookData (ISBN STRING, BookTitle STRING, BookAuthor STRING, YearOfPublication INT, Publisher STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' STORED AS TEXTFILE;
(The semicolon delimiter is escaped as \; so that the Hive CLI does not mistake it for the end of the statement.)
LOAD DATA INPATH '/POCs/POC2/input_data/BX-BooksCorrected.txt' OVERWRITE INTO TABLE BookData;
(Note that LOAD DATA INPATH moves the file from /POCs/POC2/input_data/ into the Hive warehouse directory, so run the Pig analysis first if you want to use the same copy.)
DESCRIBE BookData;
Count the number of books per year of publication:
SELECT YearOfPublication, COUNT(BookTitle) FROM BookData GROUP BY YearOfPublication;
CREATE TABLE BookData_Sum AS SELECT YearOfPublication, COUNT(BookTitle) AS BookCount FROM BookData WHERE YearOfPublication > 0 GROUP BY YearOfPublication;
Remove the rows with an invalid (zero) publication year from the base table:
INSERT OVERWRITE TABLE BookData SELECT BookData.* FROM BookData WHERE YearOfPublication > 0;
Generating the final results is again a single query:
SELECT Publisher, BookAuthor, YearOfPublication, COUNT(BookTitle) FROM BookData GROUP BY Publisher, BookAuthor, YearOfPublication;
CREATE TABLE BookData_Sum_PubAuYear AS SELECT Publisher, BookAuthor, YearOfPublication, COUNT(BookTitle) AS BookCount FROM BookData GROUP BY Publisher, BookAuthor, YearOfPublication;
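Before moving on to Tableau, it is worth spot-checking the summary table:
SELECT * FROM BookData_Sum_PubAuYear LIMIT 10;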
Step 5: Visualize the summarized data in Tableau
a. To connect to Hadoop Hive, download the Hive ODBC connector:
http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive/odbc/hive-odbc-v2-5-14.html
b. Create a DSN (Data Source Name) for the Hive connection
c. Make sure HiveServer2 is running on your Hadoop cluster; if not, start it with the command below:
hive --service hiveserver2 &
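To verify that HiveServer2 is accepting connections (assuming the default port 10000 on localhost), you can try connecting with Beeline:
beeline -u jdbc:hive2://localhost:10000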
d. Follow the link below to connect to the Hive tables and visualize the summarized data as charts in Tableau:
https://onlinehelp.tableau.com/current/pro/desktop/en-us/examples_hortonworkshadoop.html
Happy Learning!!!