Saturday, January 24, 2015

Apache Hive : Joining datasets

Hive Supports the SQL Join but only equi-joins are supported, A SQL join is used to combine two or more than two table based on some criteria.the most used join is the Inner Join which return the all the rows between two datasets where join condition met.
lets see in the example how Inner join works in Hive. to demonstrate it we have two data sets the first one is the customers dataset which hold the information of customers.
customers.csv
cust_id,cust_name,ship_address,phone,email
1001,Sawyer Thompson,"-16.44456  115.92975",1-808-310-6814,faucibus@lacinia.net
1002,Xenos Campos,"5.69702  -164.57551",1-872-151-8210,dolor.Fusce@Nunc.com
1003,Brandon Castro,"-25.12774  -151.86179",1-283-827-7635,parturient@aliquameu.org
1004,Evan Gordon,"-20.12089  -85.3661",1-801-885-3833,Fusce.fermentum@Integereu.ca
1005,Macon Hopper,"22.30371  134.10815",1-354-448-6576,est.congue@acturpisegestas.net
1006,Christian Tucker,"73.86819  114.71156",1-106-627-3799,at.egestas.a@Fuscealiquam.net
the other dataset is orders dataset which hold the information about the order placed by the customers.
orders.csv
orderid,cust_id,item,order_dt,track_id
745651,1009,Cephalexin,08/09/2013,ULQ37MGX7MW
745652,1018,Hydrochlorothiazide,01/01/2015,NJJ84QEM7GO
745653,1027,Sertraline HCl,07/13/2014,UHX76SFB1EP
745654,1036,Simvastatin,01/05/2014,ZZH74UDJ6IC
745655,1045,Lisinopril,04/22/2014,TFV46VBX1ZI
745656,1054,Ibuprofen (Rx),08/22/2015,EHJ68BHA6UU
745657,1063,Suboxone,12/10/2014,BNU38NFJ6FO

we have already created the two tables in the hive named as customers and orders which hold the data for the customers and the orders.
customers table
hive> describe customers;
OK
custid              	int                 	customer id         
custname            	string              	customer name       
address             	string              	customer Address    
phone               	string              	customer phone      
email               	string              	customer email      
Time taken: 0.524 seconds, Fetched: 5 row(s)
orders table
hive> describe orders; 
OK
orderid             	int                 	Order ID            
fk_cust_id          	int                 	Cust ID reffering to customers
item                	string              	Order Item          
order_dt            	string              	Order Date          
tracking_id         	string              	Tracking ID for Order
Time taken: 0.732 seconds, Fetched: 5 row(s)

INNER JOIN
The INNER JOIN keyword selects all rows from both tables as long as there is a match between the columns in both tables.
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c inner join orders o on c.custid=o.fk_cust_id
> ;
Total MapReduce jobs = 1
setting HADOOP_USER_NAME	rks
Execution log at: /tmp/rks/.log
2015-01-24 05:03:25	Starting to launch local task to process map join;	maximum memory = 932118528
2015-01-24 05:03:25	Processing rows:	101	Hashtable size:	101	Memory usage:	8029040	rate:	0.009
2015-01-24 05:03:25	Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-03-19_651_4336759746543942005/-local-10002/HashTable-Stage-3/MapJoin-mapfile01--.hashtable
2015-01-24 05:03:25	Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-03-19_651_4336759746543942005/-local-10002/HashTable-Stage-3/MapJoin-mapfile01--.hashtable File size: 6249
2015-01-24 05:03:25	End of local task; Time Taken: 0.751 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501241609_0017, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0017
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job  -kill job_201501241609_0017
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2015-01-24 17:03:31,889 Stage-3 map = 0%,  reduce = 0%
2015-01-24 17:03:34,902 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.92 sec
2015-01-24 17:03:35,906 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.92 sec
2015-01-24 17:03:36,916 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 0.92 sec
MapReduce Total cumulative CPU time: 920 msec
Ended Job = job_201501241609_0017
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.92 sec   HDFS Read: 8569 HDFS Write: 425 SUCCESS
Total MapReduce CPU Time Spent: 920 msec
OK
1009	Declan Hooper	745651	ULQ37MGX7MW
1018	Nathan Mcintyre	745652	NJJ84QEM7GO
1027	Troy Griffith	745653	UHX76SFB1EP
1036	Clark Frazier	745654	ZZH74UDJ6IC
1045	Tad Cross	745655	TFV46VBX1ZI
1054	Gannon Bradshaw	745656	EHJ68BHA6UU
1063	Walter Figueroa	745657	BNU38NFJ6FO
1072	Brady Mcclure	745658	NBK17XMP9XC
1081	Porter Bowers	745659	XHB61DLY6IK
1090	Jakeem Knight	745660	WNN67FXM2NC
1099	Macaulay Armstrong	745661	VXI39DIZ3HU
Time taken: 17.391 seconds, Fetched: 11 row(s)
hive>

LEFT OUTER JOIN
The LEFT JOIN keyword returns all rows from the left table with the matching rows in the right table. The result is NULL in the right side when there is no match.
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c left outer join orders o on c.custid=o.fk_cust_id;
Total MapReduce jobs = 1
setting HADOOP_USER_NAME	rks
Execution log at: /tmp/rks/.log
2015-01-24 05:08:40	Starting to launch local task to process map join;	maximum memory = 932118528
2015-01-24 05:08:41	Processing rows:	101	Hashtable size:	101	Memory usage:	8133752	rate:	0.009
2015-01-24 05:08:41	Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-08-34_361_1900203016678725125/-local-10002/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2015-01-24 05:08:41	Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-08-34_361_1900203016678725125/-local-10002/HashTable-Stage-3/MapJoin-mapfile11--.hashtable File size: 6249
2015-01-24 05:08:41	End of local task; Time Taken: 0.908 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501241609_0018, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0018
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job  -kill job_201501241609_0018
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2015-01-24 17:08:48,387 Stage-3 map = 0%,  reduce = 0%
2015-01-24 17:08:51,396 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.88 sec
2015-01-24 17:08:52,400 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.88 sec
2015-01-24 17:08:53,408 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 0.88 sec
MapReduce Total cumulative CPU time: 880 msec
Ended Job = job_201501241609_0018
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.88 sec   HDFS Read: 8569 HDFS Write: 2629 SUCCESS
Total MapReduce CPU Time Spent: 880 msec
OK
NULL	cust_name	NULL	NULL
1001	Sawyer Thompson	NULL	NULL
1002	Xenos Campos	NULL	NULL
1003	Brandon Castro	NULL	NULL
1004	Evan Gordon	NULL	NULL
1005	Macon Hopper	NULL	NULL
1006	Christian Tucker	NULL	NULL
1007	Rafael Erickson	NULL	NULL
1008	Brent Roth	NULL	NULL
1009	Declan Hooper	745651	ULQ37MGX7MW
1010	Neil Leon	NULL	NULL

RIGHT OUTER JOIN
The RIGHT JOIN keyword returns all rows from the right table , with the matching rows in the left table. The result is NULL in the left side when there is no match.
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c right outer join orders o on c.custid=o.fk_cust_id;
Total MapReduce jobs = 1
setting HADOOP_USER_NAME	rks
Execution log at: /tmp/rks/.log
2015-01-24 05:10:50	Starting to launch local task to process map join;	maximum memory = 932118528
2015-01-24 05:10:51	Processing rows:	101	Hashtable size:	101	Memory usage:	7971568	rate:	0.009
2015-01-24 05:10:51	Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-10-44_697_521683568687053567/-local-10002/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
2015-01-24 05:10:51	Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-10-44_697_521683568687053567/-local-10002/HashTable-Stage-3/MapJoin-mapfile20--.hashtable File size: 6317
2015-01-24 05:10:51	End of local task; Time Taken: 0.712 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501241609_0019, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0019
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job  -kill job_201501241609_0019
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2015-01-24 17:10:58,019 Stage-3 map = 0%,  reduce = 0%
2015-01-24 17:11:01,064 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.91 sec
2015-01-24 17:11:02,067 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 0.91 sec
2015-01-24 17:11:03,073 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 0.91 sec
MapReduce Total cumulative CPU time: 910 msec
Ended Job = job_201501241609_0019
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.91 sec   HDFS Read: 5205 HDFS Write: 2668 SUCCESS
Total MapReduce CPU Time Spent: 910 msec
OK
NULL	NULL	NULL	track_id
1009	Declan Hooper	745651	ULQ37MGX7MW
1018	Nathan Mcintyre	745652	NJJ84QEM7GO
1027	Troy Griffith	745653	UHX76SFB1EP
1036	Clark Frazier	745654	ZZH74UDJ6IC
1045	Tad Cross	745655	TFV46VBX1ZI
1054	Gannon Bradshaw	745656	EHJ68BHA6UU
1063	Walter Figueroa	745657	BNU38NFJ6FO
1072	Brady Mcclure	745658	NBK17XMP9XC
1081	Porter Bowers	745659	XHB61DLY6IK
1090	Jakeem Knight	745660	WNN67FXM2NC
1099	Macaulay Armstrong	745661	VXI39DIZ3HU
NULL	NULL	745662	DKP00ZCS6FU
NULL	NULL	745663	YSJ42ZXP5ZG
NULL	NULL	745664	OBT90SWM3FN
NULL	NULL	745665	YVJ22BYO5DT
NULL	NULL	745666	DXY85QAL1BE
NULL	NULL	745667	THJ12NCF3KR

FULL OUTER JOIN
The FULL OUTER JOIN returns all rows from the left table and from the right table.The FULL OUTER JOIN combines the result of both LEFT and RIGHT joins.
hive> 
> select c.custid,c.custname,o.orderid,o.tracking_id from customers c full outer join orders o on c.custid=o.fk_cust_id; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201501241609_0020, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0020
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job  -kill job_201501241609_0020
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2015-01-24 17:12:46,443 Stage-1 map = 0%,  reduce = 0%
2015-01-24 17:12:50,465 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.08 sec
2015-01-24 17:12:51,470 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.08 sec
2015-01-24 17:12:52,478 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:53,488 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:54,498 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:55,504 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:56,512 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:57,521 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:58,531 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 2.89 sec
2015-01-24 17:12:59,538 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 2.89 sec
2015-01-24 17:13:00,545 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.85 sec
2015-01-24 17:13:01,551 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.85 sec
2015-01-24 17:13:02,560 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.85 sec
MapReduce Total cumulative CPU time: 3 seconds 850 msec
Ended Job = job_201501241609_0020
MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 3.85 sec   HDFS Read: 13774 HDFS Write: 4872 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 850 msec
OK
NULL	cust_name	NULL	NULL
NULL	NULL	NULL	track_id
1001	Sawyer Thompson	NULL	NULL
1002	Xenos Campos	NULL	NULL
1003	Brandon Castro	NULL	NULL
1004	Evan Gordon	NULL	NULL
1005	Macon Hopper	NULL	NULL
1006	Christian Tucker	NULL	NULL
1007	Rafael Erickson	NULL	NULL
1008	Brent Roth	NULL	NULL
1009	Declan Hooper	745651	ULQ37MGX7MW
1010	Neil Leon	NULL	NULL
1011	Lionel Vaughan	NULL	NULL
1012	Dillon Johns	NULL	NULL
1013	Davis Fisher	NULL	NULL
1014	Isaac Fields	NULL	NULL
1015	Micah Figueroa	NULL	NULL
1016	Burke Merrill	NULL	NULL
1017	Felix Ward	NULL	NULL
1018	Nathan Mcintyre	745652	NJJ84QEM7GO
1019	Perry Bullock	NULL	NULL
1020	Ali Kramer	NULL	NULL
1021	Timothy Avila	NULL	NULL
1022	Jason Wolfe	NULL	NULL

No comments: