Hive Supports the SQL Join but only equi-joins are supported, A SQL join is used to combine two or more than two table based on some criteria.the most used join is the Inner Join which return the all the rows between two datasets where join condition met.
lets see in the example how Inner join works in Hive. to demonstrate it we have two data sets the first one is the customers dataset which hold the information of customers.
customers.csv
cust_id,cust_name,ship_address,phone,email
1001,Sawyer Thompson,"-16.44456 115.92975",1-808-310-6814,faucibus@lacinia.net
1002,Xenos Campos,"5.69702 -164.57551",1-872-151-8210,dolor.Fusce@Nunc.com
1003,Brandon Castro,"-25.12774 -151.86179",1-283-827-7635,parturient@aliquameu.org
1004,Evan Gordon,"-20.12089 -85.3661",1-801-885-3833,Fusce.fermentum@Integereu.ca
1005,Macon Hopper,"22.30371 134.10815",1-354-448-6576,est.congue@acturpisegestas.net
1006,Christian Tucker,"73.86819 114.71156",1-106-627-3799,at.egestas.a@Fuscealiquam.net
the other dataset is orders dataset which hold the information about the order placed by the customers.
orders.csv
orderid,cust_id,item,order_dt,track_id
745651,1009,Cephalexin,08/09/2013,ULQ37MGX7MW
745652,1018,Hydrochlorothiazide,01/01/2015,NJJ84QEM7GO
745653,1027,Sertraline HCl,07/13/2014,UHX76SFB1EP
745654,1036,Simvastatin,01/05/2014,ZZH74UDJ6IC
745655,1045,Lisinopril,04/22/2014,TFV46VBX1ZI
745656,1054,Ibuprofen (Rx),08/22/2015,EHJ68BHA6UU
745657,1063,Suboxone,12/10/2014,BNU38NFJ6FO
we have already created the two tables in the hive named as customers and orders which hold the data for the customers and the orders.
customers table
hive> describe customers;
OK
custid int customer id
custname string customer name
address string customer Address
phone string customer phone
email string customer email
Time taken: 0.524 seconds, Fetched: 5 row(s)
orders table
hive> describe orders;
OK
orderid int Order ID
fk_cust_id int Cust ID reffering to customers
item string Order Item
order_dt string Order Date
tracking_id string Tracking ID for Order
Time taken: 0.732 seconds, Fetched: 5 row(s)
INNER JOIN
The INNER JOIN keyword selects all rows from both tables as long as there is a match between the columns in both tables.
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c inner join orders o on c.custid=o.fk_cust_id
> ;
Total MapReduce jobs = 1
setting HADOOP_USER_NAME rks
Execution log at: /tmp/rks/.log
2015-01-24 05:03:25 Starting to launch local task to process map join; maximum memory = 932118528
2015-01-24 05:03:25 Processing rows: 101 Hashtable size: 101 Memory usage: 8029040 rate: 0.009
2015-01-24 05:03:25 Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-03-19_651_4336759746543942005/-local-10002/HashTable-Stage-3/MapJoin-mapfile01--.hashtable
2015-01-24 05:03:25 Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-03-19_651_4336759746543942005/-local-10002/HashTable-Stage-3/MapJoin-mapfile01--.hashtable File size: 6249
2015-01-24 05:03:25 End of local task; Time Taken: 0.751 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501241609_0017, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0017
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0017
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2015-01-24 17:03:31,889 Stage-3 map = 0%, reduce = 0%
2015-01-24 17:03:34,902 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.92 sec
2015-01-24 17:03:35,906 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.92 sec
2015-01-24 17:03:36,916 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 0.92 sec
MapReduce Total cumulative CPU time: 920 msec
Ended Job = job_201501241609_0017
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.92 sec HDFS Read: 8569 HDFS Write: 425 SUCCESS
Total MapReduce CPU Time Spent: 920 msec
OK
1009 Declan Hooper 745651 ULQ37MGX7MW
1018 Nathan Mcintyre 745652 NJJ84QEM7GO
1027 Troy Griffith 745653 UHX76SFB1EP
1036 Clark Frazier 745654 ZZH74UDJ6IC
1045 Tad Cross 745655 TFV46VBX1ZI
1054 Gannon Bradshaw 745656 EHJ68BHA6UU
1063 Walter Figueroa 745657 BNU38NFJ6FO
1072 Brady Mcclure 745658 NBK17XMP9XC
1081 Porter Bowers 745659 XHB61DLY6IK
1090 Jakeem Knight 745660 WNN67FXM2NC
1099 Macaulay Armstrong 745661 VXI39DIZ3HU
Time taken: 17.391 seconds, Fetched: 11 row(s)
hive>
LEFT OUTER JOIN
The LEFT JOIN keyword returns all rows from the left table with the matching rows in the right table. The result is NULL in the right side when there is no match.
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c left outer join orders o on c.custid=o.fk_cust_id;
Total MapReduce jobs = 1
setting HADOOP_USER_NAME rks
Execution log at: /tmp/rks/.log
2015-01-24 05:08:40 Starting to launch local task to process map join; maximum memory = 932118528
2015-01-24 05:08:41 Processing rows: 101 Hashtable size: 101 Memory usage: 8133752 rate: 0.009
2015-01-24 05:08:41 Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-08-34_361_1900203016678725125/-local-10002/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2015-01-24 05:08:41 Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-08-34_361_1900203016678725125/-local-10002/HashTable-Stage-3/MapJoin-mapfile11--.hashtable File size: 6249
2015-01-24 05:08:41 End of local task; Time Taken: 0.908 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501241609_0018, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0018
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0018
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2015-01-24 17:08:48,387 Stage-3 map = 0%, reduce = 0%
2015-01-24 17:08:51,396 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.88 sec
2015-01-24 17:08:52,400 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.88 sec
2015-01-24 17:08:53,408 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 0.88 sec
MapReduce Total cumulative CPU time: 880 msec
Ended Job = job_201501241609_0018
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.88 sec HDFS Read: 8569 HDFS Write: 2629 SUCCESS
Total MapReduce CPU Time Spent: 880 msec
OK
NULL cust_name NULL NULL
1001 Sawyer Thompson NULL NULL
1002 Xenos Campos NULL NULL
1003 Brandon Castro NULL NULL
1004 Evan Gordon NULL NULL
1005 Macon Hopper NULL NULL
1006 Christian Tucker NULL NULL
1007 Rafael Erickson NULL NULL
1008 Brent Roth NULL NULL
1009 Declan Hooper 745651 ULQ37MGX7MW
1010 Neil Leon NULL NULL
RIGHT OUTER JOIN
The RIGHT JOIN keyword returns all rows from the right table , with the matching rows in the left table. The result is NULL in the left side when there is no match.
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c right outer join orders o on c.custid=o.fk_cust_id;
Total MapReduce jobs = 1
setting HADOOP_USER_NAME rks
Execution log at: /tmp/rks/.log
2015-01-24 05:10:50 Starting to launch local task to process map join; maximum memory = 932118528
2015-01-24 05:10:51 Processing rows: 101 Hashtable size: 101 Memory usage: 7971568 rate: 0.009
2015-01-24 05:10:51 Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-10-44_697_521683568687053567/-local-10002/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
2015-01-24 05:10:51 Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-10-44_697_521683568687053567/-local-10002/HashTable-Stage-3/MapJoin-mapfile20--.hashtable File size: 6317
2015-01-24 05:10:51 End of local task; Time Taken: 0.712 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501241609_0019, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0019
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0019
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2015-01-24 17:10:58,019 Stage-3 map = 0%, reduce = 0%
2015-01-24 17:11:01,064 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.91 sec
2015-01-24 17:11:02,067 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.91 sec
2015-01-24 17:11:03,073 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 0.91 sec
MapReduce Total cumulative CPU time: 910 msec
Ended Job = job_201501241609_0019
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 0.91 sec HDFS Read: 5205 HDFS Write: 2668 SUCCESS
Total MapReduce CPU Time Spent: 910 msec
OK
NULL NULL NULL track_id
1009 Declan Hooper 745651 ULQ37MGX7MW
1018 Nathan Mcintyre 745652 NJJ84QEM7GO
1027 Troy Griffith 745653 UHX76SFB1EP
1036 Clark Frazier 745654 ZZH74UDJ6IC
1045 Tad Cross 745655 TFV46VBX1ZI
1054 Gannon Bradshaw 745656 EHJ68BHA6UU
1063 Walter Figueroa 745657 BNU38NFJ6FO
1072 Brady Mcclure 745658 NBK17XMP9XC
1081 Porter Bowers 745659 XHB61DLY6IK
1090 Jakeem Knight 745660 WNN67FXM2NC
1099 Macaulay Armstrong 745661 VXI39DIZ3HU
NULL NULL 745662 DKP00ZCS6FU
NULL NULL 745663 YSJ42ZXP5ZG
NULL NULL 745664 OBT90SWM3FN
NULL NULL 745665 YVJ22BYO5DT
NULL NULL 745666 DXY85QAL1BE
NULL NULL 745667 THJ12NCF3KR
FULL OUTER JOIN
The FULL OUTER JOIN returns all rows from the left table and from the right table.The FULL OUTER JOIN combines the result of both LEFT and RIGHT joins.
hive>
> select c.custid,c.custname,o.orderid,o.tracking_id from customers c full outer join orders o on c.custid=o.fk_cust_id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201501241609_0020, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0020
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0020
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2015-01-24 17:12:46,443 Stage-1 map = 0%, reduce = 0%
2015-01-24 17:12:50,465 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.08 sec
2015-01-24 17:12:51,470 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.08 sec
2015-01-24 17:12:52,478 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:53,488 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:54,498 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:55,504 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:56,512 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:57,521 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec
2015-01-24 17:12:58,531 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 2.89 sec
2015-01-24 17:12:59,538 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 2.89 sec
2015-01-24 17:13:00,545 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.85 sec
2015-01-24 17:13:01,551 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.85 sec
2015-01-24 17:13:02,560 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.85 sec
MapReduce Total cumulative CPU time: 3 seconds 850 msec
Ended Job = job_201501241609_0020
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 3.85 sec HDFS Read: 13774 HDFS Write: 4872 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 850 msec
OK
NULL cust_name NULL NULL
NULL NULL NULL track_id
1001 Sawyer Thompson NULL NULL
1002 Xenos Campos NULL NULL
1003 Brandon Castro NULL NULL
1004 Evan Gordon NULL NULL
1005 Macon Hopper NULL NULL
1006 Christian Tucker NULL NULL
1007 Rafael Erickson NULL NULL
1008 Brent Roth NULL NULL
1009 Declan Hooper 745651 ULQ37MGX7MW
1010 Neil Leon NULL NULL
1011 Lionel Vaughan NULL NULL
1012 Dillon Johns NULL NULL
1013 Davis Fisher NULL NULL
1014 Isaac Fields NULL NULL
1015 Micah Figueroa NULL NULL
1016 Burke Merrill NULL NULL
1017 Felix Ward NULL NULL
1018 Nathan Mcintyre 745652 NJJ84QEM7GO
1019 Perry Bullock NULL NULL
1020 Ali Kramer NULL NULL
1021 Timothy Avila NULL NULL
1022 Jason Wolfe NULL NULL