Hive Supports the SQL Join but only equi-joins are supported, A SQL join is used to combine two or more than two table based on some criteria.the most used join is the Inner Join which return the all the rows between two datasets where join condition met.
lets see in the example how Inner join works in Hive. to demonstrate it we have two data sets the first one is the customers dataset which hold the information of customers.
customers.csv
orders.csv
we have already created the two tables in the hive named as customers and orders which hold the data for the customers and the orders.
customers table
INNER JOIN
The INNER JOIN keyword selects all rows from both tables as long as there is a match between the columns in both tables.
LEFT OUTER JOIN
The LEFT JOIN keyword returns all rows from the left table with the matching rows in the right table. The result is NULL in the right side when there is no match.
RIGHT OUTER JOIN
The RIGHT JOIN keyword returns all rows from the right table , with the matching rows in the left table. The result is NULL in the left side when there is no match.
FULL OUTER JOIN
The FULL OUTER JOIN returns all rows from the left table and from the right table.The FULL OUTER JOIN combines the result of both LEFT and RIGHT joins.
lets see in the example how Inner join works in Hive. to demonstrate it we have two data sets the first one is the customers dataset which hold the information of customers.
customers.csv
cust_id,cust_name,ship_address,phone,email 1001,Sawyer Thompson,"-16.44456 115.92975",1-808-310-6814,faucibus@lacinia.net 1002,Xenos Campos,"5.69702 -164.57551",1-872-151-8210,dolor.Fusce@Nunc.com 1003,Brandon Castro,"-25.12774 -151.86179",1-283-827-7635,parturient@aliquameu.org 1004,Evan Gordon,"-20.12089 -85.3661",1-801-885-3833,Fusce.fermentum@Integereu.ca 1005,Macon Hopper,"22.30371 134.10815",1-354-448-6576,est.congue@acturpisegestas.net 1006,Christian Tucker,"73.86819 114.71156",1-106-627-3799,at.egestas.a@Fuscealiquam.netthe other dataset is orders dataset which hold the information about the order placed by the customers.
orders.csv
orderid,cust_id,item,order_dt,track_id 745651,1009,Cephalexin,08/09/2013,ULQ37MGX7MW 745652,1018,Hydrochlorothiazide,01/01/2015,NJJ84QEM7GO 745653,1027,Sertraline HCl,07/13/2014,UHX76SFB1EP 745654,1036,Simvastatin,01/05/2014,ZZH74UDJ6IC 745655,1045,Lisinopril,04/22/2014,TFV46VBX1ZI 745656,1054,Ibuprofen (Rx),08/22/2015,EHJ68BHA6UU 745657,1063,Suboxone,12/10/2014,BNU38NFJ6FO
we have already created the two tables in the hive named as customers and orders which hold the data for the customers and the orders.
customers table
hive> describe customers; OK custid int customer id custname string customer name address string customer Address phone string customer phone email string customer email Time taken: 0.524 seconds, Fetched: 5 row(s)orders table
hive> describe orders; OK orderid int Order ID fk_cust_id int Cust ID reffering to customers item string Order Item order_dt string Order Date tracking_id string Tracking ID for Order Time taken: 0.732 seconds, Fetched: 5 row(s)
INNER JOIN
The INNER JOIN keyword selects all rows from both tables as long as there is a match between the columns in both tables.
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c inner join orders o on c.custid=o.fk_cust_id > ; Total MapReduce jobs = 1 setting HADOOP_USER_NAME rks Execution log at: /tmp/rks/.log 2015-01-24 05:03:25 Starting to launch local task to process map join; maximum memory = 932118528 2015-01-24 05:03:25 Processing rows: 101 Hashtable size: 101 Memory usage: 8029040 rate: 0.009 2015-01-24 05:03:25 Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-03-19_651_4336759746543942005/-local-10002/HashTable-Stage-3/MapJoin-mapfile01--.hashtable 2015-01-24 05:03:25 Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-03-19_651_4336759746543942005/-local-10002/HashTable-Stage-3/MapJoin-mapfile01--.hashtable File size: 6249 2015-01-24 05:03:25 End of local task; Time Taken: 0.751 sec. Execution completed successfully Mapred Local Task Succeeded . Convert the Join into MapJoin Mapred Local Task Succeeded . Convert the Join into MapJoin Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201501241609_0017, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0017 Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0017 Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0 2015-01-24 17:03:31,889 Stage-3 map = 0%, reduce = 0% 2015-01-24 17:03:34,902 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.92 sec 2015-01-24 17:03:35,906 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.92 sec 2015-01-24 17:03:36,916 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 0.92 sec MapReduce Total cumulative CPU time: 920 msec Ended Job = job_201501241609_0017 MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 0.92 sec HDFS Read: 8569 HDFS Write: 425 SUCCESS Total MapReduce CPU Time Spent: 920 msec OK 1009 Declan Hooper 745651 ULQ37MGX7MW 1018 Nathan Mcintyre 745652 NJJ84QEM7GO 1027 Troy Griffith 745653 UHX76SFB1EP 1036 Clark Frazier 745654 ZZH74UDJ6IC 1045 Tad Cross 745655 TFV46VBX1ZI 1054 Gannon Bradshaw 745656 EHJ68BHA6UU 1063 Walter Figueroa 745657 BNU38NFJ6FO 1072 Brady Mcclure 745658 NBK17XMP9XC 1081 Porter Bowers 745659 XHB61DLY6IK 1090 Jakeem Knight 745660 WNN67FXM2NC 1099 Macaulay Armstrong 745661 VXI39DIZ3HU Time taken: 17.391 seconds, Fetched: 11 row(s) hive>
LEFT OUTER JOIN
The LEFT JOIN keyword returns all rows from the left table with the matching rows in the right table. The result is NULL in the right side when there is no match.
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c left outer join orders o on c.custid=o.fk_cust_id; Total MapReduce jobs = 1 setting HADOOP_USER_NAME rks Execution log at: /tmp/rks/.log 2015-01-24 05:08:40 Starting to launch local task to process map join; maximum memory = 932118528 2015-01-24 05:08:41 Processing rows: 101 Hashtable size: 101 Memory usage: 8133752 rate: 0.009 2015-01-24 05:08:41 Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-08-34_361_1900203016678725125/-local-10002/HashTable-Stage-3/MapJoin-mapfile11--.hashtable 2015-01-24 05:08:41 Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-08-34_361_1900203016678725125/-local-10002/HashTable-Stage-3/MapJoin-mapfile11--.hashtable File size: 6249 2015-01-24 05:08:41 End of local task; Time Taken: 0.908 sec. Execution completed successfully Mapred Local Task Succeeded . Convert the Join into MapJoin Mapred Local Task Succeeded . Convert the Join into MapJoin Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201501241609_0018, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0018 Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0018 Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0 2015-01-24 17:08:48,387 Stage-3 map = 0%, reduce = 0% 2015-01-24 17:08:51,396 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.88 sec 2015-01-24 17:08:52,400 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.88 sec 2015-01-24 17:08:53,408 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 0.88 sec MapReduce Total cumulative CPU time: 880 msec Ended Job = job_201501241609_0018 MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 0.88 sec HDFS Read: 8569 HDFS Write: 2629 SUCCESS Total MapReduce CPU Time Spent: 880 msec OK NULL cust_name NULL NULL 1001 Sawyer Thompson NULL NULL 1002 Xenos Campos NULL NULL 1003 Brandon Castro NULL NULL 1004 Evan Gordon NULL NULL 1005 Macon Hopper NULL NULL 1006 Christian Tucker NULL NULL 1007 Rafael Erickson NULL NULL 1008 Brent Roth NULL NULL 1009 Declan Hooper 745651 ULQ37MGX7MW 1010 Neil Leon NULL NULL
RIGHT OUTER JOIN
The RIGHT JOIN keyword returns all rows from the right table , with the matching rows in the left table. The result is NULL in the left side when there is no match.
hive> select c.custid,c.custname,o.orderid,o.tracking_id from customers c right outer join orders o on c.custid=o.fk_cust_id; Total MapReduce jobs = 1 setting HADOOP_USER_NAME rks Execution log at: /tmp/rks/.log 2015-01-24 05:10:50 Starting to launch local task to process map join; maximum memory = 932118528 2015-01-24 05:10:51 Processing rows: 101 Hashtable size: 101 Memory usage: 7971568 rate: 0.009 2015-01-24 05:10:51 Dump the hashtable into file: file:/tmp/rks/hive_2015-01-24_17-10-44_697_521683568687053567/-local-10002/HashTable-Stage-3/MapJoin-mapfile20--.hashtable 2015-01-24 05:10:51 Upload 1 File to: file:/tmp/rks/hive_2015-01-24_17-10-44_697_521683568687053567/-local-10002/HashTable-Stage-3/MapJoin-mapfile20--.hashtable File size: 6317 2015-01-24 05:10:51 End of local task; Time Taken: 0.712 sec. Execution completed successfully Mapred Local Task Succeeded . Convert the Join into MapJoin Mapred Local Task Succeeded . Convert the Join into MapJoin Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201501241609_0019, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0019 Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0019 Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0 2015-01-24 17:10:58,019 Stage-3 map = 0%, reduce = 0% 2015-01-24 17:11:01,064 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.91 sec 2015-01-24 17:11:02,067 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.91 sec 2015-01-24 17:11:03,073 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 0.91 sec MapReduce Total cumulative CPU time: 910 msec Ended Job = job_201501241609_0019 MapReduce Jobs Launched: Job 0: Map: 1 Cumulative CPU: 0.91 sec HDFS Read: 5205 HDFS Write: 2668 SUCCESS Total MapReduce CPU Time Spent: 910 msec OK NULL NULL NULL track_id 1009 Declan Hooper 745651 ULQ37MGX7MW 1018 Nathan Mcintyre 745652 NJJ84QEM7GO 1027 Troy Griffith 745653 UHX76SFB1EP 1036 Clark Frazier 745654 ZZH74UDJ6IC 1045 Tad Cross 745655 TFV46VBX1ZI 1054 Gannon Bradshaw 745656 EHJ68BHA6UU 1063 Walter Figueroa 745657 BNU38NFJ6FO 1072 Brady Mcclure 745658 NBK17XMP9XC 1081 Porter Bowers 745659 XHB61DLY6IK 1090 Jakeem Knight 745660 WNN67FXM2NC 1099 Macaulay Armstrong 745661 VXI39DIZ3HU NULL NULL 745662 DKP00ZCS6FU NULL NULL 745663 YSJ42ZXP5ZG NULL NULL 745664 OBT90SWM3FN NULL NULL 745665 YVJ22BYO5DT NULL NULL 745666 DXY85QAL1BE NULL NULL 745667 THJ12NCF3KR
FULL OUTER JOIN
The FULL OUTER JOIN returns all rows from the left table and from the right table.The FULL OUTER JOIN combines the result of both LEFT and RIGHT joins.
hive> > select c.custid,c.custname,o.orderid,o.tracking_id from customers c full outer join orders o on c.custid=o.fk_cust_id; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapred.reduce.tasks=<number> Starting Job = job_201501241609_0020, Tracking URL = http://RKS-PC:50030/jobdetails.jsp?jobid=job_201501241609_0020 Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201501241609_0020 Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1 2015-01-24 17:12:46,443 Stage-1 map = 0%, reduce = 0% 2015-01-24 17:12:50,465 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.08 sec 2015-01-24 17:12:51,470 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.08 sec 2015-01-24 17:12:52,478 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec 2015-01-24 17:12:53,488 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec 2015-01-24 17:12:54,498 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec 2015-01-24 17:12:55,504 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec 2015-01-24 17:12:56,512 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec 2015-01-24 17:12:57,521 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.2 sec 2015-01-24 17:12:58,531 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 2.89 sec 2015-01-24 17:12:59,538 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 2.89 sec 2015-01-24 17:13:00,545 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.85 sec 2015-01-24 17:13:01,551 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.85 sec 2015-01-24 17:13:02,560 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.85 sec MapReduce Total cumulative CPU time: 3 seconds 850 msec Ended Job = job_201501241609_0020 MapReduce Jobs Launched: Job 0: Map: 2 Reduce: 1 Cumulative CPU: 3.85 sec HDFS Read: 13774 HDFS Write: 4872 SUCCESS Total MapReduce CPU Time Spent: 3 seconds 850 msec OK NULL cust_name NULL NULL NULL NULL NULL track_id 1001 Sawyer Thompson NULL NULL 1002 Xenos Campos NULL NULL 1003 Brandon Castro NULL NULL 1004 Evan Gordon NULL NULL 1005 Macon Hopper NULL NULL 1006 Christian Tucker NULL NULL 1007 Rafael Erickson NULL NULL 1008 Brent Roth NULL NULL 1009 Declan Hooper 745651 ULQ37MGX7MW 1010 Neil Leon NULL NULL 1011 Lionel Vaughan NULL NULL 1012 Dillon Johns NULL NULL 1013 Davis Fisher NULL NULL 1014 Isaac Fields NULL NULL 1015 Micah Figueroa NULL NULL 1016 Burke Merrill NULL NULL 1017 Felix Ward NULL NULL 1018 Nathan Mcintyre 745652 NJJ84QEM7GO 1019 Perry Bullock NULL NULL 1020 Ali Kramer NULL NULL 1021 Timothy Avila NULL NULL 1022 Jason Wolfe NULL NULL
No comments:
Post a Comment