In this article, a regular hive query will be compared with UDF which has same purpose. It is especially important to see the effect it will have as a performance.
Sample data which will be used for analytics can be downloaded from here.
The Hive metastore schema for this data;
Create table search_results (
userid string,
unix_time bigint,
hotelresults map<int,struct<advertisers:map<string,array<struct<eurocents:int,breakfast:boolean>>>>>)
Cloudera Quickstart VM 5.13 is used as a test environment.
Use case is divided by 4 steps in order to be make it more understandable.
Task#1: Load the data into Hadoop, and perform a count of the records. List the steps you took to get the data in and to make the count.
Hive Metastore table has been created on default schema using below HQL.
Create table search_results (
userid string,
unix_time bigint,
hotelresults map<int,struct<advertisers:map<string,array<struct<eurocents:int,breakfast:boolean>>>>>);
data.dat file has been copied to Cloudera VM home directory using MobaXTerm FTP. Below script has been used for copying data to hdfs.
hdfs dfs -put /home/cloudera/data.dat /user/hive/warehouse/search_results
Note: Hue UI can be used for copying data to HDFS as well.
Below query has been used for data row counting.
select count(*) from search_results;
Result:1012
Task#2: Execute a query to find per advertiser and hotel the cheapest price that was offered. Provide the query, and the result you got.
Below HQL has been used.
SELECT main.hotel_id,
advs.advertiser,
min(cost.eurocents) AS min_cost
FROM search_results LATERAL VIEW explode(hotelresults) main AS hotel_id,
advertisers LATERAL VIEW explode(advertisers.advertisers) advs AS advertiser,
costs LATERAL VIEW inline(costs) cost AS eurocents,
breakfast
GROUP BY main.hotel_id,
advs.advertiser
ORDER BY main.hotel_id,min_cost;
Results: Time taken: 38.017 seconds, Fetched: 78 row(s)
main.hotel_id | advs.advertiser | min_cost |
---|---|---|
6032 | Mercure | 3804 |
6032 | Amoma | 3804 |
6032 | Destinia | 3814 |
6032 | booking.com | 3852 |
6032 | Tui.com | 3893 |
6032 | expedia | 3924 |
6033 | Destinia | 3808 |
6033 | Mercure | 3810 |
6033 | Amoma | 3815 |
6033 | expedia | 3819 |
6033 | Tui.com | 3840 |
6033 | booking.com | 3845 |
6035 | expedia | 3804 |
6035 | Mercure | 3854 |
6035 | booking.com | 3860 |
6035 | Destinia | 3879 |
6035 | Amoma | 3919 |
6035 | Tui.com | 3970 |
6036 | expedia | 3813 |
6036 | Destinia | 3823 |
6036 | Amoma | 3839 |
6036 | booking.com | 3855 |
6036 | Tui.com | 3884 |
6036 | Mercure | 3965 |
7045 | Destinia | 3805 |
7045 | expedia | 3812 |
7045 | Tui.com | 3839 |
7045 | booking.com | 3847 |
7045 | Amoma | 3847 |
7045 | Mercure | 3879 |
7046 | Tui.com | 3803 |
7046 | Destinia | 3812 |
7046 | expedia | 3814 |
7046 | Mercure | 3836 |
7046 | Amoma | 3857 |
7046 | booking.com | 3961 |
7047 | Amoma | 3805 |
7047 | Destinia | 3823 |
7047 | booking.com | 3832 |
7047 | Tui.com | 3832 |
7047 | Mercure | 3850 |
7047 | expedia | 3945 |
8001 | Destinia | 3800 |
8001 | Amoma | 3808 |
8001 | Mercure | 3829 |
8001 | expedia | 3833 |
8001 | booking.com | 3835 |
8001 | Tui.com | 3892 |
8002 | Destinia | 3802 |
8002 | Tui.com | 3807 |
8002 | booking.com | 3817 |
8002 | Mercure | 3831 |
8002 | Amoma | 3841 |
8002 | expedia | 3869 |
9089 | Amoma | 3804 |
9089 | booking.com | 3814 |
9089 | expedia | 3819 |
9089 | Destinia | 3820 |
9089 | Tui.com | 3859 |
9089 | Mercure | 3863 |
9090 | Mercure | 3802 |
9090 | expedia | 3816 |
9090 | Tui.com | 3837 |
9090 | Destinia | 3839 |
9090 | booking.com | 3884 |
9090 | Amoma | 3927 |
9091 | expedia | 3802 |
9091 | Tui.com | 3819 |
9091 | booking.com | 3846 |
9091 | Amoma | 3853 |
9091 | Destinia | 3871 |
9091 | Mercure | 3875 |
9092 | Destinia | 3807 |
9092 | booking.com | 3807 |
9092 | expedia | 3817 |
9092 | Tui.com | 3824 |
9092 | Amoma | 3893 |
9092 | Mercure | 3910 |
Task#3: For each search generate a list containing the cheapest price per hotel that offers breakfast. Again, please provide the query you used and the result.
SELECT a.userid,
a.unix_time,
collect_list(a.min_cost) AS min_list
FROM
(SELECT rawdata.userid,
rawdata.unix_time,
hotel_results.hotel_id,
min(cost.eurocents) AS min_cost
FROM search_results rawdata LATERAL VIEW explode(hotelresults) hotel_results AS hotel_id,
advertisers LATERAL VIEW explode(advertisers.advertisers) advs AS advertiser,
costs LATERAL VIEW inline(costs) cost AS eurocents,
breakfast
WHERE cost.breakfast=TRUE
GROUP BY rawdata.userid,
rawdata.unix_time,
hotel_results.hotel_id) a
GROUP BY a.userid,
a.unix_time;
Results: 0 results. Time taken: 21.483 seconds In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer= In order to limit the maximum number of reducers: set hive.exec.reducers.max= In order to set a constant number of reducers: set mapreduce.job.reduces= Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
Task #4: Generate the list from “task 3” with more efficiency (think about UDFs!). Provide all resources to understand your solution, and measure the difference in execution times.
In order to deserialize “hotelresults” data to map<int,struct<advertisers:map<string,array<structeurocents:int,breakfast:boolean>>>>, Apache Hive SerDe API is used. The executable jar which was achieved after running mvn clean package command has been copied to /user/cloudera/udf HDFS path.
hdfs dfs -put /home/cloudera/data-analysis-1.0-SNAPSHOT-jar-with-dependencies.jar /user/cloudera/udf
Note: Hue UI can be used for copying data to HDFS as well.
In the hive shell, below commands are used for adding and defining developed UDF
add jar hdfs://localhost:8020/user/cloudera/udf/data-analysis-1.0-SNAPSHOT-jar-with-dependencies.jar;
create function fn_min_cost_with_breakfast AS 'com.company.hive.udf.CheapestPricePerHotel';
UDF returns each minimum cost of the hotel, if the breakfast is provided. If it is not provided, returns as null
SELECT userid,
unix_time,
fn_min_cost_with_breakfast(hotelresults) AS min_cost,hotelresults
FROM default.search_results where fn_min_cost_with_breakfast(hotelresults) is not null;
Time taken: 14.755 seconds, Fetched: 0 row(s)
Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there’s no reduce operator Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
Conclusion
In task 3, in order to find breakfast hotel, grouping is used. It creates shuffling for reducer. But, in task 4, using UDF, in collection api level, minimum cost has been tried to be found, and eliminates the shuffling for reducer.
Source code can be download from my github