Hive UDF Performance Impact Use Case

In this article, a regular hive query will be compared with UDF which has same purpose. It is especially important to see the effect it will have as a performance.

Sample data which will be used for analytics can be downloaded from here.

The Hive metastore schema for this data;

Create table search_results (
userid string,
unix_time bigint,
hotelresults map<int,struct<advertisers:map<string,array<struct<eurocents:int,breakfast:boolean>>>>>)

Cloudera Quickstart VM 5.13 is used as a test environment.

Use case is divided by 4 steps in order to be make it more understandable.

Task#1: Load the data into Hadoop, and perform a count of the records. List the steps you took to get the data in and to make the count.

Hive Metastore table has been created on default schema using below HQL.

Create table search_results (
 userid string,
 unix_time bigint,
 hotelresults map<int,struct<advertisers:map<string,array<struct<eurocents:int,breakfast:boolean>>>>>);

data.dat file has been copied to Cloudera VM home directory using MobaXTerm FTP. Below script has been used for copying data to hdfs.

hdfs dfs -put /home/cloudera/data.dat /user/hive/warehouse/search_results

Note: Hue UI can be used for copying data to HDFS as well.

Below query has been used for data row counting.

select count(*) from search_results;

Result:1012

Task#2: Execute a query to find per advertiser and hotel the cheapest price that was offered. Provide the query, and the result you got.

Below HQL has been used.

SELECT main.hotel_id,
       advs.advertiser,
       min(cost.eurocents) AS min_cost
FROM search_results LATERAL VIEW explode(hotelresults) main AS hotel_id,
                                 advertisers LATERAL VIEW explode(advertisers.advertisers) advs AS advertiser,
                                                          costs LATERAL VIEW inline(costs) cost AS eurocents,
                                                                             breakfast
GROUP BY main.hotel_id,
         advs.advertiser
ORDER BY main.hotel_id,min_cost;

Results: Time taken: 38.017 seconds, Fetched: 78 row(s)

main.hotel_idadvs.advertisermin_cost
6032Mercure3804
6032Amoma3804
6032Destinia3814
6032booking.com3852
6032Tui.com3893
6032expedia3924
6033Destinia3808
6033Mercure3810
6033Amoma3815
6033expedia3819
6033Tui.com3840
6033booking.com3845
6035expedia3804
6035Mercure3854
6035booking.com3860
6035Destinia3879
6035Amoma3919
6035Tui.com3970
6036expedia3813
6036Destinia3823
6036Amoma3839
6036booking.com3855
6036Tui.com3884
6036Mercure3965
7045Destinia3805
7045expedia3812
7045Tui.com3839
7045booking.com3847
7045Amoma3847
7045Mercure3879
7046Tui.com3803
7046Destinia3812
7046expedia3814
7046Mercure3836
7046Amoma3857
7046booking.com3961
7047Amoma3805
7047Destinia3823
7047booking.com3832
7047Tui.com3832
7047Mercure3850
7047expedia3945
8001Destinia3800
8001Amoma3808
8001Mercure3829
8001expedia3833
8001booking.com3835
8001Tui.com3892
8002Destinia3802
8002Tui.com3807
8002booking.com3817
8002Mercure3831
8002Amoma3841
8002expedia3869
9089Amoma3804
9089booking.com3814
9089expedia3819
9089Destinia3820
9089Tui.com3859
9089Mercure3863
9090Mercure3802
9090expedia3816
9090Tui.com3837
9090Destinia3839
9090booking.com3884
9090Amoma3927
9091expedia3802
9091Tui.com3819
9091booking.com3846
9091Amoma3853
9091Destinia3871
9091Mercure3875
9092Destinia3807
9092booking.com3807
9092expedia3817
9092Tui.com3824
9092Amoma3893
9092Mercure3910
Table-1. Hotel list having breakfast

Task#3: For each search generate a list containing the cheapest price per hotel that offers breakfast. Again, please provide the query you used and the result.

SELECT a.userid,
       a.unix_time,
       collect_list(a.min_cost) AS min_list
FROM
  (SELECT rawdata.userid,
          rawdata.unix_time,
          hotel_results.hotel_id,
          min(cost.eurocents) AS min_cost
   FROM search_results rawdata LATERAL VIEW explode(hotelresults) hotel_results AS hotel_id,
                                            advertisers LATERAL VIEW explode(advertisers.advertisers) advs AS advertiser,
                                                                     costs LATERAL VIEW inline(costs) cost AS eurocents,
                                                                                        breakfast
   WHERE cost.breakfast=TRUE
   GROUP BY rawdata.userid,
            rawdata.unix_time,
            hotel_results.hotel_id) a
GROUP BY a.userid,
         a.unix_time;

Results: 0 results. Time taken: 21.483 seconds In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer= In order to limit the maximum number of reducers: set hive.exec.reducers.max= In order to set a constant number of reducers: set mapreduce.job.reduces= Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

Task #4: Generate the list from “task 3” with more efficiency (think about UDFs!). Provide all resources to understand your solution, and measure the difference in execution times.

In order to deserialize “hotelresults” data to map<int,struct<advertisers:map<string,array<structeurocents:int,breakfast:boolean>>>>, Apache Hive SerDe API is used. The executable jar which was achieved after running mvn clean package command has been copied to /user/cloudera/udf HDFS path.

hdfs dfs -put /home/cloudera/data-analysis-1.0-SNAPSHOT-jar-with-dependencies.jar /user/cloudera/udf

Note: Hue UI can be used for copying data to HDFS as well.

In the hive shell, below commands are used for adding and defining developed UDF

add jar hdfs://localhost:8020/user/cloudera/udf/data-analysis-1.0-SNAPSHOT-jar-with-dependencies.jar;
create function fn_min_cost_with_breakfast AS 'com.company.hive.udf.CheapestPricePerHotel';

UDF returns each minimum cost of the hotel, if the breakfast is provided. If it is not provided, returns as null

SELECT userid,
          unix_time,
          fn_min_cost_with_breakfast(hotelresults) AS min_cost,hotelresults
   FROM default.search_results where fn_min_cost_with_breakfast(hotelresults) is not null;

Time taken: 14.755 seconds, Fetched: 0 row(s)

Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there’s no reduce operator Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

Conclusion

In task 3, in order to find breakfast hotel, grouping is used. It creates shuffling for reducer. But, in task 4, using UDF, in collection api level, minimum cost has been tried to be found, and eliminates the shuffling for reducer.

Source code can be download from my github