Since the Microsoft Azure tutorial about flight delays was running a bit slowly, I tried to run it on a Hortonworks Sandbox (on a notebook with 16 GB of memory).
I used the data from 2013, the full year.
Raw Row Count:
select count(*) from delays; 6369494
But running this statement caused an OutOfMemory error:
create table delays_wather as SELECT regexp_replace(origin_city_name, '''', ''), avg(weather_delay) FROM delays WHERE weather_delay IS NOT NULL GROUP BY origin_city_name;
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1476381439440_0001_1_00, diagnostics=[Task failed, taskId=task_1476381439440_0001_1_00_000005, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:159) at
Changing some memory options prevented the OutOfMemoryError, but then the statement never finished.
After some research (where this link was very helpful), I got the statement running in less than 16 seconds.
Some memory parameters seem a bit strange on the Hortonworks Sandbox 2.5 (Docker version).
How I changed the configuration:
- Memory allocated for all YARN containers = 4G
- Maximum Container Size (Memory) = 2G
- Minimum Container Size (Memory) = 250M
- tez.am.resource.memory.mb = 800M # if too low it will be slow, but it needs to be lower than the YARN maximum container size
- tez.task.resource.memory.mb = 1G
- tez.runtime.io.sort.mb = 270M
- tez.runtime.unordered.output.buffer.size-mb = 76M
- tez.task.launch.cmd-opts = -Xmx624m # 80% of tez.task.resource.memory.mb
- HiveServer2 Heap Size = 6G # the default was 96G
- Metastore Heap Size = 2G # the default was 32G
- For Map Join, per Map memory threshold = 270M # just limit to a reasonable amount
- Tez Container Size = 1G
- For Map Join, per Map memory threshold = 800M # 200M is too low, 80% of Tez Container
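For anyone who wants to try the Hive and Tez values without touching the cluster configuration, here is a minimal session-level sketch. It rests on two assumptions: that hive.tez.container.size is the property behind "Tez Container Size", and that hive.tez.java.opts plays the same role for Hive queries as tez.task.launch.cmd-opts in the list above. I left out the map join threshold because the list gives two different values for it.

```sql
-- Session-level sketch only; sizes are in MB and mirror the list above.
SET hive.execution.engine=tez;

-- Assumption: this is the property behind "Tez Container Size = 1G".
SET hive.tez.container.size=1024;

-- Assumption: Hive-side counterpart of tez.task.launch.cmd-opts = -Xmx624m.
SET hive.tez.java.opts=-Xmx624m;

-- Tez runtime buffers, names and values taken from the list above.
SET tez.runtime.io.sort.mb=270;
SET tez.runtime.unordered.output.buffer.size-mb=76;
```

The YARN container limits and the HiveServer2 and Metastore heap sizes are service-level settings, so SET does not affect them; they still have to be changed in the cluster configuration and need a service restart.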
That made the notebook (Dell XPS, i7, 16 GB RAM, Ubuntu 16.04) beat the Azure cloud solution, which needed 41 seconds.
Any suggestions and explanations about the memory settings are welcome.