java.lang.ArrayIndexOutOfBoundsException without stack trace

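Recently I came across a java.lang.ArrayIndexOutOfBoundsException that had no stack trace, which made it difficult to find where the exception was thrown.
The “problem” was the OmitStackTraceInFastThrow JVM option, which is enabled by default for performance reasons: once the same built-in exception has been thrown many times from the same place, the JIT-compiled code throws a preallocated exception that carries no stack trace.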

It can be disabled with this JVM option:

-XX:-OmitStackTraceInFastThrow

After that you get a regular stack trace and can find the problem.
The best solution is still to avoid the ArrayIndexOutOfBoundsException through proper coding. If the exception is never thrown, the option makes no difference to performance.
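
To see the effect in isolation, here is a minimal sketch that throws the same ArrayIndexOutOfBoundsException from the same place many times and reports when the stack trace first comes back empty. The class and method names (FastThrowDemo, firstEmptyTrace, hasTrace) are made up for this illustration, and whether and when the trace disappears depends on the JVM, its JIT compiler and the iteration count, so treat the result as an observation rather than a guarantee.

```java
// Hypothetical demo class; the names are illustrative and not from any library.
final class FastThrowDemo {

    // Reads out of bounds whenever index >= data.length.
    static int read(int[] data, int index) {
        return data[index];
    }

    // True if the throwable still carries at least one stack frame.
    static boolean hasTrace(Throwable t) {
        return t.getStackTrace().length > 0;
    }

    // Throws the same exception from the same place over and over and
    // returns the first iteration whose exception had no stack trace,
    // or -1 if every exception kept its trace.
    static int firstEmptyTrace(int iterations) {
        int[] data = new int[4];
        for (int i = 0; i < iterations; i++) {
            try {
                read(data, 10);
            } catch (ArrayIndexOutOfBoundsException e) {
                if (!hasTrace(e)) {
                    return i;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int hit = firstEmptyTrace(1_000_000);
        if (hit < 0) {
            System.out.println("every exception had a stack trace");
        } else {
            System.out.println("stack trace first missing at iteration " + hit);
        }
    }
}
```

Running it once with default options and once with -XX:-OmitStackTraceInFastThrow (for example java -XX:-OmitStackTraceInFastThrow FastThrowDemo.java on Java 11 or later) should show the difference, assuming the loop gets hot enough to be JIT-compiled.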

Hortonworks Sandbox Tez: java.lang.OutOfMemoryError: Java heap space

Since the Microsoft Azure tutorial about flight delays was running a bit slowly, I tried to run it on a Hortonworks Sandbox (on a notebook with 16GB memory).
I used the data from 2013, the full year.

Raw Row Count:

select count(*) from delays;
6369494

But running this statement caused an OutOfMemory error:

create table delays_wather
as SELECT regexp_replace(origin_city_name, '''', ''),
        avg(weather_delay)
    FROM delays
    WHERE weather_delay IS NOT NULL GROUP BY origin_city_name;

java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1476381439440_0001_1_00, diagnostics=[Task failed, taskId=task_1476381439440_0001_1_00_000005, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:159) at
….

Changing some memory options prevented the OutOfMemoryError, but the statement would never finish.

After some research (where this link was very helpful) I got the statement running in less than 16 seconds.
Some of the default memory parameters seem a bit strange on the Hortonworks Sandbox 2.5 (Docker version).

How I changed the configuration:
YARN:

  • Memory allocated for all YARN containers = 4G
  • Maximum Container Size (Memory) = 2G
  • Minimum Container Size (Memory) = 250M

Tez:

  • tez.am.resource.memory.mb = 800M # if too low it will be slow, but it needs to be lower than the YARN maximum container size
  • tez.task.resource.memory.mb = 1G
  • tez.runtime.io.sort.mb = 270M
  • tez.runtime.unordered.output.buffer.size-mb = 76M
  • tez.task.launch.cmd-opts = -Xmx624m # 80% of tez.task.resource.memory.mb

Hive:

  • HiveServer2 Heap Size = 6G # it was 96G by default
  • Metastore Heap Size = 2G # it was 32G by default
  • For Map Join, per Map memory threshold = 270M # just limit to a reasonable amount
  • Tez Container Size = 1G
  • For Map Join, per Map memory threshold = 800M # 200M is too low, 80% of Tez Container
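
The "80% of the container" rule of thumb used in the comments above can be checked with a few lines of arithmetic. This is only a sketch: the class and method names (ContainerHeap, heapMb, xmxOption) are made up, and it assumes that 1G means 1024 MB. Under that assumption, 80% of a 1G container is 819 MB, so the -Xmx624m listed above is closer to 61% of the task memory.

```java
// Hypothetical helper; the names are illustrative and not part of Tez or Hive.
final class ContainerHeap {

    // Heap size in MB as a whole-number percentage of the container size,
    // rounded down.
    static long heapMb(long containerMb, int percent) {
        return containerMb * percent / 100;
    }

    // Formats the heap size as a JVM -Xmx option.
    static String xmxOption(long containerMb, int percent) {
        return "-Xmx" + heapMb(containerMb, percent) + "m";
    }

    public static void main(String[] args) {
        // Assumption: a 1G container is 1024 MB.
        System.out.println(xmxOption(1024, 80));
    }
}
```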

With these settings the notebook (Dell XPS, i7, 16GB RAM, Ubuntu 16.04) beat the Azure cloud solution, which needed 41 seconds.

Any suggestions and explanations about the memory settings are welcome.