Vertica installation on Ubuntu 16.04 LTS

Vertica is currently supported on Ubuntu 14.04 LTS, but not yet on 16.04.

With a few hacks, however, it installs on Ubuntu 16.04 LTS as well.

Install required packages

apt-get install mcelog dialog

Fake the Debian version:

cp /etc/debian_version /etc/debian_version.org
echo "jessie/sid" > /etc/debian_version
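The two commands above can be wrapped in small helpers so that the change is easy to undo later. This is only a sketch: the function names are made up for illustration, and the file path is a parameter so the logic can be tried on a scratch file before touching /etc/debian_version.

```shell
# Hypothetical helpers (not part of Vertica or Ubuntu).
# The target file defaults to /etc/debian_version but can be overridden.

fake_debian_version() {
    file="${1:-/etc/debian_version}"
    # Keep the first backup; never overwrite it with an already faked file.
    if [ ! -e "$file.org" ]; then
        cp "$file" "$file.org"
    fi
    echo "jessie/sid" > "$file"
}

restore_debian_version() {
    file="${1:-/etc/debian_version}"
    # Put the original content back if a backup exists.
    if [ -e "$file.org" ]; then
        cp "$file.org" "$file"
    fi
}
```

Run as root, fake_debian_version without arguments does the same as the two commands above. Calling restore_debian_version after the installer has finished puts the original file back; whether that is safe for later Vertica upgrades is an assumption I have not verified.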

Install Vertica

/opt/vertica/sbin/install_vertica --hosts 127.0.0.1 --failure-threshold NONE

Configure Vertica

sudo su - dbadmin
/opt/vertica/bin/adminTools

Have fun with Vertica Community Edition on Ubuntu 16.04.

Hortonworks Sandbox Tez: java.lang.OutOfMemoryError: Java heap space

Since the Microsoft Azure tutorial about flight delays was running a bit slowly, I tried to run it on a Hortonworks Sandbox (on a notebook with 16 GB of memory).
I used the data for the full year 2013.

Raw Row Count:

select count(*) from delays;
6369494

But running this statement caused an OutOfMemory error:

create table delays_wather
as SELECT regexp_replace(origin_city_name, '''', ''),
        avg(weather_delay)
    FROM delays
    WHERE weather_delay IS NOT NULL GROUP BY origin_city_name;

java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1476381439440_0001_1_00, diagnostics=[Task failed, taskId=task_1476381439440_0001_1_00_000005, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:159) at
….

Changing some memory options prevented the OutOfMemoryError, but then the statement never finished.

After some research (where this link was very helpful) I got the statement to run in less than 16 seconds.
Some default memory parameters seem a bit strange on the Hortonworks Sandbox 2.5 (Docker version).

How I changed the configuration:
YARN:

  • Memory allocated for all YARN containers = 4G
  • Maximum Container Size (Memory) = 2G
  • Minimum Container Size (Memory) = 250M
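One thing to keep in mind with a 250M minimum: as far as I know, the YARN scheduler rounds each memory request up to a multiple of the minimum container size (this is the Capacity Scheduler behaviour; treat it as an assumption for other schedulers). A small sketch of that rounding, with a made-up function name:

```shell
# normalize_request: round a memory request (MB) up to the next multiple
# of the minimum container size (MB). Hypothetical helper for illustration.
normalize_request() {
    request_mb="$1"
    min_mb="$2"
    echo $(( (request_mb + min_mb - 1) / min_mb * min_mb ))
}

normalize_request 800 250    # prints 1000
normalize_request 1024 250   # prints 1250
```

Under that assumption, a request of 1024 MB actually occupies 1250 MB, so only three such containers fit into 4G.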

Tez:

  • tez.am.resource.memory.mb = 800M # slow if set too low, but must stay below the YARN maximum container size
  • tez.task.resource.memory.mb = 1G
  • tez.runtime.io.sort.mb = 270M
  • tez.runtime.unordered.output.buffer.size-mb = 76M
  • tez.task.launch.cmd-opts = -Xmx624m # rule of thumb: at most 80% of tez.task.resource.memory.mb
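The 80% figure for the JVM heap can be worked out with shell arithmetic. The helper below is a hypothetical sketch, not a Tez tool; it just applies a percentage to a container size in MB.

```shell
# heap_for_container: print a heap size (MB) as a percentage of a
# container size (MB). The percentage defaults to 80.
heap_for_container() {
    container_mb="$1"
    percent="${2:-80}"
    echo $(( container_mb * percent / 100 ))
}

heap_for_container 1024   # prints 819
```

For a 1G task container this gives 819 MB, so the -Xmx624m used here stays well below that ceiling and leaves extra room for non-heap memory.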

Hive:

  • HiveServer2 Heap Size = 6G # the default was 96G
  • Metastore Heap Size = 2G # the default was 32G
  • For Map Join, per Map memory threshold = 270M # just limit to a reasonable amount
  • Tez Container Size = 1G
  • For Map Join, per Map memory threshold = 800M # 200M is too low; about 80% of the Tez container size
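The values above depend on each other, so a quick consistency check helps before restarting the services. The sketch below is my own summary of the relationships, not an official rule set: it assumes that every container must fit into the YARN maximum, that the task heap must fit into the task container, and that the sort buffer must fit into the task heap. All values are in MB and the function name is hypothetical.

```shell
# check_memory_settings: print one line per violated rule and return
# non-zero if any rule is violated.
# Arguments (MB): yarn_total yarn_max yarn_min tez_am tez_task task_xmx io_sort
check_memory_settings() {
    yarn_total="$1"; yarn_max="$2"; yarn_min="$3"
    tez_am="$4"; tez_task="$5"; task_xmx="$6"; io_sort="$7"
    status=0
    if [ "$yarn_min" -gt "$yarn_max" ]; then
        echo "minimum container size exceeds maximum container size"; status=1
    fi
    if [ "$yarn_max" -gt "$yarn_total" ]; then
        echo "maximum container size exceeds total YARN memory"; status=1
    fi
    if [ "$tez_am" -gt "$yarn_max" ]; then
        echo "tez.am.resource.memory.mb exceeds maximum container size"; status=1
    fi
    if [ "$tez_task" -gt "$yarn_max" ]; then
        echo "tez.task.resource.memory.mb exceeds maximum container size"; status=1
    fi
    if [ "$task_xmx" -ge "$tez_task" ]; then
        echo "task heap (-Xmx) does not fit into the task container"; status=1
    fi
    if [ "$io_sort" -ge "$task_xmx" ]; then
        echo "tez.runtime.io.sort.mb does not fit into the task heap"; status=1
    fi
    return "$status"
}

# The values from this post: 4G total, 2G max, 250M min, 800M AM,
# 1G task, 624M heap, 270M sort buffer.
check_memory_settings 4096 2048 250 800 1024 624 270 && echo "settings look consistent"
```

With the numbers from this post no rule is violated, so the call prints "settings look consistent".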

That let the notebook (Dell XPS, i7, 16 GB RAM, Ubuntu 16.04) beat the Azure cloud solution, which needed 41 seconds.

Any suggestions and explanations about the memory settings are welcome.