Quantcast
Channel: Hue, the self service open source Analytics Workbench for browsing, querying and visualizing data interactively
Viewing all articles
Browse latest Browse all 171

Hive and Impala queries life cycle

$
0
0

Hue is used quite a lot for its Hive and Impala SQL editors:

impala-ui

But what happens to the query results? How long are they kept? Why do they disappear sometimes? Why are some Impala queries are still “in flight” even if they are completed?   Each query is using some resources in Impala or HiveServer2. When the users submit a lot of queries, they are going to add up and crash the servers if nothing is done. A lof of improvements came in Hue 3 and CDH4.6 as Hue started to close automatically all the metadata queries. Here are the latest settings that you can tweak:

Impala

Hue tries to close the query when the user navigates away from the result page (as queries are generally fast, it is ok to close them quick). However, if the user never comes back checking the result of the query or never close the page, the query is going to stay.   Starting in Hue 3.7 and C5.2 (with HUE-2251), Impala is going to automatically expire the queries idle for than 10 minutes with the query_timeout_s property.

[impala]
  # If QUERY_TIMEOUT_S > 0, the query will be timed out (i.e. cancelled) if Impala does not do any work (compute or send back results) for that query within QUERY_TIMEOUT_S seconds.
  query_timeout_s=600

Until this version, the only alternative workaround to close all the queries, is to restart Hue (or Impala).

Hive

Hue never closes the Hive queries by default (as some queries can take hours of processing time). Also if your query volume is low (e.g. < a few hundreds a day) and you restart HiveServer2 every week, you are probably not affected. To get the same behavior as Impala (and close the query when the user leaves the page), switch on in the hue.ini:

[beeswax]
  # Hue will try to close the Hive query when the user leaves the editor page.
  #This will free all the query resources in HiveServer2, but also make its results inaccessible.
  close_queries=true

Starting in CDH5 and CDH4.6 (with HiveServer2), some close_query and close_session commands were added to Hue.

build/env/bin/hue close_queries --help

Usage: build/env/bin/hue close_queries [options] <age_in_days>  (default is 7)

Closes the non running queries older than 7 days. If <all> is specified, close the ones of any types.   To run them while using Cloudera Manager, be sure to export these two environment variables:

export HUE_CONF_DIR="/var/run/cloudera-scm-agent/process/`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'`"

./build/env/bin/hue close_queries 0
Closing (all=False) HiveServer2 queries older than 0 days...
1 queries closed.

./build/env/bin/hue close_sessions 0 hive
Closing (all=False) HiveServer2 sessions older than 0 days...
1 sessions closed.

You can then add this commands into a crontab and expire the queries older than N days.

Note

When using Kerberos you also need:

export HIVE_CONF_DIR="/var/run/cloudera-scm-agent/process/`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'`/hive-conf"

A cleaner solution comes with HIVE-5799 (available in Hive 0.14 or C5.2). Like Impala, HiveServer2 can now automatically expires queries. So tweak hive-site.xml with:

<property>
  <name>hive.server2.session.check.interval</name>
  <value>3000</value>
  <description>The check interval for session/operation timeout, which can be disabled by setting to zero or negative value.</description>
</property>

<property>
  <name>hive.server2.idle.session.timeout</name>
  <value>0</value>
  <description>Session will be closed when it's not accessed for this duration, which can be disabled by setting to zero or negative value.</description>
</property>

<property>
  <name>hive.server2.idle.operation.timeout</name>
  <value>0</value>
  <description>Operation will be closed when it's not accessed for this duration of time, which can be disabled by setting to zero value. With positive value, it's checked for operations in terminal state only (FINISHED, CANCELED, CLOSED, ERROR). With negative value, it's checked for all of the operations regardless of state.</description>
</property>

Note

This is the recommended solution for Hive. User wishing to keep some result for longer can issue a CREATE TABLE AS SELECT … or export the results in Hue.

Sum-up

The query servers are becoming much more stable with these changes as their resources do not need to grow infinitely. One tradeoff though is that the user will lose his query results after a certain time. To make the experience better, several ideas are being explored, like automatically downloading N rows of the resultset and keeping them for longer.

As usual feel free to comment and send feedback on the hue-user list or @gethue!


Viewing all articles
Browse latest Browse all 171

Trending Articles