
Kerberos security and Sentry authorization for Solr Search App


This blog post details how to use Kerberos and Sentry in the Hue Search Application. If you only want to use Kerberos, just skip the paragraphs about Sentry.

 

Kerberos enables you to authenticate users in your Hadoop cluster. For example, it guarantees that it is really the user ‘bob’, and not ‘joe’, who is submitting a job, listing files or doing a search. The next step is configuring what the user can access; this is called authorization. Sentry is the secure way to define who can see, query or add data in the Solr collections/indexes. This is only possible because Kerberos guarantees the identity of the users performing the actions.

 

Hue comes with a set of collections and examples ready to be installed. However, with Kerberos, this requires a bit more than just one click.

First, make sure that you have a kerberized cluster (and in particular Solr Search for Hue) with Sentry configured.
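As a quick sanity check (the ‘systest’ principal used in the rest of this post is an assumption for your environment), verify that you can obtain a Kerberos ticket before running the solrctl commands below:

# obtain and list a Kerberos ticket for the user that will run the commands
kinit systest
klist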

Make sure you use the secure version of solrconfig.xml:

solrctl instancedir --generate foosecure
cp foosecure/conf/solrconfig.xml.secure solr_configs_twitter_demo/conf/solrconfig.xml
solrctl instancedir --update twitter_demo solr_configs_twitter_demo
solrctl collection --reload twitter_demo

Then, create the collection. The command should work as-is if you have the proper Solr environment variables.

cd $HUE_HOME/apps/search/examples/bin

./create_collections.sh

 

You should then see the collections:

solrctl instancedir --list
jobs_demo
log_analytics_demo
twitter_demo
yelp_demo

 

The next step is to create the Solr cores. To keep it simple, we will just use one collection, the twitter demo. Create the core with:

sudo -u systest solrctl collection --create twitter_demo -s 1

If you are using Sentry, you will probably see this error the first time:

Error: A call to SolrCloud WEB APIs failed: HTTP/1.1 401 Unauthorized
Server: Apache-Coyote/1.1
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=; Version=1; Path=/; Expires=Thu, 01-Jan-1970 00:00:00 GMT; HttpOnly
Content-Type: text/html;charset=utf-8
Content-Length: 997
Date: Thu, 11 Sep 2014 16:32:17 GMT

HTTP/1.1 401 Unauthorized
Server: Apache-Coyote/1.1
WWW-Authenticate: Negotiate YGwGCSqGSIb3EgECAgIAb10wW6ADAgEFoQMCAQ+iTzBNoAMCARCiRgRE62zOpPwr+KLoFKdUX2I6FtbN0DyxSA5a8n4BSZRJMTf413TEXzJbVh3/G7jWiMasIIzeETrd0Bv8suBsuKS/HdqG068=
Set-Cookie: hadoop.auth="u=systest&p=systest@ENT.CLOUDERA.COM&t=kerberos&e=1410489137684&s=qAkcQr4ZPBkn5Ewg/Ugz/CqgLkU="; Version=1; Path=/; Expires=Fri, 12-Sep-2014 02:32:17 GMT; HttpOnly
Content-Type: application/xml;charset=UTF-8
Transfer-Encoding: chunked
Date: Thu, 11 Sep 2014 16:32:17 GMT

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">
401</int>
<int name="QTime">
18</int>
</lst>
<lst name="error">
<str name="msg">
org.apache.sentry.binding.solr.authz.SentrySolrAuthorizationException: User systest does not have privileges for admin</str>
<int name="code">
401</int>
</lst>

 

This is because, by default, our ‘systest’ user does not have the permissions to create the core. ‘systest’ belongs to the ‘admin’ Unix/LDAP group, so we need to define a Sentry role that carries the ‘admin’ privilege and assign it to the ‘admin’ group. Our ‘systest’ user then belongs to a group that contains this role.

 

In order to do this, we need to update:

/user/solr/sentry/sentry-provider.ini

 

with something similar to this:

[groups]
admin = admin_role
analyst = query_role

[roles]
admin_role = collection=admin->action=*, collection=twitter_demo->action=*
query_role = collection=twitter_demo->action=query

 

‘systest’ belongs to the LDAP ‘admin’ group, which is assigned the ‘admin_role’ role carrying the ‘admin’ privilege. Regular analyst users belong to the LDAP ‘analyst’ group, which is assigned the Sentry ‘query_role’ role with its ‘query’ permission on the twitter collection. Here is the list of available permissions.
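Since this sentry-provider.ini lives in HDFS, one way to edit it is to pull it down, modify it locally and push it back. This is just a sketch: run it as a user with a valid Kerberos ticket and write access to that path.

hadoop fs -get /user/solr/sentry/sentry-provider.ini
# edit sentry-provider.ini locally, then overwrite the copy in HDFS
hadoop fs -put -f sentry-provider.ini /user/solr/sentry/sentry-provider.ini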

 

Note

The upcoming Hue 3.7 has a new Sentry App that lets you forget about sentry-provider.ini and enables you to configure the above in a Web UI. Moreover, Solr Sentry support will be integrated in Hue as soon as its API becomes available.

 

Then it is time to create the core and upload some data. Update the post.sh  command to make it work with Kerberos.

Replace ‘curl’ by:

curl --negotiate -u: foo:bar

 

and make sure that you use the real hostname in the URL:

URL=http://hue-c5-sentry.ent.cloudera.com:8983/solr

 

A quick way to test it is to run the indexing command:

sudo -u systest curl --negotiate -u: foo:bar http://hue-c5-sentry.ent.cloudera.com:8983/solr/twitter_demo/update --data-binary @../collections/solr_configs_twitter_demo/index_data.csv -H 'Content-type:text/csv'

 

And that’s it! The collection and its data will appear in Solr and Hue. Depending on his group, the user will or will not be able to modify the collection.

hue-collections

 

Your organization can now leverage the exploration capacity of the Search app with fine grained security! Next versions will come up with field level security and a nice UI for configuring it (no more sentry-provider.ini :).

 

As usual feel free to comment on the hue-user list or @gethue!


Hive and Impala queries life cycle


Hue is used quite a lot for its Hive and Impala SQL editors:

impala-ui

But what happens to the query results? How long are they kept? Why do they disappear sometimes? Why are some Impala queries still “in flight” even if they are completed?   Each query uses some resources in Impala or HiveServer2. When users submit a lot of queries, these resources add up and crash the servers if nothing is done. A lot of improvements came in Hue 3 and CDH4.6, as Hue started to automatically close all the metadata queries. Here are the latest settings that you can tweak:

Impala

Hue tries to close the query when the user navigates away from the result page (as queries are generally fast, it is OK to close them quickly). However, if the user never comes back to check the result of the query or never closes the page, the query is going to stay.   Starting in Hue 3.7 and C5.2 (with HUE-2251), Impala is going to automatically expire queries idle for more than 10 minutes with the query_timeout_s property.

[impala]
  # If QUERY_TIMEOUT_S > 0, the query will be timed out (i.e. cancelled) if Impala does not do any work (compute or send back results) for that query within QUERY_TIMEOUT_S seconds.
  query_timeout_s=600

Before this version, the only workaround to close all the queries is to restart Hue (or Impala).

Hive

Hue never closes the Hive queries by default (as some queries can take hours of processing time). Also, if your query volume is low (e.g. less than a few hundred a day) and you restart HiveServer2 every week, you are probably not affected. To get the same behavior as Impala (and close the query when the user leaves the page), switch this on in the hue.ini:

[beeswax]
  # Hue will try to close the Hive query when the user leaves the editor page.
  # This will free all the query resources in HiveServer2, but also make its results inaccessible.
  close_queries=true

Starting in CDH5 and CDH4.6 (with HiveServer2), the close_queries and close_sessions commands were added to Hue.

build/env/bin/hue close_queries --help

Usage: build/env/bin/hue close_queries [options] <age_in_days>  (default is 7)

Closes the non-running queries older than <age_in_days> days (7 by default). If <all> is specified, it closes queries of any type.

To run these commands while using Cloudera Manager, be sure to first export the HUE_CONF_DIR environment variable:

export HUE_CONF_DIR="/var/run/cloudera-scm-agent/process/`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'`"

./build/env/bin/hue close_queries 0
Closing (all=False) HiveServer2 queries older than 0 days...
1 queries closed.

./build/env/bin/hue close_sessions 0 hive
Closing (all=False) HiveServer2 sessions older than 0 days...
1 sessions closed.

You can then add these commands into a crontab and expire the queries older than N days.
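For example, a crontab entry along these lines could run the cleanup nightly (the Hue install path and the 7-day age are assumptions; when running under Cloudera Manager, wrap the commands in a small script that first exports HUE_CONF_DIR as shown above):

# close HiveServer2 queries and sessions older than 7 days, every night at 1am
0 1 * * * /usr/lib/hue/build/env/bin/hue close_queries 7 && /usr/lib/hue/build/env/bin/hue close_sessions 7 hive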

Note

When using Kerberos you also need:

export HIVE_CONF_DIR="/var/run/cloudera-scm-agent/process/`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'`/hive-conf"

A cleaner solution comes with HIVE-5799 (available in Hive 0.14 or C5.2). Like Impala, HiveServer2 can now automatically expire queries. So tweak hive-site.xml with:

<property>
  <name>hive.server2.session.check.interval</name>
  <value>3000</value>
  <description>The check interval for session/operation timeout, which can be disabled by setting to zero or negative value.</description>
</property>

<property>
  <name>hive.server2.idle.session.timeout</name>
  <value>0</value>
  <description>Session will be closed when it's not accessed for this duration, which can be disabled by setting to zero or negative value.</description>
</property>

<property>
  <name>hive.server2.idle.operation.timeout</name>
  <value>0</value>
  <description>Operation will be closed when it's not accessed for this duration of time, which can be disabled by setting to zero value. With positive value, it's checked for operations in terminal state only (FINISHED, CANCELED, CLOSED, ERROR). With negative value, it's checked for all of the operations regardless of state.</description>
</property>

Note

This is the recommended solution for Hive. Users wishing to keep some results longer can issue a CREATE TABLE AS SELECT … or export the results in Hue.

Sum-up

The query servers are becoming much more stable with these changes as their resources do not need to grow infinitely. One tradeoff though is that the user will lose his query results after a certain time. To make the experience better, several ideas are being explored, like automatically downloading N rows of the resultset and keeping them for longer.

As usual feel free to comment and send feedback on the hue-user list or @gethue!

SSL Encryption between Hue and Hive


This blog post was originally published on the MapR blog.

While big data security analytics promises to deliver great insights in the battle against cyber threats, the concept and the tools are still maturing. In this blog, I’ll simplify the topic of adopting security in Hadoop by showing you how to encrypt traffic between Hue and Hive.

hue-ssl-hive
Hue can communicate with Hive over a channel encrypted with SSL. Let’s take a look at the interface and the handshake mechanism first before trying to secure it.

The basic high-level concept behind the SSL protocol handshake mechanism is shown in the diagram below, where Hue is the SSL client and Hive is the SSL server:

huehive_img1

a. SSL Client (Hue) opens a socket connection and connects to Hive. This is then encapsulated with a wrapper that encrypts and decrypts the data going over the socket with SSL.

b. Once Hive receives an incoming connection, it shows a certificate to Hue (which is like a public key saying it can be trusted).

c. Hue can then verify the authenticity of this certificate with a trusted certificate-issuing authority, or it can be skipped for self-signed certificates.

d. Hue encrypts messages using this public key and sends data to Hive.

e. Hive decrypts the message with its private key.

 

The public/private keys always come in pairs and are used to encrypt/decrypt messages. These can be generated with the keytool command-line utility (shipped with the JDK and understood by the Java keystore library), or with the OpenSSL utility, whose output is understood directly by the Python SSL library.

 

The Hive side uses Java keystore certificates and public/private keys, while Hue’s Python code calls the SSL library implemented in C. Much of the complication arises from not having one uniform format understood by all languages (Python, Java and C). For example, the SSL C library on the client side expects a private key from the SSL server, which is not a requirement in a pure Java SSL client implementation. Using the Java keytool command, you cannot export a private key directly into the PEM format understood by Python; you need an intermediate PKCS12 format.

 

Let’s step through the procedure to create certificates and keys:

1) Generate keystore.jks containing private key (used by Hive to decrypt messages received from Hue over SSL) and public certificate (used by Hue to encrypt messages over SSL)

keytool -genkeypair -alias certificatekey -keyalg RSA -validity 7 -keystore keystore.jks

2) Generate certificate from keystore

keytool -export -alias certificatekey -keystore keystore.jks -rfc -file cert.pem

3) Export the private key and certificate with OpenSSL for Hue’s SSL library to ingest.

Exporting the private key from a jks file (Java keystore) needs an intermediate PKCS12:

a. Import the keystore from JKS to PKCS12

keytool -importkeystore -srckeystore keystore.jks -destkeystore keystore.p12 \
  -srcstoretype JKS -deststoretype PKCS12 -srcstorepass mysecret -deststorepass mysecret \
  -srcalias certificatekey -destalias certificatekey -srckeypass mykeypass -destkeypass mykeypass -noprompt

b. Convert pkcs12 to pem using OpenSSL

openssl pkcs12 -in keystore.p12 -out keystore.pem -passin pass:mysecret -passout pass:mysecret

c. Strip the pass phrase so Python doesn’t prompt for password while connecting to Hive

openssl rsa -in keystore.pem -out hue_private_keystore.pem

 

Then the following needs to be set up in Hue’s configuration file hue.ini, under the [beeswax] section:

  [[ssl]]
    # SSL communication enabled for this server.
    enabled=true

    # Path to the private key file.
    key=/path/to/hue_private_keystore.pem

    # Path to the public certificate file.
    cert=/path/to/cert.pem

    # Choose whether Hue should validate certificates received from the server.
    validate=false

 

Then make sure no custom authentication mechanism is turned on and configure your hive-site.xml with the following properties on Hive 0.13:

 <property>
   <name>hive.server2.use.SSL</name>
   <value>true</value>
 </property>

 <property>
   <name>hive.server2.keystore.path</name>
   <value>/path/to/keystore.jks</value>
 </property>

 <property>
   <name>hive.server2.keystore.password</name>
   <value>mysecret</value>
 </property>
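As a quick sanity check (a generic verification, not from the original post; the hostname and port below are assumptions matching a default HiveServer2 setup), you can confirm that HiveServer2 now presents its certificate during the TLS handshake:

# prints the certificate chain served by HiveServer2 (assumed host and port)
openssl s_client -connect hiveserver.ent.com:10000 -showcerts < /dev/null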

 

Note

On Hive 0.12, the property is hive.server2.enable.SSL instead of hive.server2.use.SSL.

 

That’s it—you’re done!

As usual feel free to comment and send feedback on the hue-user list or @gethue!

 

Suhas Satish

Hadoop Ecosystem Software Developer, MapR

Suhas Satish is a Hadoop ecosystem software developer at MapR Technologies and has contributed to Apache Pig, Hue, Hive, Flume and Sqoop projects. Suhas has an MS in Computer Engineering from North Carolina State University and a B.Tech in Electronics & Communications Engineering from National Institute of Technology Karnataka, Surathkal, India.

How to configure Hue for your Hadoop cluster


Hue is a lightweight Web server that lets you use Hadoop directly from your browser. Hue is just a ‘view on top of any Hadoop distribution’ and can be installed on any machine. There are multiple ways (cf. the ‘Download’ section of gethue.com) to install Hue. The next step is to configure Hue to point to your Hadoop cluster. By default, Hue assumes a local cluster (i.e. there is only one machine) is present. In order to interact with a real cluster, Hue needs to know on which hosts the Hadoop services are distributed.

hue-ecosystem

Hue’s main configuration happens in a hue.ini file. It lists a lot of options, but essentially the addresses and ports of HDFS, YARN, Oozie, Hive… Depending on the distribution you installed, the ini file is located at:

  • CDH package: /etc/hue/conf/hue.ini
  • A tarball release: /usr/share/desktop/conf/hue.ini
  • Development version: desktop/conf/pseudo-distributed.ini
  • Cloudera Manager: CM generates all the hue.ini for you, so no hassle ;) /var/run/cloudera-scm-agent/process/`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'`/hue.ini

At any time, you can see the path to the hue.ini and its values on the /desktop/dump_config page. Then, for each Hadoop service, the hue.ini contains a section that needs to be updated with the correct hostnames and ports. In some cases, as explained in the ‘how to configure Hadoop for Hue’ documentation, the API of these services needs to be turned on and Hue set as a proxy user. Here is an example of the Hive section in the ini file:

[beeswax]

  # Host where HiveServer2 is running.
  hive_server_host=localhost

To point to another server, just replace the host value with ‘hiveserver.ent.com’:

[beeswax]

  # Host where HiveServer2 is running.
  hive_server_host=hiveserver.ent.com

Note: Any line starting with a # is considered a comment and is not used.

Note: The list of misconfigured services is shown on the /about/admin_wizard page.

Note: After each change in the ini file, Hue should be restarted to pick it up.

Here are the main sections that you will need to update in order to have each service accessible in Hue:

HDFS

This is required for listing or creating files. Replace localhost with the real address of the NameNode (usually http://localhost:50070).

[hadoop]

  [[hdfs_clusters]]

    [[[default]]]

      # Enter the filesystem uri
      fs_defaultfs=hdfs://localhost:8020

      # Use WebHdfs/HttpFs as the communication mechanism.
      # Domain should be the NameNode or HttpFs host.
      webhdfs_url=http://localhost:50070/webhdfs/v1
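As mentioned above, the Hadoop services also need to let Hue impersonate its users. Here is a sketch of the typical proxy user entries in core-site.xml (the wildcard values are an assumption; restrict them to your Hue host and groups as needed):

<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>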

YARN

The Resource Manager often runs on http://localhost:8088 by default. The ProxyServer and Job History Server also need to be specified. Job Browser will then let you list and kill running applications and get their logs.

[hadoop]

  [[yarn_clusters]]

    [[[default]]]

      # Enter the host on which you are running the ResourceManager
      resourcemanager_host=localhost     

      # Whether to submit jobs to this cluster
      submit_to=True

      # URL of the ResourceManager API
      resourcemanager_api_url=http://localhost:8088

      # URL of the ProxyServer API
      proxy_api_url=http://localhost:8088

      # URL of the HistoryServer API
      history_server_api_url=http://localhost:19888

Hive

Here we need a running HiveServer2 in order to send SQL queries.

[beeswax]

  # Host where HiveServer2 is running.
  hive_server_host=localhost

Impala

We need to specify the address of one of the Impala daemons (impalad) for interactive SQL in the Impala app.

[impala]

  # Host of the Impala Server (one of the Impalad)
  server_host=localhost

Solr Search

We just need to specify the address of a SolrCloud (or non-Cloud Solr) server, and the interactive dashboard capabilities are unleashed!

[search]

  # URL of the Solr Server
  solr_url=http://localhost:8983/solr/

Oozie

An Oozie server should be up and running before submitting or monitoring workflows.

[liboozie]

  # The URL where the Oozie service runs on.
 oozie_url=http://localhost:11000/oozie

Pig

The Pig Editor requires Oozie to be setup with its sharelib.

HBase

The HBase app works with an HBase Thrift Server version 1. It lets you browse, query and edit HBase tables.

[hbase]

  # Comma-separated list of HBase Thrift server 1 for clusters in the format of '(name|host:port)'.
 hbase_clusters=(Cluster|localhost:9090)

Sentry

Hue just needs to point to the machine with the Sentry server running.

[libsentry]

  # Hostname or IP of server.
  hostname=localhost

Note: To override a value in Cloudera Manager, you need to enter each of the above mini sections into the Hue Safety Valve: Hue Service → Configuration → Service-Wide → Advanced → Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini

And that’s it! Now Hue will let you do Big Data directly from your browser without touching the command line! You can then follow up with some tutorials. As usual, feel free to comment and send feedback on the hue-user list or @gethue!

Running an Oozie workflow and getting Split class org.apache.oozie.action.hadoop.OozieLauncherInputFormat$EmptySplit not found


If after installing your cluster and submitting some Oozie jobs you are seeing this type of error:

  Error: java.io.IOException: Split class org.apache.oozie.action.hadoop.OozieLauncherInputFormat$EmptySplit not found
  at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:363)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:423)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
  Caused by: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.OozieLauncherInputFormat$EmptySplit not found
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1953)
  at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:361)
  ... 7 more

This is because the Oozie Share Lib is not installed. Here is one command to install the YARN one:

sudo -u oozie /usr/lib/oozie/bin/oozie-setup.sh sharelib create -fs hdfs://localhost:8020 -locallib /usr/lib/oozie/oozie-sharelib-yarn.tar.gz

  setting JAVA_LIBRARY_PATH="$JAVA_LIBRARY_PATH:/usr/lib/hadoop/lib/native"
  setting OOZIE_DATA=/var/lib/oozie
  setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat
  setting CATALINA_TMPDIR=/var/lib/oozie
  setting CATALINA_PID=/var/run/oozie/oozie.pid
  setting CATALINA_BASE=/var/lib/oozie/tomcat-deployment
  setting OOZIE_HTTPS_PORT=11443
  setting OOZIE_HTTPS_KEYSTORE_PASS=password
  setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.port=${OOZIE_HTTPS_PORT}"
  setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.keystore.pass=${OOZIE_HTTPS_KEYSTORE_PASS}"
  setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"
  setting OOZIE_CONFIG=/etc/oozie/conf
  setting OOZIE_LOG=/var/log/oozie
  setting JAVA_LIBRARY_PATH="$JAVA_LIBRARY_PATH:/usr/lib/hadoop/lib/native"
  setting OOZIE_DATA=/var/lib/oozie
  setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat
  setting CATALINA_TMPDIR=/var/lib/oozie
  setting CATALINA_PID=/var/run/oozie/oozie.pid
  setting CATALINA_BASE=/var/lib/oozie/tomcat-deployment
  setting OOZIE_HTTPS_PORT=11443
  setting OOZIE_HTTPS_KEYSTORE_PASS=password
  setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.port=${OOZIE_HTTPS_PORT}"
  setting CATALINA_OPTS="$CATALINA_OPTS -Doozie.https.keystore.pass=${OOZIE_HTTPS_KEYSTORE_PASS}"
  setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"
  setting OOZIE_CONFIG=/etc/oozie/conf
  setting OOZIE_LOG=/var/log/oozie
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/oozie/libserver/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/oozie/libserver/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
the destination path for sharelib is: /user/oozie/share/lib/lib_20141003111250

And how to check it:

sudo -u oozie oozie admin -shareliblist -oozie http://localhost:11000/oozie

[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
hive2
pig

 

Note
If you have upgraded your cluster, use ‘upgrade’ instead of ‘create’:

sudo -u oozie /usr/lib/oozie/bin/oozie-setup.sh sharelib upgrade -fs hdfs://localhost:8020 -locallib /usr/lib/oozie/oozie-sharelib-yarn.tar.gz

 

And now restart Oozie:

sudo service oozie restart
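You can also double-check that the share lib files landed in HDFS (the path comes from the installer output above):

sudo -u oozie hadoop fs -ls /user/oozie/share/lib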

 

That’s it, you are now ready to submit workflows!

 

As usual feel free to comment and send feedback on the hue-user list or @gethue!

 

Apache Sentry made easy with the new Hue Security App


Hi Hadoop Sheriffs,

In order to support the growth of the Apache Sentry project and make it easier to secure your cluster, a new app was added to Hue. Sentry privileges determine which Hive / Impala databases and tables a user can see or modify. The Security App lets you create/edit/delete roles and privileges directly from your browser (there is no sentry-provider.ini file to edit anymore).

Here is a video showing how the app works:

Main features:

  • Bulk edit roles and privileges
  • Visualize/edit roles and privileges on a database tree
  • WITH GRANT OPTION support
  • Impersonate a user to see which databases and tables he can see

hue-sentry

To have Hue point to a Sentry service on another host, modify these hue.ini properties:

[libsentry]
  # Hostname or IP of server.
  hostname=localhost

  # Port the sentry service is running on.
  port=8038

  # Sentry configuration directory, where sentry-site.xml is located.
  sentry_conf_dir=/etc/sentry/conf

Hue will also automatically pick up the server name of HiveServer2 from the sentry-site.xml file in /etc/hive/conf.
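For reference, the entry it reads from that file is typically of this form (a sketch; ‘server1’ is an assumed server name in your Sentry setup):

<property>
  <name>hive.sentry.server</name>
  <value>server1</value>
</property>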

 

And that’s it, you can now specify who can see/do what directly in a Web UI! The app sits on top of the standard Sentry API and so is fully compatible with Sentry. Next planned features will bring Solr collection and HBase privilege management, as well as more bulk operations and a tighter integration with HDFS.

As usual, feel free to continue to send us questions and feedback on the hue-user list or @gethue!

 

Notes

We are using the latest CDH5.2 with MIT Kerberos and Sentry configured. The app also works in non-secure mode.

Our users are:

  • hive (admin) belongs to the hive group
  • user1_1 belongs to the user_group1 group
  • user2_1 belongs to the user_group2 group

We synced the Unix users/groups into Hue with these commands:

export HUE_CONF_DIR="/var/run/cloudera-scm-agent/process/`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'`"

build/env/bin/hue useradmin_sync_with_unix --min-uid=1000

If you are using the package version and have the CDH repository registered, install Sentry with:

sudo apt-get install sentry

If using Kerberos, make sure ‘hue’ is allowed to connect to Sentry in /etc/sentry/conf/sentry-store-site.xml:

<property>
    <name>sentry.service.allow.connect</name>
    <value>impala,hive,solr,hue</value>
</property>

For testing purposes, here is how to create the initial Sentry database:

sentry --command schema-tool -initSchema -conffile /etc/sentry/conf/sentry-store-site.xml -dbType derby

And start the service:

sentry --command service -conffile /etc/sentry/conf/sentry-store-site.xml

File Browser Enhancements: HDFS operations made easy


A lot of exciting work has been done on File Browser to provide additional features and the best user experience possible. Take a look at the updates below and start using them today!

Drag and Drop Uploading
Streamlining file uploads was something that has been on our to-do list. You can now drag files from your desktop into the File Browser. The files will then be uploaded to the current directory you are viewing in File Browser.

Quick Links
We received feedback requesting a history of paths visited in File Browser. This sounded good to us so Hue now offers a History menu that keeps a running tab of the paths you visited. Currently we show the previous 10 path changes with latest displayed first.

Cleaner Actions Bar
The actions bar was getting cluttered so we decided to streamline the actions bar by providing an Actions drop-down menu that houses the commands previously listed across the top of File Browser. As a result there is less clutter and actions are grouped in a logical menu. We have kept high-use actions exposed as they are today as well as included them in the Actions menu.

Actions Context Menu
In addition to the Actions menu, Hue now offers an Actions context menu to allow for quicker selection of actions to take on an item in the File Browser. Take advantage of this feature by context clicking an item in File Browser (right-clicking with a mouse or two-finger tapping on a touch pad).

Improved Copy/Move Modals
The modal windows spawned from Copy and Move actions have been redesigned. A key feature is the new directory tree that allows you to not only select the directory to move or copy your files but also to see sub directories and create new directories directly from the tree.

Edit Permissions, Groups, Users Inline
As a superuser you may now edit the permissions, group and user attributes of a selected item in File Browser. This is in addition to performing these operations from the Actions menu. Non superusers can now edit permissions inline for the items they own.

Simplified Pagination
We took a look at pagination in File Browser and decided it was time to simplify and make it more concise at the same time. The result is a cleaner look and feel that provides enhanced readability and ease of use. In addition you can now show up to 1000 items on the page.

User Experience Improvements

  • Improved breadcrumb size for readability and an increased click target
  • File Browser now provides users a note when viewing compressed data
  • File viewing no longer supports scrolling for file types that do not support pagination

 

hue-fb

 

We hope these new features allow you to be more productive and result in a better Hue experience. Let us know on the hue-user list or @gethue if there is something you would like to see!

 

Search App Enhancements: Explore even more Data


Hi Big Data Explorers,

Hue Search dashboards introduced new ways to quickly explore a lot of data by drag & dropping some graphical widgets and leveraging Solr capabilities. The application received a lot of feedback and has been greatly improved.

Here is a short video detailing all the new stuff:

 

You can see a quick summary of the main novelties below:

hue-search-v2.1

 

Top Bar

It was reorganized in order to separate the widgets displaying records from those displaying facets.

hue-bar

 

Three new widgets

  • Heatmap
  • Tree
  • Marker Map

Based on the Pivot facet feature of Solr, the Heatmap and Tree let you explore your data in two or n dimensions. For example, you can plot the distribution of OS by browser by country, or IP by city… They are both clickable, meaning you can filter the results by selecting certain values.

hue-heatmap

hue-tree-1

 

The Marker Map is great for automatically plotting the result rows on a Leaflet map.

hue-marker

Field analysis

Index fields can now have their terms and stats retrieved in a single click, accessible from the list of fields in the Grid Layout. Values can also be included or excluded, filtered by prefix, or faceted by another field.

hue-analysist-terms  hue-analysis-stats

Exclude facets

Previously it was only possible to include selected facets. Now a little minus ‘-’ appears on hover over each value and lets you filter out some values. These can be combined to filter out large values that make your graph difficult to read.

hue-exclude-0 hue-exclude-1

Note

The indexer is now smarter and will pick up the right ZooKeeper server: the search examples are installable in one click (still a few more clicks with Kerberos/Sentry).

 

What’s next?

New facets like ‘this value & up!’, map plotting of new types of data, making it easier to create and index data, and autocomplete (and ideally, one day, Analytics Facets) are in the pipeline!

 

We hope that you like the latest additions to the Search App. Feel free to continue to send us questions and feedback to the hue-user list or @gethue!

 


Improved Oozie Dashboard: Bulk manipulate your Jobs


Hue Oozie Dashboard just got a few improvements in order to make Oozie job management less tedious! Here is a video demo that sums them up:

Main new Dashboard features

  • Faster page display
  • Bulk suspend/kill/resume jobs
  • Bulk rerun failed coordinator actions
  • New Metrics section

hue-oozie

 

More is on the roadmap!

For example, it is planned to have an even quicker rerun of coordinator actions, a brand new & simple Workflow Editor and a monitoring/visualization of the running/past workflows/coordinators/bundles.

 

As usual, feel free to send your wish list and feedback to the hue-user list or @gethue!

Hue 3.7 with Sentry App and new Search widgets is out!


Hi Big Data Surfers,

The Hue Team is glad to release Hue 3.7 and its new Sentry App and improved Search App!

A tarball is available as well as documentation and release notes. Packages will be released during Hadoop World next week. Despite shipping a brand new app, this release is particularly feature-packed. This might be due to all the good feedback and requests that we are receiving (and maybe an additional amazing team retreat).

 

Here is a list of the main improvements:

Security

hue-sentry

  • New Sentry App
  • Bulk edit roles and privileges
  • Visualize/edit roles and privileges on a database tree
  • WITH GRANT OPTION support
  • Impersonate a user to see which databases and tables he can see
  • More details here…

 

Search

hue-search-v2.1

 

Oozie

hue-oozie

  • Bulk suspend/kill/resume actions on dashboards
  • Faster dashboards
  • Rerun failed coordinator instances in bulk
  • More details here…

 

Job Browser

Kill application button for YARN

 

File Browser

hue-fb

 

HBase

Kerberos support. Next step will be impersonation!

 

Indexer

Picks up the configured ZooKeeper and gives a hint if pointing to the wrong Solr. Useful for installing the examples in one click.

 

Hive / Impala

hue-impala-charts

 

SDK

We are also trying to make the Hue project easier to develop:

 

 

What’s next?

Next planned features will bring a new and elegant Oozie Workflow Editor, faster performance and High Availability (HA), a surprise app, a simpler Spark App, more integration with Sentry and Search, and tons of polishing and ironing.

As usual, feel free to comment and send feedback on the hue-user list or @gethue!

Team Retreat in Tenerife


¡Hola!

The Hue project recently got its biggest release. It was once again time to celebrate and also prepare the next big thing :)

For the first time, a European location was picked: Tenerife, in the Canary Islands! Similarly to the previous retreat in Hawaii, the team could enjoy a great island with a volcano, warm water, sun, seafood, tapas, wind & kitesurfing.

At the end of the week, the team was recharged, the core of the new Oozie App for creating workflows was done, and much more is under way!

Hue Team

 

Historical village

Ocean Front Villa

Delicious tapas & seafood

Some action

Volcano & Nature

Solr Lucene Revolution DC 14 Presentation: Interactively Search and Visualize Your Big Data


Interactively Search and Visualize Your Big Data
Presented by Romain Rigaux, Cloudera

Open up your user base to the data! Contrary to programming and SQL, almost everybody knows how to search. This talk describes through an interactive demo based on open source Hue how users can graphically search their data in Hadoop. The underlying technical details of the application and its interaction with Apache Solr will be clarified.

The session will detail how to get started with data indexing in just a few clicks as well as explore several data analysis scenarios. Through a web browser, attendees will be shown how to explore and visualize data for quick answers. The new search dashboard in Hue, with its draggable charts and dynamic interface, lets any non-technical user look for documents or patterns.

Attendees of this talk will learn how to get started with interactive search visualization in their Hadoop cluster.

 

Some photos

 

photo 4

taxi-demo

photo 1

photo 3

Updated SAML 2.0 Support


Hue has been shipping SAML 2.0 authentication for quite some time. SAML 2.0 is an alternative to LDAP which lets you provide single sign on (SSO) in your company so that users can use the same login/password in all the systems. Unfortunately our support for SAML 2.0 was limited to a few small use cases.

In the upcoming Hue 3.8.0 / CDH 5.3.0, we now support all of the main SAML 2.0 web profile features. In addition to allowing for single login, Hue now lets you perform:

  • Single-Logout
  • Signed requests and responses
  • Use an Alternative NameID

Along the way we fixed a number of bugs in Hue (HUE-2458) and contributed back a number of fixes to the awesome Python libraries PySAML2 and djangosaml2. With these changes, Hue is now able to use these upstream packages instead of our old fork of these repositories. Below is an updated version of the SAML 2.0 guide.

Have any questions? Feel free to contact us on hue-user or @gethue!


The Basics

In SAML 2.0, there are two basic components: the Service Provider (SP) and the Identity Provider (IdP). The typical flow from SP to IdP is illustrated in the following image.

 


 

SAML architecture from http://en.wikipedia.org/wiki/SAML_2.0.

Hue acts as a service provider with an assertion consumer service (ACS). It communicates with the IdP to authenticate users. Hue also provides a couple of URLs that enable communication with the IdP:

  • “/saml2/metadata”
  • “/saml2/acs”

The IdP will contact the metadata URL for information on the SP. For example, the ACS URL is described in metadata. The ACS URL is the consumer of assertions from the IdP. The IdP will redirect users to the ACS URL once it has authenticated them.

Users

When a user logs into Hue through the SAML backend, a new user is created in Hue if it does not already exist. This logic is almost the same as the LdapBackend’s. It is also configurable via the create_users_on_login parameter.

Demo

The following is a demo of how to setup Hue to communicate via SAML with a Shibboleth IdP.

Environment

This demo is performed on CentOS 6.4 and assumes the following projects have been installed and configured: Shibboleth IdP and OpenDS.

Shibboleth IdP is installed to “/opt/shibboleth-idp” and has the following custom configurations:

  • Release the UID attribute with assertions.
  • Available over SSL on port 8443.
  • Provide authentication via LDAP through OpenDS.
  • Connect to a relying party that contains metadata about the SP. In this case, the relying party is Hue and its metadata URL is “/saml2/metadata”.
  • Use the UsernamePassword handler. It provides very obvious feedback that all components have been configured appropriately.
  • Available to all IPs.

OpenDS was installed and 2000 users were automatically generated. Then, a user “test” was added with the password “password”.

Preparing Hue

The libraries that support SAML in Hue must be installed:

build/env/bin/pip install djangosaml2

The above command will also install:

  • Paste
  • WebOb
  • argparse
  • cffi
  • cryptography
  • decorator
  • pyOpenSSL
  • pycparser
  • pycrypto
  • pysaml2
  • python-dateutil
  • python-memcached
  • pytz
  • repoze.who
  • requests
  • six
  • wsgiref
  • zope.interface

Note: The SAML libraries depend on xmlsec1 being available on the machine. This will need to be installed and readily available for Hue to use.
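On the CentOS 6.4 machine used in this demo, installing it could look like the following (the package names are an assumption and may require an extra repository such as EPEL):

sudo yum install xmlsec1 xmlsec1-openssl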

Configuring Hue

Hue must be configured as a SP and use the SAML authentication backend.

1. Hue as a Service Provider

In the SAML 2.0 architecture, Hue acts as the SP. As such, it must be configured to communicate with the IdP in the hue.ini:

[libsaml]
xmlsec_binary=/opt/local/bin/xmlsec1
metadata_file=/tmp/metadata.xml
key_file=/tmp/key.pem
cert_file=/tmp/cert.pem

The key_file and cert_file can be copied from the Shibboleth IdP credentials directory (“/opt/shibboleth-idp/credentials/”). The files idp.key and idp.crt correspond to key_file and cert_file, respectively. These files should already be in PEM format, so for the purposes of this demo, they are renamed to key.pem and cert.pem.

The metadata_file is set to the file containing the IdP metadata (“/tmp/metadata.xml”). This can be created from the XML response of “http://<SHIBBOLETH HOST>:8443/idp/shibboleth/”. The XML itself may require some massaging. For example, in some fields, the port 8443 is missing from certain URLs.
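A sketch of fetching that metadata with curl (since port 8443 is served over SSL with a self-signed certificate in this demo, -k skips certificate validation; the host is a placeholder):

curl -k "https://<SHIBBOLETH HOST>:8443/idp/shibboleth/" -o /tmp/metadata.xml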

The table below describes the available parameters for SAML in the hue.ini.

Parameter                Description
xmlsec_binary            Path to the xmlsec1 binary. This program should be executable by the user running Hue.
create_users_on_login    Create users received in the assertion response upon successful authentication and login.
required_attributes      Required attributes to ask for from the IdP.
optional_attributes      Optional attributes to ask for from the IdP.
metadata_file            IdP metadata in the form of a file. This is generally an XML file containing metadata that the Identity Provider generates.
key_file                 Private key to encrypt metadata with.
cert_file                Signed certificate to send along with encrypted metadata.
user_attribute_mapping   A mapping from attributes in the response from the IdP to Django user attributes.

Hue SAML configuration parameters.

2. SAML Backend for Logging-in

The SAML authentication backend must be used so that users can login and be created:

[desktop]
  [[auth]]
  backend=libsaml.backend.SAML2Backend

SAML and Hue in Action

Now that Hue has been setup to work with the SAML IdP, attempting to visit any page redirects to Shibboleth’s login screen:


Shibboleth login screen after attempting to access /about.

After logging in, Hue is readily available and visible!

Summary

Providing SSO support through SAML helps enterprises by enabling centralized authentication. From a user’s perspective, life is easier because it removes the burden of password management. After a user has logged in, they adhere to the same permissions and rules as other users.

Have any suggestions? Feel free to tell us what you think through hue-user or at @gethue.

Hadoop YARN: 1/1 local-dirs are bad: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers


If you are getting this error, free up some disk space!

1/1 local-dirs are bad: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers

 

yarn-rm-unhealty

Node Manager logs


yarn.server.nodemanager.DirectoryCollection: Directory /var/lib/hadoop-yarn/cache/yarn/nm-local-dir error, used space above threshold of 90.0%, removing from list of valid directories
2014-11-17 17:45:00,713 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /var/log/hadoop-yarn/containers error, used space above threshold of 90.0%, removing from list of valid directories
2014-11-17 17:45:00,713 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) failed: 1/1 local-dirs are bad: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir; 1/1 log-dirs are bad: /var/log/hadoop

 

Resource Manager logs

2014-11-17 16:57:07,301 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node localhost:34650 reported UNHEALTHY with details: 1/1 local-dirs are bad: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
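To see what is eating the space, check the reported directories (a generic check, not from the original post):

df -h /var/lib/hadoop-yarn/cache/yarn/nm-local-dir /var/log/hadoop-yarn/containers

If the disks are simply small, the 90% threshold itself can be raised in yarn-site.xml via the NodeManager disk health checker setting available in recent Hadoop 2 releases (95.0 below is just an example value):

<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>95.0</value>
</property>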

 

How to use HCatalog with Pig in a secured cluster


In Hue 3.0, we made the use of HCatalog in Pig scripts transparent. Today, we are going to detail how to run a Pig script with HCatalog on a secured cluster.

The process is still somewhat complicated; we will try to make it transparent to the user in HUE-2480.

As usual, if you have questions or feedback, feel free to contact the Hue community on hue-user or @gethue!

 

Pig script to execute

We are going to use this simple script, which displays the first records of one of the sample Hive tables:

-- Load table 'sample_07'
sample_07 = LOAD 'sample_07' USING org.apache.hcatalog.pig.HCatLoader();

out = LIMIT sample_07 15;

DUMP out;

 

Make sure that the Oozie Share Lib is installed

As usual, if it is missing, some jars won’t be found and you will get:

ERROR 1070: Could not resolve org.apache.hcatalog.pig.HCatLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Could not resolve org.apache.hcatalog.pig.HCatLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

 

Oozie Editor

Oozie lets you chain and schedule jobs together. This is a bit tricky. In the Pig action, make sure that you click on the ‘Advanced’ link and check the HCat credential. Upload the ‘hive-site.xml’ used by Hue and fill in the ‘Job XML’ field.

oozie-pig-hact-cred

In the workflow properties, make sure that these Oozie properties are set:

oozie.use.system.libpath true
oozie.action.sharelib.for.pig pig,hcatalog

pig-hcat-cred

That’s it!

 

Pig Editor

To make it work in the Pig Editor in secure mode, you will need HUE-2152 or Hue 3.8 / CDH5.4 (this is not needed if you are not using Kerberos).

Then just upload the hive-site.xml used by Hue and add it as a ‘File’ resource in the properties of the script. Contrary to the Hive action, the name must be ‘hive-site.xml’.

pig-hive-site

And that’s it!

pig-hcat

Appendix

Examples of XML workflow

<workflow-app name="pig-app-hue-script" xmlns="uri:oozie:workflow:0.4">
  <credentials>
    <credential name="hcat" type="hcat">
      <property>
        <name>hcat.metastore.uri</name>
        <value>thrift://hue-c5-sentry.ent.cloudera.com:9083</value>
      </property>
      <property>
        <name>hcat.metastore.principal</name>
        <value>hive/hue-c5-sentry.ent.cloudera.com@ENT.CLOUDERA.COM</value>
      </property>
    </credential>
  </credentials>
    <start to="pig"/>
    <action name="pig" cred="hcat">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>/user/hue/oozie/workspaces/_hive_-oozie-253-1418153366.31/script.pig</script>
            <file>/user/hue/oozie/workspaces/_hive_-oozie-242-1418149386.4/hive-site.xml#hive-site.xml</file>
        </pig>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Properties

Name 	Value
credentials 	{u'hcat': {'xml_name': u'hcat', 'properties': [('hcat.metastore.uri', u'thrift://hue-c5-sentry.ent.cloudera.com:9083'), ('hcat.metastore.principal', u'hive/hue-c5-sentry.ent.cloudera.com@ENT.CLOUDERA.COM')]}, u'hive2': {'xml_name': u'hive2', 'properties': [('hive2.jdbc.url', 'jdbc:hive2://hue-c5-sentry.ent.cloudera.com:10000/default'), ('hive2.server.principal', u'hive/hue-c5-sentry.ent.cloudera.com@ENT.CLOUDERA.COM')]}, u'hbase': {'xml_name': u'hbase', 'properties': []}}
hue-id-w 	253
jobTracker 	hue-c5-sentry.ent.cloudera.com:8032
mapreduce.job.user.name 	hive
nameNode 	hdfs://hue-c5-sentry.ent.cloudera.com:8020
oozie.action.sharelib.for.pig 	pig,hcatalog
oozie.use.system.libpath 	true
oozie.wf.application.path 	hdfs://hue-c5-sentry.ent.cloudera.com:8020/user/hue/oozie/workspaces/_hive_-oozie-253-1418153366.31
user.name 	hive

If you get the dreaded ‘ERROR 2245: Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader’ error, it could be that the hive-site.xml was not added, or that you need HUE-2152, which injects the HCat credential into the script.

ERROR 2245: Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader

  org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader
  at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1689)
  at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1409)
  at org.apache.pig.PigServer.parseAndBuild(PigServer.java:342)
  at org.apache.pig.PigServer.executeBatch(PigServer.java:367)
  at org.apache.pig.PigServer.executeBatch(PigServer.java:353)
  at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
  at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:769)
  at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
  at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
  at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
  at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
  at org.apache.pig.Main.run(Main.java:478)
  at org.apache.pig.PigRunner.run(PigRunner.java:49)
  at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:286)
  at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:226)
  at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:39)
  at org.apache.oozie.action.hadoop.PigMain.main(PigMain.java:74)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:227)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
  at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:370)
  at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:295)
  at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:181)
  at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:224)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
  Caused by: Failed to parse: Can not retrieve schema from loader org.apache.hcatalog.pig.HCatLoader@5e2e886d
  at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
  at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1676)
  ... 33 more
  Caused by: java.lang.RuntimeException: Can not retrieve schema from loader org.apache.hcatalog.pig.HCatLoader@5e2e886d
  at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:91)
  at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:853)
  at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
  at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
  at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
  at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
  at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
  at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
  ... 34 more
  Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2245: Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader
  at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:179)
  at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
  ... 41 more
  Caused by: java.io.IOException: java.lang.Exception: Could not instantiate a HiveMetaStoreClient connecting to server uri:[null]
  at org.apache.hcatalog.pig.PigHCatUtil.getTable(PigHCatUtil.java:205)
  at org.apache.hcatalog.pig.HCatLoader.getSchema(HCatLoader.java:195)
  at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
  ... 42 more
  Caused by: java.lang.Exception: Could not instantiate a HiveMetaStoreClient connecting to server uri:[null]
  at org.apache.hcatalog.pig.PigHCatUtil.getHiveMetaClient(PigHCatUtil.java:160)
  at org.apache.hcatalog.pig.PigHCatUtil.getTable(PigHCatUtil.java:200)
  ... 44 more
  Caused by: com.google.common.util.concurrent.UncheckedExecutionException: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
  NestedThrowables:
  java.lang.reflect.InvocationTargetException
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2234)
  at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
  at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4764)
  at org.apache.hcatalog.common.HiveClientCache.getOrCreate(HiveClientCache.java:167)
  at org.apache.hcatalog.common.HiveClientCache.get(HiveClientCache.java:143)
  at org.apache.hcatalog.common.HCatUtil.getHiveClient(HCatUtil.java:548)
  at org.apache.hcatalog.pig.PigHCatUtil.getHiveMetaClient(PigHCatUtil.java:158)
  ... 45 more
  Caused by: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
  NestedThrowables:
  java.lang.reflect.InvocationTargetException
  at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
  at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:781)
  at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:326)
  at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:195)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
  at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
  at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
  at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
  at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:313)
  at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:342)
  at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:249)
  at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:224)
  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
  at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
  at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
  at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:506)
  at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:484)
  at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:532)
  at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:406)
  at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:365)
  at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:55)
  at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:60)
  at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4953)
  at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:172)
  at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:155)
  at org.apache.hcatalog.common.HiveClientCache$CacheableHiveMetaStoreClient.<init>(HiveClientCache.java:246)
  at org.apache.hcatalog.common.HiveClientCache$4.call(HiveClientCache.java:170)
  at org.apache.hcatalog.common.HiveClientCache$4.call(HiveClientCache.java:167)
  at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4767)
  at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
  ... 51 more
  Caused by: java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
  at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325)
  at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:281)
  at org.datanucleus.store.AbstractStoreManager.<init>(AbstractStoreManager.java:239)
  at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:292)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
  at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
  at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1069)
  at org.datanucleus.NucleusContext.initialise(NucleusContext.java:359)
  at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:768)
  ... 89 more
  Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("org.apache.derby.jdbc.EmbeddedDriver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
  at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:237)
  at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:110)
  at org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:82)
  ... 107 more
  Caused by: org.datanucleus.store.rdbms.datasource.DatastoreDriverNotFoundException: The specified datastore driver ("org.apache.derby.jdbc.EmbeddedDriver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
  at org.datanucleus.store.rdbms.datasource.AbstractDataSourceFactory.loadDriver(AbstractDataSourceFactory.java:58)
  at org.datanucleus.store.rdbms.datasource.BoneCPDataSourceFactory.makePooledDataSource(BoneCPDataSourceFactory.java:61)
  at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:217)
  ... 109 more

How to run Hue with the Apache Server


Hue ships out of the box with the CherryPy HTTP server, but some users have expressed interest in having Apache HTTP Server 2 serve Hue with mod_wsgi. Their motivation is that they are more familiar with Apache or already have several Apache instances deployed.

It turns out it’s pretty simple to do. It only requires a small script, a Hue configuration option, and a configuration block inside Apache. This post describes how to have Apache serve the static content and run the Python code of Hue.

 

This script (which was just added in desktop/core/src/desktop/wsgi.py) enables any Web server that speaks WSGI to launch Hue and route requests to it:

import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "desktop.settings")

# This application object is used by the development server
# as well as any WSGI server configured to use this file.
from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()

The next step disables booting Hue from the runcpserver command. In Cloudera Manager, go to Hue > Configuration > Service-Wide > Advanced, and add the following to the hue safety valve:

(screenshot of the Hue safety valve configuration)

If you are running Hue outside of Cloudera Manager, modify desktop/conf/hue.ini with:

[desktop]
  ...
  enable_server=no
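
After restarting Hue with this setting, you can verify that the built-in CherryPy server no longer listens on its port (8888, unless you changed http_port); a quick check from the Hue host:

netstat -tlnp | grep 8888 || echo "CherryPy server is disabled"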

The final step is to configure Apache to launch Hue by adding the following to the apache.conf:

WSGIScriptAlias / $HUE_PATH/desktop/core/src/desktop/wsgi.py
WSGIPythonPath $HUE_PATH/desktop/core/src/desktop:$HUE_PATH/build/env/lib/python2.7/site-packages
WSGIDaemonProcess $HOSTNAME home=$HUE_PATH python-path=$HUE_PATH/desktop/core/src/desktop:$HUE_PATH/build/env/lib/python2.7/site-packages threads=30
WSGIProcessGroup $HOSTNAME

<Directory $HUE_PATH/desktop/core/src/desktop>
<Files wsgi.py>
Order Deny,Allow

# If apache 2.4
Require all granted

# otherwise
#Allow from all
</Files>
</Directory>

Where $HOSTNAME should be the hostname of the machine running Hue, and $HUE_PATH is where Hue is installed. If you’re using Cloudera Manager, by default it should be either /usr/lib/hue for a package install, or /opt/cloudera/parcels/CDH/lib/hue for a parcel install.
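
Note that mod_wsgi must be installed and loaded before Apache will honor the directives above. Package and service names vary by distribution, so treat the commands below as one possible sequence rather than the exact recipe (Debian/Ubuntu shown, RHEL/CentOS equivalents in comments):

# Install and enable mod_wsgi, then restart Apache so the new configuration is loaded
sudo apt-get install libapache2-mod-wsgi    # RHEL/CentOS: sudo yum install mod_wsgi
sudo a2enmod wsgi                           # RHEL/CentOS: the package loads the module via its conf.d file
sudo service apache2 restart                # RHEL/CentOS: sudo service httpd restart

Hue should then be reachable on Apache's port (80 by default) instead of the CherryPy port.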


 

Have any questions? Feel free to contact us on hue-user or @gethue!

How to use Hue with Hive and Impala configured with LDAP authentication and SSL


We previously showed in detail how to use SSL encryption with the Impala and Hive Editors. Here is now a step-by-step guide on using LDAP authentication instead of no authentication or Kerberos.

1.
HiveServer2 had SSL enabled, so the Hive Editor could not connect to it. The HiveServer2 logs showed SSL errors indicating that it received plaintext (a good hint at the cause).

Solved by adding this to the Hue Safety Valve:

[beeswax]
  [[ssl]]
  ## SSL communication enabled for this server.
  enabled=true
  ## Path to Certificate Authority certificates.
  cacerts=/etc/hue/cacerts.pem
  ## Path to the private key file.
  key=/etc/hue/key.pem
  ## Path to the public certificate file.
  cert=/etc/hue/cert.pem
  ## Choose whether Hue should validate certificates received from the server.
  validate=false

(validate was set to false because their certificates used wildcards, which caused other errors)
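
A quick way to double-check that HiveServer2 is really terminating SSL on its port (assuming the default port 10000) is to look at the certificate handshake with openssl:

openssl s_client -connect <HIVESERVER2-FQDN>:10000 </dev/null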

Note: If not using SSL, you will hit this bug: HUE-2484

2.
The same Hue behavior occurred after making the change, but now the HiveServer2 log showed an authentication failure with err=49 (LDAP invalid credentials).

So, we added the following to the Hue Safety Valve:

[desktop]
  ldap_username=
  ldap_password=

3.
Hue still showed the same behavior. HiveServer2 logs showed:

<HUE_LDAP_USERNAME> is not allowed to impersonate bob

We solved this by adding the following to HDFS > Service-Wide > Advanced > Safety Valve for core-site.xml:

<property>
  <name>hadoop.proxyuser.<HUE_LDAP_USERNAME>.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.<HUE_LDAP_USERNAME>.groups</name>
  <value>*</value>
</property>

4.
After this, the default database was displayed, but we could not run a 'show tables;' or anything else. Beeline had the same behavior.

We did a grant for the group to which the user attempting the Hive actions belonged, and then that problem went away.
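
For reference, here is a minimal sketch of such a grant run through Beeline, assuming Sentry-style SQL authorization, an existing role named analyst_role already assigned to the user's group, and an admin account allowed to run GRANT statements (adjust the host, credentials and names to your setup; extra sslTrustStore parameters may be needed depending on your certificates):

beeline -u "jdbc:hive2://<HIVESERVER2-FQDN>:10000/default;ssl=true" \
        -n <ADMIN_LDAP_USERNAME> -p <ADMIN_LDAP_PASSWORD> \
        -e "GRANT ALL ON DATABASE default TO ROLE analyst_role"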

All queries were working, and Hue was querying Hive/Impala and returning results!

(screenshot of Hue Impala charts)

 

As usual feel free to comment and send feedback on the hue-user list or @gethue!

Team retreat in Nicaragua and Belize


¡Hola!, Hello!

After a last escape in Europe, the Hue team went for some more exotic locations: Nicaragua & Belize!

After some "rough" driving from Managua to the coast, the team enjoyed some surfing and a visit to the volcano island of Ometepe. Then it was time for a jump to the Caribbean at Caye Caulker. There, lobster, lobster burritos and incredible sunsets were on the menu!

 

Hue Team

 

 

Nicaragua

(photos)

 

Belize

(photos)

 

How to deploy Hue on HDP


Guest post from Andrew, which we regularly update (last updated December 19th, 2014)

 

I recently decided to deploy Hue 3.7 from tarballs (other sources, like the packages from the 'Install' menu above, would work too) on HDP 2.2 and wanted to document some notes for anyone else looking to do the same.

Deployment Background:

  • Node Operating System:  CentOS 6.6 – 64bit
  • Cluster Manager:  Ambari 1.7
  • Distribution:  HDP 2.2
  • Install Path (default):  /usr/local/hue
  • HUE User:  hue

After compiling (some hints there), you may run into a few out-of-the-box/post-compile startup issues:

  • Be sure to set the appropriate Hue proxy user/groups properties in your Hadoop service configurations (e.g. WebHDFS/WebHCat/Oozie/etc)
  • Don’t forget to configure your Hue configuration file (‘/usr/local/hue/desktop/conf/hue.ini’) to use FQDN hostnames in the appropriate places

 

(screenshot of the Beeswax (Hive) editor)

 

Startup

Hue uses an SQLite database by default, and you may see the following error when attempting to connect to Hue at its default port (e.g. fqdn:8888):

  • File “/usr/local/hue/build/env/lib/python2.6/site-packages/Django-1.4.5-py2.6.egg/django/db/backends/sqlite3/base.py”, line 344, in execute return Database.Cursor.execute(self, query, params) DatabaseError: unable to open database file
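
A common cause is simply filesystem permissions on the directory holding the SQLite database (desktop/desktop.db by default) when Hue runs as the hue user. A sketch of the usual fix, assuming the default install path:

# Give the hue user ownership of the install tree, including desktop/desktop.db
sudo chown -R hue:hue /usr/local/hue
# Start Hue as the hue user
sudo -u hue /usr/local/hue/build/env/bin/supervisor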

 

Removing apps

For Impala (or any other app), the easiest way to remove it is to blacklist it in the hue.ini. The second-best alternative is to remove the Hue app permissions from the groups of some users.

[desktop]
app_blacklist=impala

To remove the Sentry app, blacklist 'security', but note that this will also hide the HDFS ACLs editor for now.

HDFS

Check your HDFS configuration settings and ensure that the service is active and healthy.

With Ambari, you can go to the HDFS service settings and find this under "General":

– The property is dfs.webhdfs.enabled ("WebHDFS enabled") and should be set to "true" by default.

– If a change is required, save it and start/restart the service with the updated configuration.

Ensure the HDFS service is started and operating normally.

– You could quickly check some things, such as HDFS and WebHDFS by checking the WebHDFS page:

– http://<NAMENODE-FQDN>:50070/ in a web browser, or 'curl <NAMENODE-FQDN>:50070'

Check if the processes are running using a shell command on your NameNode:

– 'ps -ef | grep "NameNode"'

By default your HDFS service(s) may not be configured to start automatically (e.g. upon boot/reboot).
Check the HDFS logs to see whether the NameNode service had trouble starting or started successfully:
- These are typically found at ‘/var/log/hadoop/hdfs/’
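
As a more direct WebHDFS check (replace <NAMENODE-FQDN> with your NameNode host), a LISTSTATUS call against the REST API should come back with a JSON listing of the HDFS root rather than an error:

curl -i "http://<NAMENODE-FQDN>:50070/webhdfs/v1/?op=LISTSTATUS"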

 

Hive Editor

By default, Hue appears to connect to the HiveServer2 service using NOSASL authentication; Hive 0.14 ships with HDP 2.2 but is not configured to use authentication by default.
  • We'll need to change our Hive configuration ('hive.server2.authentication' = 'NOSASL') for the Hue Hive Editor to work; a quick Beeline check is sketched below.
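
To confirm the authentication mode from the command line (substitute your HiveServer2 host; 10000 is the default port), Beeline should be able to connect with NOSASL:

beeline -u "jdbc:hive2://<HIVESERVER2-FQDN>:10000/default;auth=noSasl"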
HDP 2.1 (Hive 0.13) continues to carry forward the GetLog() issue with Hue's Hive Editor, e.g.:
"Server does not support GetLog()"
HDP 2.2 includes Hive 0.14 and HIVE-4629, so you will need this commit from Hue 3.8 (coming up at the end of Q1 2015), or use master, and enable it in the hue.ini:
[beeswax]
# Choose whether Hue uses the GetLog() thrift call to retrieve Hive logs.
# If false, Hue will use the FetchResults() thrift call instead.
use_get_log_api=false

Security – HDFS ACLs Editor

By default, Hadoop 2.4.0 does not enable HDFS file access control lists (FACLs)

  • We'll need to change the properties of our HDFS NameNode service to enable FACLs ('dfs.namenode.acls.enabled' = 'true'); example commands below.
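
Once the property is set and the NameNode restarted, ACLs can be exercised from the command line; the user and path below are just illustrative:

# Add a read/execute ACL entry for user bob on /tmp, then display it
hdfs dfs -setfacl -m user:bob:r-x /tmp
hdfs dfs -getfacl /tmp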

Spark

We are improving the Spark Editor and might change the Job Server; things are still pretty manual and not recommended for now.

HBase

Currently not tested (should work with Thrift Server 1)

Job Browser

Progress has never been entirely accurate for MapReduce completion; it always shows the percentage for Mappers vs. Reducers as a job progresses. The "Kill" feature works correctly.

Oozie Editor/Dashboard

Does not appear to work. The job "kill" feature reports a successful kill; however, all "Running" jobs that have been "killed" continue to show a "RUNNING" status.

 

Note: I think I figured out what's going on with Pig and Oozie: when Oozie is deployed via Ambari 1.7 for HDP 2.2, the sharelib files typically found at /usr/lib/oozie/ are missing, and in turn are not staged at hdfs:/user/oozie/share/lib/.

I’ll check this against an HDP 2.1 deployment and write the guys at Hortonworks an email to see if this is something they’ve seen as well.
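
If the sharelib really is missing, installing the Oozie client/sharelib package and then staging it on HDFS is the usual remedy; the paths and service user below differ between HDP packaging versions, so treat them as assumptions:

# Stage the Oozie sharelib on HDFS (run on the Oozie server host)
sudo -u oozie /usr/lib/oozie/bin/oozie-setup.sh sharelib create -fs hdfs://<NAMENODE-FQDN>:8020
# Verify that the Oozie server can see the sharelib
oozie admin -oozie http://<OOZIE-FQDN>:11000/oozie -shareliblist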

Pig Editor

Does not appear to work. An Oozie workflow/workspace does appear to get created, but all workflows get stuck in the “Prep” state. Must be related to a core issue with Oozie.

 

Note: make sure you have at least 2 nodes or tweak YARN to be able to launch two apps at the same time (gotcha #5)

The Pig/Oozie log looks like this:
2014-12-15 23:32:17,626  INFO ActionStartXCommand:543 - SERVER[hdptest.construct.dev] USER[amo] GROUP[-] TOKEN[] APP[pig-app-hue-script] JOB[0000001-141215230246520-oozie-oozi-W] ACTION[0000001-141215230246520-oozie-oozi-W@:start:] Start action [0000001-141215230246520-oozie-oozi-W@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]

2014-12-15 23:32:17,627  INFO ActionStartXCommand:543 - SERVER[hdptest.construct.dev] USER[amo] GROUP[-] TOKEN[] APP[pig-app-hue-script] JOB[0000001-141215230246520-oozie-oozi-W] ACTION[0000001-141215230246520-oozie-oozi-W@:start:] [***0000001-141215230246520-oozie-oozi-W@:start:***]Action status=DONE

2014-12-15 23:32:17,627  INFO ActionStartXCommand:543 - SERVER[hdptest.construct.dev] USER[amo] GROUP[-] TOKEN[] APP[pig-app-hue-script] JOB[0000001-141215230246520-oozie-oozi-W] ACTION[0000001-141215230246520-oozie-oozi-W@:start:] [***0000001-141215230246520-oozie-oozi-W@:start:***]Action updated in DB!

2014-12-15 23:32:17,873  INFO ActionStartXCommand:543 - SERVER[hdptest.construct.dev] USER[amo] GROUP[-] TOKEN[] APP[pig-app-hue-script] JOB[0000001-141215230246520-oozie-oozi-W] ACTION[0000001-141215230246520-oozie-oozi-W@pig] Start action [0000001-141215230246520-oozie-oozi-W@pig] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]

Big Data Spain 2014: Big Data Web applications for Interactive Hadoop


Big Data Web applications for Interactive Hadoop
Presented by Enrico Berti

This talk describes how open source Hue was built in order to provide a better Hadoop User Experience. The underlying technical details of its architecture, the lessons learned, and how it integrates with Impala, Search and Spark under the covers will be explained.

The presentation continues with real life analytics business use cases. It will show how data can be easily imported into the cluster and then queried interactively with SQL or through a visual search dashboard. All through your Web Browser or your own custom Web application!

This talk aims at organizations trying to put a friendly “face” on Hadoop and get productive. Anybody looking at being more effective with Hadoop will also learn best practices and how to quickly get ramped up on the main data scenarios. Hue can be integrated with existing Hadoop deployments with minimal changes/disturbances. We cover details on how Hue interacts with the ecosystem and leverages the existing authentication and security model of your company.

To sum up, attendees of this talk will learn how Hadoop can be made more accessible and why Hue is the ideal gateway for using it more efficiently, or the starting point of your own Big Data Web application.

 

The video of the talk

 
