Tuesday, October 21, 2014

Brew trouble w/ Yosemite (gcc47 and subversion)

Mileage may vary on this one, but after upgrading to Yosemite, I encountered two issues with brew.

The first was with gcc47. After I ran "brew upgrade", it got stuck on the gcc47 formula, which errored with:

==> Upgrading gcc47
gcc47: OS X Mavericks or older is required for stable.
Use `brew install devel or --HEAD` for newer.
Error: An unsatisfied requirement failed this build.

I decided it was better to move to gcc48 than to bother trying to get gcc47 working. To do that, just install gcc48 and remove gcc47 with:

➜  ~  brew install gcc48
➜  ~  brew uninstall gcc47

That fixed that. Then I ran into an issue with subversion, which relied on serf. The build failed with:

#include <apr_pools.h>
1 error generated.
scons: *** [context.o] Error 1
scons: building terminated because of errors.

There is an open Homebrew issue for that one: https://github.com/Homebrew/homebrew/issues/33422

But you can get around the issue by installing the Xcode command line tools with:

➜ xcode-select --install

If you run into the same issues, hopefully that fixes things for you.

Example: Parsing tab delimited file using OpenCSV

I prefer opencsv for CSV parsing in Java. That library also supports parsing tab-delimited files. Here's how:

Just a quick gist:

Tuesday, October 14, 2014

Sqoop Oracle Example : Getting Started with Oracle -> HDFS import/extract

In this post, we'll get Sqoop (1.99.3) connected to an Oracle database, extracting records to HDFS.

Add Oracle Driver to Sqoop Classpath

The first thing we'll need to do is copy the Oracle JDBC jar file into the Sqoop lib directory. Note that this directory may not exist, so you may need to create it.

For me, this amounted to:
➜  sqoop  mkdir lib
➜  sqoop  cp ~/git/boneill/data-lab/lib/ojdbc6.jar ./lib

Add YARN and HDFS to Sqoop Classpath

Next, you will need to add the HDFS and YARN jar files to the classpath of Sqoop.  If you recall from the initial setup, the classpath is controlled by the common.loader property in the server/conf/catalina.properties file.  To get things submitting to the YARN cluster properly, I added the following additional paths to the common.loader property:


Note, the added paths.

*IMPORTANT*: Restart your Sqoop server so it picks up the new jar files (including the driver jar!).

Create JDBC Connection

After that, we can fire up the client, and create a connection with the following:

bin/sqoop.sh client
sqoop> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: my_datasource
Connection configuration
JDBC Driver Class: oracle.jdbc.driver.OracleDriver
JDBC Connection String: jdbc:oracle:thin:@change.me:1521:service.name
Username: your.user
Password: ***********
JDBC Connection Properties:
There are currently 0 values in the map:
Security related configuration options
Max connections: 10
New connection was successfully created with validation status FINE and persistent id 1

Create Sqoop Job

The next step is to create a job. This is done with the following:

sqoop> create job --xid 1 --type import
Creating job for connection with id 1
Please fill following values to create new job object
Name: data_import

Database configuration

Schema name: MY_SCHEMA
Table name: MY_TABLE
Table SQL statement:
Table column names:
Partition column name: UID
Nulls in partition column:
Boundary query:

Output configuration

Storage type:
  0 : HDFS
Choose: 0
Output format:
Choose: 0
Compression format:
  0 : NONE
Choose: 0
Output directory: /user/boneill/dump/

Throttling resources

New job was successfully created with validation status FINE  and persistent id 3

Everything is fairly straightforward. The output directory is the HDFS directory to which the output will be written.

Run the job!

This was actually the hardest step because the documentation is out of date (AFAIK). Instead of using "submission", as the documentation states, use the following:

sqoop> start job --jid 3
Submission details
Job ID: 3
Server URL: http://localhost:12000/sqoop/
Created by: bone
Creation date: 2014-10-14 13:27:57 EDT
Lastly updated by: bone
External ID: job_1413298225396_0001
2014-10-14 13:27:57 EDT: BOOTING  - Progress is not available

From there, you should be able to see the job in YARN!

After a bit of churning, you should be able to go over to HDFS and find your files in the output directory.

Best of luck all.  Let me know if you have any trouble.

Monday, October 13, 2014

Sqoop 1.99.3 w/ Hadoop 2 Installation / Getting Started Craziness (addtowar.sh not found, common.loader, etc.)

We have a ton of data in relational databases that we are looking to migrate onto our Big Data platform. We took an initial look around and decided Sqoop might be worth a try. I ran into some trouble getting Sqoop up and running. Herein lies that story...

The main problem is the documentation (and Google). It appears as though Sqoop changed install processes between minor dot releases. Google will likely land you on this documentation:

That documentation mentions a shell script, ./bin/addtowar.sh. That shell script no longer exists in Sqoop version 1.99.3. Instead, you should reference this documentation:

In that documentation, they mention the common.loader property in server/conf/catalina.properties.   If you haven't been following the Tomcat scene, that is the new property that allows you to load jar files onto your classpath without dropping them into $TOMCAT/lib, or your war file. (yuck)

To get Sqoop running, you'll need all of the Hadoop jar files (and their transitive dependencies) on the CLASSPATH when Sqoop/Tomcat starts up. Unless you add all of the Hadoop jar files to this property, you will end up with any or all of the following ClassNotFoundException/NoClassDefFoundError errors in your log file (found in server/logs/localhost*.log):

java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobClient
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
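
If the log is long, it can help to boil it down to the distinct classes that failed to load. The following is a minimal sketch, not something Sqoop ships: missing_classes is a hypothetical helper name, and it assumes the log lines look like the three above.

```shell
# Hypothetical helper (not part of Sqoop): list the distinct class names
# reported as ClassNotFoundException / NoClassDefFoundError in the given logs.
missing_classes() {
  # -h: no file names, -o: print only the matching part of each line
  grep -hoE '(ClassNotFoundException|NoClassDefFoundError): [A-Za-z0-9_./$]+' "$@" \
    | sed 's/.*: //' \
    | tr '/' '.' \
    | LC_ALL=C sort -u   # NoClassDefFoundError uses slashes; normalized to dots above
}
```

Running missing_classes server/logs/localhost*.log prints one class name per line, which gives a checklist of jars to hunt down.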

Through trial and error, I found all of the paths needed for the common.loader property.  I ended up with the following in my catalina.properties:


That got me past all of the classpath issues.  Note, in my case /Users/bone/tools/hadoop was a complete install of Hadoop 2.4.0.
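
As a rough sketch of how such a list can be generated rather than typed by hand: the helper below (hypothetical name hadoop_loader_paths) prints comma-separated jar globs for the common, hdfs, mapreduce, and yarn modules. It assumes the share/hadoop/<module> layout of a stock Apache Hadoop 2.x tarball. A packaged distribution may put jars elsewhere, so treat the output as a starting point, not the definitive list.

```shell
# Hypothetical helper: print comma-separated jar globs for a Hadoop 2.x
# tarball install. Assumption: the stock share/hadoop/<module>[/lib] layout.
hadoop_loader_paths() {
  hadoop_home=${1%/}   # strip one trailing slash, if present
  paths=""
  for module in common hdfs mapreduce yarn; do
    for dir in "share/hadoop/$module" "share/hadoop/$module/lib"; do
      # The glob stays quoted so the shell does not expand it;
      # Tomcat resolves the *.jar pattern itself.
      paths="${paths:+$paths,}$hadoop_home/$dir/*.jar"
    done
  done
  printf '%s\n' "$paths"
}
```

Append the output of hadoop_loader_paths /Users/bone/tools/hadoop to the existing common.loader value, separated by a comma, and restart the server.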

I also ran into this exception:

Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): /etc/hadoop/conf/

That path has to point to your Hadoop conf directory.   You can find this setting in server/conf/sqoop.properties.  I updated mine to:
(Again, /Users/bone/tools/hadoop is the directory of my hadoop installation)
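
A quick sanity check before restarting can save a round trip through the logs. This sketch assumes the setting is the org.apache.sqoop.submission.engine.mapreduce.configuration.directory key (that is the name in the 1.99.3 sqoop.properties as far as I know; verify it against your own file). check_hadoop_conf_dir is a hypothetical helper, not something Sqoop ships.

```shell
# Hypothetical helper: read the Hadoop conf directory from a sqoop.properties
# file and check that it is a directory the current user can read and enter.
check_hadoop_conf_dir() {
  props=$1
  # Assumed key name; confirm it matches the one in your sqoop.properties.
  key='org.apache.sqoop.submission.engine.mapreduce.configuration.directory'
  # Take the last uncommented assignment of the key.
  dir=$(grep -F "$key=" "$props" | grep -v '^[[:space:]]*#' | tail -n 1 | cut -d= -f2-)
  if [ -d "$dir" ] && [ -r "$dir" ] && [ -x "$dir" ]; then
    printf 'ok: %s\n' "$dir"
  else
    printf 'invalid: %s\n' "$dir"
    return 1
  fi
}
```

For example, check_hadoop_conf_dir server/conf/sqoop.properties should print an "ok:" line once the path points at a real Hadoop conf directory.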

OK, now you should be good to go!

Start the server with:
bin/sqoop.sh server start

Then, the client should work! (as shown below)

bin/sqoop.sh client
sqoop:000> set server --host localhost --port 12000 --webapp sqoop
Server is set successfully
sqoop:000> show version --all
client version:
  Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
  Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
server version:
  Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b


From there, follow this:

Happy sqooping all.

Tuesday, October 7, 2014

Diction in Software Development (i.e. Don't be a d1ck!)

Over the years, I've come to realize how important diction is in software development (and life in general). It may mean the difference between a 15-minute meeting where everyone nods their heads and a day-long battle of egos (especially when you have a room full of passionate people).

Here are a few key words and phrases I've incorporated into my vernacular. Hopefully, these will help out next time you are in an architecture/design session (or a conversation with your significant other =).

"I appreciate X":
Always use the phrase, "I appreciate that..." in response to a point.  But more importantly, *mean* it. It is an age-old adage, but when talking, it is best to listen.  Once you've heard the other party, try to understand and appreciate what they are saying.  Then, let them know that you appreciate their point, before adding additional information to the conversation.  (tnx to +Jeff Klein for this one)

"I am not passionate about X"
To drive consensus, I try to hold focused design discussions. During those discussions, I try to squash tangential topics. I used to say, "I don't care about that, do whatever, we're focused on X". Obviously, that would aggravate the people who did care about the tangent. These days, I use the phrase, "I am not passionate about that...". People have different value systems. Those value systems drive different priorities. It is important to acknowledge that, while also keeping discussions focused. (tnx to +Bob Binion for this one)

"What is the motivation / thought process behind X?"
People enter meetings with different intentions. Those intentions often drive their positions. It is tremendously important to understand those motivations, especially in larger companies or B2B interactions where organizational dynamics come into play. In those contexts, success is often gauged by the size of one's department. Thus, suggesting an approach that eliminates departments of people makes for a difficult conversation. Sometimes you don't have the right people in the room for that conversation. I've found that it is often useful to tease out the motivations and thought processes behind positions. Then you can address them directly, or at least "appreciate" =) that motivations are in conflict.

Eliminate the "But's"
This is a simple change, but a monumental difference in both diction and philosophy.  If you listen to *yourself*, you'll likely make statements like, "X is great, but consider Y".  You'll be surprised how many of these can actually be changed to, "X is great, *and* consider Y".    Believe me, this changes the whole dynamic of the conversation.  

AND, this actually begins to change the way you think...   

When you view the world as a set of boolean tradeoffs, "either you can have X or you can have Y", you never consider the potential of having X *and* Y.  Considering that possibility is at the root of many innovations.  (e.g.  I want multi-dimensional analytics AND I want them in real-time!) 

F. Scott Fitzgerald said, "The test of a first-rate intelligence is the ability to hold two opposed ideas in the mind at the same time, and still retain the ability to function."  These days, this is how I try to live my life, embracing the "what if?" and then delivering on it. =)

(tnx to +Jim Stogdill and +William Loftus for this perspective)