Thursday, July 16, 2015

Cloud Formation on AWS for Cassandra + HPCC


If your primary objective is to set up a simple Cassandra cluster, then you probably want to start here:
http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installAMI.html

However, if you have an existing AWS cluster to which you want to add Cassandra, then read on.

In my case, I wanted to add Cassandra to an existing HPCC cluster.  More specifically, I wanted to be able to spin up an HPCC + Cassandra cluster with a single command.  To accomplish this, I decided to add a bit of Python scripting on top of Cloud Formation.

Amazon has a facility called Cloud Formation.  Cloud Formation reads a JSON template file and creates instances as described in that file (pretty slick).  Within that JSON, you can execute shell commands that do the heavy lifting.  The JSON file can also define parameters that the administrator then provides via the management console or via the AWS CLI.
(IMHO, it is worth installing the AWS CLI.)
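
To make the template idea concrete, here is a minimal sketch of what such a file contains, built as a Python dict and printed as JSON.  This is NOT the HPCC template used in this post: the resource name "Node", the placeholder image ID, and the echo command are made up for illustration.  Only the top-level layout (Parameters, Resources, Ref, UserData) is standard Cloud Formation.

```python
import json


def build_template():
    """Return a minimal, illustrative Cloud Formation template as a dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative single-instance stack (not the HPCC template).",
        # Parameters are supplied by the administrator at create-stack time.
        "Parameters": {
            "KeyPair": {
                "Type": "AWS::EC2::KeyPair::KeyName",
                "Description": "Name of an existing EC2 key pair",
            },
            "InstanceType": {"Type": "String", "Default": "c3.2xlarge"},
        },
        "Resources": {
            # "Node" is a hypothetical logical name for this sketch.
            "Node": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": "ami-PLACEHOLDER",  # placeholder, not a real AMI
                    "InstanceType": {"Ref": "InstanceType"},
                    "KeyName": {"Ref": "KeyPair"},
                    # Shell commands run at first boot go in UserData.
                    "UserData": {"Fn::Base64": {"Fn::Join": ["\n", [
                        "#!/bin/bash",
                        "echo 'setup commands go here' > /var/log/user-data.log",
                    ]]}},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=4))
```

Each entry under "Parameters" becomes one ParameterKey/ParameterValue pair on the command line.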

Running a Cloud Formation

First, I started with Tim Humphrey's EasyFastHPCCoAWS.  That Cloud Formation template is a great basis.  It installs the AWS CLI and copies the contents of an S3 bucket down into /home/ec2-user.  Have a look at the template file.  To get that up and running, it is a simple matter of creating a PlacementGroup, a KeyPair, and an S3 bucket, into which you copy the contents of the GitHub repo.  For simplicity, I named all of those the same thing: "realtime-hpcc".

Now, with a single command, I can fire up a low-cost HPCC cluster with the following:

aws cloudformation create-stack --capabilities CAPABILITY_IAM --stack-name realtime-hpcc --template-url https://s3.amazonaws.com/realtime-hpcc/MyHPCCCloudFormationTemplate.json --parameters \
   ParameterKey=HPCCPlacementGroup,ParameterValue=realtime-hpcc \
   ParameterKey=HPCCPlatform,ParameterValue=HPCC-Platform-5.2.2-1 \
   ParameterKey=KeyPair,ParameterValue=realtime-hpcc \
   ParameterKey=MasterInstanceType,ParameterValue=c3.2xlarge \
   ParameterKey=NumberOfRoxieNodes,ParameterValue=1 \
   ParameterKey=NumberOfSlaveInstances,ParameterValue=1 \
   ParameterKey=NumberOfSlavesPerNode,ParameterValue=2 \
   ParameterKey=RoxieInstanceType,ParameterValue=c3.2xlarge \
   ParameterKey=ScriptsS3BucketFolder,ParameterValue=s3://riptide-hpcc/ \
   ParameterKey=SlaveInstanceType,ParameterValue=c3.2xlarge \
   ParameterKey=UserNameAndPassword,ParameterValue=riptide/HIDDEN

Note that I specified the template via an HTTPS URL.  I also specified a stack name, which is what you'll use when querying AWS for status.  You can do that with the following command:

aws cloudformation describe-stacks --stack-name realtime-hpcc

With that you get a nice, clean JSON back that looks something like this:

{
    "Stacks": [
        {
            "StackId": "arn:aws:cloudformation:us-east-1:633162230041:stack/realtime-hpcc/e609e0b0-2595-11e5-97b7-5001b34a4a0a",
            "Description": "Launches instances for fast executing HPCC on AWS. Plus, it sets up and starts HPCC System.",
            "Parameters": [
                {
                    "ParameterValue": "realtime-hpcc",
                    "ParameterKey": "KeyPair"
                }...
            ],
            "Tags": [],
            "CreationTime": "2015-07-08T17:22:24.461Z",
            "Capabilities": [
                "CAPABILITY_IAM"
            ],
            "StackName": "realtime-hpcc",
            "NotificationARNs": [],
            "StackStatus": "CREATE_IN_PROGRESS",
            "DisableRollback": false
        }
    ]
}

The "StackStatus" is the key property.  You'll want to wait until it reads "CREATE_COMPLETE".
Once it completes, you can go into the management console and see your EC2 instances.
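
If you would rather not re-run that command by hand, the wait can be scripted.  What follows is a rough sketch, not one of the scripts from this post: the function names and the 30-second poll interval are my own choices, and it assumes the aws command is on the PATH and already configured.

```python
import json
import subprocess
import time


def stack_status(describe_output, stack_name):
    """Pull StackStatus for stack_name out of describe-stacks JSON text."""
    for stack in json.loads(describe_output)["Stacks"]:
        if stack["StackName"] == stack_name:
            return stack["StackStatus"]
    raise KeyError(stack_name)


def wait_for_stack(stack_name, poll_seconds=30):
    """Poll the AWS CLI until the stack is no longer in an *_IN_PROGRESS state."""
    while True:
        output = subprocess.check_output(
            ["aws", "cloudformation", "describe-stacks", "--stack-name", stack_name]
        )
        status = stack_status(output.decode("utf-8"), stack_name)
        if not status.endswith("_IN_PROGRESS"):
            return status  # e.g. CREATE_COMPLETE, or a failure/rollback status
        time.sleep(poll_seconds)
```

Calling wait_for_stack("realtime-hpcc") returns the final status, so the caller still needs to check that it is actually CREATE_COMPLETE before moving on.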

If something went wrong, you can go have a look in /var/log/user-data.log.  Tim's template conveniently redirects the output of the shell commands to that log file.

Installing Cassandra 

NOW -- to actually get Cassandra installed on the machines, I simply forked Tim's work and altered the Cloud Formation template to include the DataStax repo and a yum install of Cassandra.  And the next time I created my cluster: poof magic voodoo, Cassandra was installed!

Next I needed to configure the Cassandra instances into a cluster.   At first, I tried to do this using a shell script executed as part of the cloud formation, but that proved difficult because I wanted the IP addresses for all the nodes, not just the one on which the script was running.  I shifted gears and decided to orchestrate the configuration from python after the cloud had already formed.

I wrote a quick little Python script (configure_local_cassandra.py) that takes four parameters: the location of the cassandra.yaml file, the cluster name, the private IPs of the Cassandra nodes, and the IP of the node itself.  The script updates the Cassandra config by substituting those values into a template of the config file.  I added both the template and the script to the S3 bucket, and Cloud Formation took care of deploying them to the machines (thanks to Tim's template).
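
For the curious, the substitution step can be sketched roughly like this.  It is an illustration under assumptions, not the actual configure_local_cassandra.py: the placeholder names ($cluster_name, $seeds, $node_ip), the choice to use every node as a seed, and the choice to rewrite the file in place are all mine.  The settings themselves (cluster_name, seeds, listen_address, rpc_address) are standard cassandra.yaml keys.

```python
import sys
from string import Template

# Hypothetical template fragment; a real cassandra.yaml has many more settings.
TEMPLATE = Template("""\
cluster_name: '$cluster_name'
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "$seeds"
listen_address: $node_ip
rpc_address: $node_ip
""")


def render_config(template, cluster_name, node_ips, node_ip):
    """Fill the placeholders; all nodes are used as seeds (an assumption)."""
    return template.substitute(
        cluster_name=cluster_name,
        seeds=",".join(node_ips),
        node_ip=node_ip,
    )


def main(argv):
    """argv: script, yaml path, cluster name, comma-separated IPs, this node's IP."""
    yaml_path, cluster_name, ips_csv, node_ip = argv[1:5]
    with open(yaml_path) as f:
        template = Template(f.read())  # assumes the file holds the placeholders
    with open(yaml_path, "w") as f:
        f.write(render_config(template, cluster_name, ips_csv.split(","), node_ip))


if __name__ == "__main__" and len(sys.argv) == 5:
    main(sys.argv)
```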

Configuring Cassandra 

With that script and the template in place on each machine, the final piece is the script that gathers the IP addresses for the nodes and calls the Python script via ssh.  For this, we use the AWS CLI's ec2 commands to fetch the JSON for all of our instances.  The command looks like this:

aws ec2 describe-instances

I wrote a Python script (configure_cassandra_cluster.py) that parses that JSON and runs commands on each of the nodes via ssh.
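
Here is a rough sketch of the parsing half, plus how the per-node ssh command might be assembled.  It is not the actual configure_cassandra_cluster.py.  The function names, the ec2-user login, the config path, and the argument order passed to configure_local_cassandra.py are assumptions for illustration.  It also assumes the machine running it can reach the nodes' private addresses; if not, you would ssh to each instance's PublicIpAddress instead.

```python
import json
import subprocess


def private_ips(describe_instances_output):
    """Collect private IPs of running instances from describe-instances JSON."""
    ips = []
    for reservation in json.loads(describe_instances_output)["Reservations"]:
        for instance in reservation["Instances"]:
            running = instance["State"]["Name"] == "running"
            if running and "PrivateIpAddress" in instance:
                ips.append(instance["PrivateIpAddress"])
    return ips


def ssh_command(key_file, host, yaml_path, cluster_name, ips):
    """Build the ssh argv that runs the local config script on one node."""
    remote = "python configure_local_cassandra.py {} {} {} {}".format(
        yaml_path, cluster_name, ",".join(ips), host
    )
    return ["ssh", "-i", key_file, "-o", "StrictHostKeyChecking=no",
            "ec2-user@" + host, remote]


def configure_cluster(key_file, yaml_path, cluster_name):
    """Fetch the instance list, then configure every node in turn."""
    output = subprocess.check_output(["aws", "ec2", "describe-instances"])
    ips = private_ips(output.decode("utf-8"))
    for host in ips:
        subprocess.check_call(ssh_command(key_file, host, yaml_path, cluster_name, ips))
```

Note that a bare describe-instances returns every instance in the account and region, so in practice you would probably want to filter down to the instances belonging to the stack.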

Convenience Scripts

To keep things simple, I added a bunch of shell scripts that wrap all the command lines (so I don't need to remember all the parameters).  The shell scripts allow you to create a cluster, get the status of a cluster, and delete a cluster using a single command line:

create_stack.sh, get_status.sh, delete_stack.sh

(respectively)

Putting it all together...

To summarize, the create_stack.sh script uses aws cloudformation to create the cluster.
Then, you can watch the status of the cluster with get_status.sh.
Once formed, the configure_cassandra_cluster.py script installs, configures and starts Cassandra.

After that, you should be able to run ECL using Cassandra!

Feel free to take these scripts and apply them to other things.  And kudos to Tim Humphrey for the Cloud Formation template.


Wednesday, June 10, 2015

Amazon Echo : Syntax, Semantics, Intents and Goals: NLP over time.

So I caved.  Even with all my Apple paraphernalia, I bought an Amazon Echo.  I've had it for a little over a week, and I'm hooked.  We use it to play music, check the weather, and set timers -- all of the out-of-the-box functionality.  You may think, "It's just Siri in a room".  And that is one perspective.  From a different perspective, Alexa (the name you use to interact with Echo) will change everything, and bring the Internet of Things (IoT) to the masses.  Regardless, I'm just amazed at how far we've come with NLP.

Along those lines, I had an academic discussion this week around syntax and semantics in machine-to-machine interfaces (APIs), and how that correlates to user/consumer intent -- specifically whether or not you can infer user intent from captured REST calls.  (FWIW -- I believe you can infer intent, or at least some notion of it, although it might only be a portion of the user's goal, and may be unintentionally misaligned with the semantics of the call.)

With that conversation fresh in my mind, I found it amusing that the developer API for Echo specifically calls out "intent" and defines it as:

"In the context of Alexa apps, an intent represents a high-level action that fulfills a user’s spoken request. Intents can optionally have arguments called slots. Note that intents for Alexa apps are not related in any way to Android intents." - Echo Developer : Getting Started Guide

That made me nostalgic.  I loved my days in NLP, and it's absolutely phenomenal to see how things have played out over the years...

(NOTE: what follows is almost entirely self-serving, and you may not get anything out of it.  I am not responsible for any of the time you lose in reading it ;)

I think I was eighteen when I started working at the Natural Language Processing (NLP) group within Unisys.  I was one of the many developers building those terrible voice recognition systems on the other end of the phone when you dialed in for customer service and received anything but that. We frustrated our end-users, but -- I got to work with some amazing people: Debbie Dahl, Bill Scholz, and Jim Irwin.

And we thought we were smart.  We built all sorts of tools that helped map text into actions ("intents"), and newfangled web servers for voice recognition.  They were good times, but frankly we were only inflicting pain on people. Voice recognition wasn't there yet.  We spent our time trying to engineer the questions properly, to guide users to answer with certain terms to make it easy on the voice recognition system.  (The years were 1995-1999ish.)

For a time after that, I wanted nothing to do with voice recognition.  I still loved NLP though, and went on to build a system to automate email routing and responses for customer service. That worked because it was a numbers game.  If the system was confident enough to answer 50% of the customer inquiries, it meant they didn't need humans to respond to that subset, which saved moolah.  We went on to extend that into real-time instant messaging over the web (Kana IQ).  Again, good times and lots of brain work/patents in the process, and we congratulated ourselves for figuring out how to map a Bayesian inference engine onto a grammar. (1999-2001ish)  But there is no way we could have put that system directly in the hands of consumers with no human support.

However -- it was during this time that I started appreciating the difference between: Syntax, Semantics, Intents and Goals.  Here are the straight-up definitions:

Syntax : the arrangement of words and phrases to create well-formed sentences in a language.
Semantics : the branch of linguistics and logic concerned with meaning.
Intent : the reason for which something is done or created or for which something exists.
Goal : the object of a person's ambition or effort; an aim or desired result.

Consider the process of taking characters in a string as input and converting them into actions that help the user achieve their goal.  First, you need to consider syntax.  In this part of the process, you are converting a useless string of characters into related tokens.  There are lots of ways to do this, and multiple pieces to the puzzle (e.g. Part of Speech Tagging).  Back at Brown, I had the pleasure of studying under Eugene Charniak, and his book on Statistical Natural Language Learning became a favorite of mine (after initially hating it for a semester ;).  Ever since that course, I've fallen back on Context-Free Grammars (CFGs) and chart-parsing to attempt to relate word tokens to each other in a sentence.

Once you have related tokens, and you know the parts of speech (ADJ, NOUN, etc), and how they relate (ADJ _modifies_ NOUN) via chart-parsing, you can attempt to assign semantics.  IMHO -- this is the hard part.  
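
To give a flavor of the syntax step, here is a toy chart recognizer in the CYK style.  It is only a sketch: the five-word lexicon and the handful of binary rules are invented for this example, and it is not code from any of the systems mentioned in this post.  The chart records, for every span of the sentence, which grammar labels can cover it.

```python
from collections import defaultdict

# Toy lexicon and grammar, made up for this example.
LEXICON = {
    "the": {"DET"}, "big": {"ADJ"}, "dog": {"NOUN"},
    "cat": {"NOUN"}, "chased": {"VERB"},
}
# Binary rules: (left label, right label) -> labels for the combined span.
RULES = {
    ("ADJ", "NOUN"): {"NOM"},   # ADJ modifies NOUN
    ("DET", "NOM"): {"NP"},
    ("DET", "NOUN"): {"NP"},
    ("VERB", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}


def parse_chart(tokens):
    """Return chart[(start, end)] = labels that can span tokens[start:end]."""
    n = len(tokens)
    chart = defaultdict(set)
    # Width-1 spans come straight from the part-of-speech lexicon.
    for i, token in enumerate(tokens):
        chart[(i, i + 1)] |= LEXICON.get(token, set())
    # Wider spans are built by joining two adjacent, already-filled spans.
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        chart[(start, end)] |= RULES.get((left, right), set())
    return chart


def is_sentence(text):
    """True if the whole input can be labeled S under the toy grammar."""
    tokens = text.lower().split()
    return "S" in parse_chart(tokens)[(0, len(tokens))]
```

With this grammar, is_sentence("the big dog chased the cat") holds, while a scrambled word order does not parse.  Nothing here touches meaning yet; it only establishes which tokens relate to which.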

After Kana, I took a job with a company that recorded patient/doctor conversations, transcribed them, and then attempted to perform computational linguistics on them.  As you can imagine, trying to discern between a mention of a *symptom* and a *side-effect* is incredibly difficult.  You not only need solid parsing to associate the terms in the sentence, but you need a knowledge base to know what those terms mean, and the context in which they are being used.  In this phase, we are assigning meaning to the terms (i.e. semantics).  Sure, we used Hadoop for the NLP processing, but more horsepower didn't translate into better results... (now 2008-2010ish)

Even assuming you can properly assign semantics, you may misinterpret the intent of a communique.  Even with human-to-human communication, this happens all the time.  English is a horrible language.  And the same thing can happen with machines.  However, you have to assume that, more often than not, if you can parse the sentence (i.e. assign syntax) and interpret meaning (i.e. semantics), you can infer some notion of intent from the textual gibberish. Otherwise, communication is just fundamentally broken.

With all this in mind -- back to Echo.

Amazon nailed it.  I see the 20 years of my NLP frustration, solved in a 12" cylinder.  It rarely misses on the voice recognition.  And they have a fantastic engine taking phonemes to "intent", allowing developers to plug in at that highest layer.  Boo yah.  That's empowering.  I might just build an app for Echo.  Maybe an AWS integration -- "Alexa, please expand my AWS cluster by 10 nodes".  It'd be therapeutic. ;)