Category Archives: Machine Learning


Anatomy of a Robot

A while back, I wrote a blog post about spherical robots.   I had taken it upon myself to learn a bit more about robotics with the intention of building a simple autonomous robot.  Well, over two years later, I’m at a point where I can do some actual robotics work.  I look back on what I’ve learned… and the rabbit hole of a trek that led me here.

The Coursera Rabbit Hole


What goes into a robot?  Well, naively I thought that you hook up servos and sensors to some kind of microcontroller and away you go.  It’s the Lego Mindstorms version of robotics.  Of course, that *IS* one way to look at robots… but really, that’s just the beginning… the “Hello World” program of building robots.  I wanted to build something a little more sophisticated than what the “recommended age 12-adult” crowd gets.

Well, for the answer we look to the Control of Mobile Robots class offered on Coursera.  In this class, Dr. Magnus Egerstedt introduces control systems.  The class itself does not get into the hardware of building robots, but digs into the abstraction layers necessary for successfully modeling and controlling things that interact with the real world.

What exactly do I mean when I say “control things”?  Well, think about yourself for a moment.  If you’re standing and someone shoves you, you are able to react in a way that keeps you stable and standing… or, simply put, you’re in control of your body.  The act of keeping yourself upright is a complex set of muscle movements that need to be carried out correctly, but you don’t need to think about how to control each muscle… you just do it.  The instinctive impulse to lean or step is handled by your innate control system.

It’s not enough, however, for a robot to be “controllable”.  My goal is to build a robot that’s autonomous.  That requires some form of higher-level artificial intelligence.  It turns out that Coursera offers another class geared toward exactly this: Artificial Intelligence Planning!  In this class, Dr. Gerhard Wickler and Professor Austin Tate take you through a survey of programmatic problem-solving algorithms.  I was amused to learn that, like all other computer science problems, artificial intelligence problem solving comes down to a search algorithm.

At the end of this course, you’ll be able to write a program that, given some set of circumstances and a corresponding set of possible actions, figures out what to do to accomplish its goals… assuming that some sequence of actions can actually achieve them.
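To make the “planning is search” idea concrete, here’s a toy breadth-first planner.  The state names and robot-arm actions below are invented for illustration; the course works with much richer STRIPS-style representations, so treat this as a sketch of the idea rather than the course’s actual algorithm.

```java
import java.util.*;

// Toy breadth-first planner: a state is just a string, and an action
// moves the world from one named state to another. The actions here
// are hypothetical, not from the course material.
public class TinyPlanner {
    static class Action {
        final String name, from, to;
        Action(String name, String from, String to) {
            this.name = name; this.from = from; this.to = to;
        }
    }

    // Breadth-first search over states; returns the shortest sequence of
    // action names that reaches the goal, or null if it's unreachable.
    static List<String> plan(String start, String goal, List<Action> actions) {
        Map<String, List<String>> paths = new HashMap<>();
        paths.put(start, new ArrayList<>());
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);
        while (!frontier.isEmpty()) {
            String state = frontier.poll();
            if (state.equals(goal)) return paths.get(state);
            for (Action a : actions) {
                if (a.from.equals(state) && !paths.containsKey(a.to)) {
                    List<String> path = new ArrayList<>(paths.get(state));
                    path.add(a.name);
                    paths.put(a.to, path);
                    frontier.add(a.to);
                }
            }
        }
        return null; // no combination of actions achieves the goal
    }

    public static void main(String[] args) {
        List<Action> actions = Arrays.asList(
            new Action("openGripper", "atBlock", "ready"),
            new Action("grasp", "ready", "holding"),
            new Action("lift", "holding", "done"));
        System.out.println(plan("atBlock", "done", actions));
        // prints [openGripper, grasp, lift]
    }
}
```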

This led me to the next problem… perception.  An autonomous robot has sensors, and it needs to be able to figure out some “state” of the universe in order to use its problem-solving capabilities.  How on earth do you map images/sounds/echolocation to logical states of the universe?  Through machine learning.  As it turns out, Coursera offers a LOT of classes on exactly this.  The most notable of these is Coursera co-founder Andrew Ng‘s class on Machine Learning…  Here, you learn all kinds of algorithms for automatically classifying and identifying logical states based on noisy or confusing input.

A more advanced class that I really enjoyed focused on state-of-the-art neural networks.  It’s called Neural Networks for Machine Learning and is taught by Dr. Geoffrey Hinton.  The class goes into great depth on various kinds of neural networks, and it totally blew my mind.  I have no doubt that the correct application of neural nets with the right kind of self-motivating planner will lead to formidable AIs.

Putting it all together

First let’s talk about hardware.  Below is a list of the hardware I’m going to use, and I’ll pair each piece with what I feel may be its human-anatomy counterpart.

I’m going to use an Arduino Uno as the primary interface with all of my robot’s actuators.  It represents the spinal cord and instinctive nervous system of the robot.  The Arduino is a very simple microcontroller that isn’t terribly fast, and it doesn’t have much in the way of memory, but it does have extremely easy interfaces with motors and sensors.  This makes it ideal for running a closed-loop system.  (See the Control of Mobile Robots class for details.)
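As a sketch of what that closed loop actually does: measure the state, compare it with the setpoint, and issue a proportional correction.  The “plant” below is a made-up first-order system, not a model of any real motor, so the numbers are purely illustrative.

```java
// A minimal proportional (P) controller of the kind a closed-loop system
// runs continuously. The plant is an invented first-order system whose
// state creeps toward whatever we command; real hardware is messier.
public class PController {
    static double run(double setpoint, double kp, int steps) {
        double state = 0;
        for (int i = 0; i < steps; i++) {
            double error = setpoint - state; // how far are we from the goal?
            double command = kp * error;     // proportional correction
            state += 0.1 * command;          // toy plant response
        }
        return state;
    }

    public static void main(String[] args) {
        // With kp = 1.0, the state settles onto the setpoint.
        System.out.println(run(1.0, 1.0, 200));
    }
}
```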

Connected to the Arduino will be a Raspberry Pi.  The Pi will be the brains of the robot; all higher-order problem solving will occur here.  The brain and the spinal cord will talk to each other using SPI, with the Raspberry Pi naturally acting as the master of the SPI bus.  As the robot gets more complex, it might be necessary to attach more than one microcontroller (maybe not all Arduinos)… especially if I start working with more complex sensors.

The supports and overall skeleton of my robot will be created with ShapeLock.  It’s a material that can be melted, shaped, and re-used over and over.  It claims to be machinable (using my Dremel) and durable.  I imagine that if I need stronger load-bearing parts, I can prototype in ShapeLock and carve some other material based on the prototype.  Wood is a likely candidate.

Okay.  The big pieces are out of the way.  Now the fun stuff: what sensors and servos will I use?  I have a variety of electric motors, solenoids, and steppers that I picked up from Adafruit.  It’s likely that my first robot will be a simple differential-drive deal… but eventually I’d like to go back to the ideas in my original blog post and create a spherical robot.  In the end, the actual drive system and sensors don’t matter that much… they’re just the accessories of a Mr. Potato Head.  All interchangeable.

Perceptrons as a digit classifier

It’s been a while since I posted.  I’ve continued to mess around with machine learning.  As a seemingly natural extension of Andrew Ng‘s class, I took Geoffrey Hinton‘s Neural Networks for Machine Learning class.  I have to say, it totally blew my mind.  If we haven’t already gotten to a near-sentient AI, I’m convinced that we’re very close.

For my own exploration of machine learning, I’ve decided to work through the family of neural network types in a similar order to the sequence presented in Geoffrey Hinton’s class.  That means I start with the very first type of neural network: the Perceptron.

A Perceptron takes multiple input values and makes a single yes/no decision about them by thresholding a weighted sum.  I took a collection of MNIST digits to see how well a Perceptron would do at identifying them.  You’d imagine that this would be a natural fit for a Perceptron, as digits are reasonably distinct… but I discovered that there really is not enough information stored within the weights of a Perceptron to make it an effective classifier for MNIST.
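The learning rule itself is tiny.  This is a sketch rather than my actual MNIST code: real digits are 28x28 = 784 inputs, so the two-pixel “images” below are made up just to show the mechanics.  Weights only move when the guess is wrong, nudged toward (or away from) the input.

```java
// A minimal perceptron for binary classification. The toy dataset labels
// an input positive iff its first "pixel" is lit; it stands in for the
// much larger "is this a 3?" MNIST problem discussed above.
public class Perceptron {
    double[] w;   // one weight per input pixel
    double bias;

    Perceptron(int inputs) { w = new double[inputs]; }

    // Weighted sum pushed through a hard threshold.
    int predict(double[] x) {
        double z = bias;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return z > 0 ? 1 : 0;
    }

    // The perceptron learning rule: update only on a wrong guess.
    void train(double[] x, int label) {
        int error = label - predict(x); // -1, 0, or +1
        for (int i = 0; i < w.length; i++) w[i] += error * x[i];
        bias += error;
    }

    public static void main(String[] args) {
        double[][] xs = {{1, 0}, {0, 1}, {1, 1}, {0, 0}};
        int[] ys = {1, 0, 1, 0};
        Perceptron p = new Perceptron(2);
        for (int epoch = 0; epoch < 10; epoch++)
            for (int i = 0; i < xs.length; i++) p.train(xs[i], ys[i]);
        for (int i = 0; i < xs.length; i++)
            System.out.println(p.predict(xs[i])); // matches ys: 1 0 1 0
    }
}
```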

Here’s what the Perceptron weights looked like for the digit 3.  White means positive coefficients, black means negative, and gray means near-zero.  The images read from left to right and represent the state after increasing batches of 1000 labeled cases.


Here you can see that the more batches you run against the Perceptron, the more complex the weights get.  These complexities are probably overfitting.  I thought, meh, let me see how well it performed.  Now, since 3s only represent about 10% of the label data, if I always guessed “not a 3”, I would be correct 90% of the time.  This means that I wanted a value higher than 90%.  It turns out that my classifier scored 90.26%, which made me suspect that it was guessing “not a 3” every time.  From actually looking at the guesses, though, I realized that no… the Perceptron was actually guessing based on the data, so I needed to dig a bit more into what it was doing.  Here’s the distribution of its answers:

Correct Negative: 4034
Correct Positive: 479
False Negative: 466
False Positive: 21

So for real, it was guessing right about 90% of the time.  What screws it over ends up being its false negatives.  Blah.

I decided to try something.  Instead of weighting positive and negative cases equally when determining the weights, I thought I would add a term called alpha which weights positive and negative cases differently.  In this case, since the positive case only happens about one time in ten, I wanted to weight the negative cases by 1/10th, so I set alpha to 0.1.
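Here’s roughly what that alpha term looks like in the update rule.  This is a sketch in my own naming, not the exact code from the experiment: negative (label = 0) examples move the weights by only a fraction alpha of a full step, so the abundant “not a 3” cases stop drowning out the rare positives.

```java
// Sketch of an alpha-weighted perceptron update. alpha scales the step
// taken on negative examples; alpha = 0.1 mirrors the "positives are
// about 1 in 10" situation described above. Illustrative only.
public class AlphaPerceptron {
    double[] w;
    double bias;
    final double alpha;

    AlphaPerceptron(int inputs, double alpha) {
        w = new double[inputs];
        this.alpha = alpha;
    }

    int predict(double[] x) {
        double z = bias;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return z > 0 ? 1 : 0;
    }

    void train(double[] x, int label) {
        double error = label - predict(x);
        if (label == 0) error *= alpha; // down-weight the negative class
        for (int i = 0; i < w.length; i++) w[i] += error * x[i];
        bias += error;
    }

    public static void main(String[] args) {
        AlphaPerceptron p = new AlphaPerceptron(1, 0.1);
        p.bias = 1.0;               // force a (wrong) positive prediction
        p.train(new double[]{1.0}, 0);
        System.out.println(p.w[0]); // the nudge is -0.1, not a full -1.0
    }
}
```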


Here is an image of the coefficients using the values weighted by alpha.  You can see, especially in the earlier batches, that the shape of the 3 is a lot fuzzier, and there’s less complexity even in the later iterations.  So how well did THIS set of coefficients work?

Correct Negative: 3962
Correct Positive: 472
False Negative: 538
False Positive: 28

It did a lot worse.  It had a lot more false negatives, and while the false positives and correct positives also dropped, it wasn’t by much.  It seems like the fuzzier coefficients caused the Perceptron to be less certain about any particular trial.


Perception and PCA

In Machine Learning, you can use Principal Component Analysis as a lossy transform that lets you reduce the complexity of data in order to improve the performance of computationally expensive operations.

So like, what does PCA actually do?  It crunches multiple dimensions down into fewer of them… in the simplest case, a single dimension.  Consider for a moment the idea of BMI, or Body Mass Index.  It takes two dimensions, mass and height, and crunches them into a single number which somewhat describes both.  The diagram to the right expresses this idea of shrinking the number of dimensions.  Here, Dimension X is a conglomerate of Dimensions A and B such that position x corresponds to point a on Dimension A and point b on Dimension B.
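For the two-dimensional case, the whole procedure can be sketched end-to-end: center the data, build the 2x2 covariance matrix, take its leading eigenvector (which has a closed form for a symmetric 2x2 matrix), and project each point onto it.  The height/mass-style data implied here is made up for illustration.

```java
// A 2-D PCA sketch: find the direction of maximum variance and project
// each point onto it, collapsing Dimensions A and B into "Dimension X".
public class Pca2D {
    // Project 2-D points onto their first principal component.
    static double[] firstComponent(double[][] pts) {
        int n = pts.length;
        double mx = 0, my = 0;
        for (double[] p : pts) { mx += p[0]; my += p[1]; }
        mx /= n; my /= n;
        // Covariance matrix [[a, b], [b, c]] of the centered data.
        double a = 0, b = 0, c = 0;
        for (double[] p : pts) {
            double dx = p[0] - mx, dy = p[1] - my;
            a += dx * dx; b += dx * dy; c += dy * dy;
        }
        a /= n; b /= n; c /= n;
        // Leading eigenvector of a symmetric 2x2 matrix, in closed form.
        double lambda = (a + c) / 2 + Math.sqrt((a - c) * (a - c) / 4 + b * b);
        double vx, vy;
        if (Math.abs(b) > 1e-12) { vx = lambda - c; vy = b; }
        else if (a >= c)         { vx = 1; vy = 0; }
        else                     { vx = 0; vy = 1; }
        double norm = Math.hypot(vx, vy);
        vx /= norm; vy /= norm;
        // Each point's single coordinate along the new dimension.
        double[] projected = new double[n];
        for (int i = 0; i < n; i++)
            projected[i] = (pts[i][0] - mx) * vx + (pts[i][1] - my) * vy;
        return projected;
    }

    public static void main(String[] args) {
        double[] proj = firstComponent(new double[][]{{0, 0}, {1, 1}, {2, 2}, {3, 3}});
        System.out.println(java.util.Arrays.toString(proj));
    }
}
```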

As I’ve learned more about algorithmic learning, I’ve found myself believing that we humans really learn in similar ways.  This led me to the observation that perhaps the 4 dimensions of space-time may not actually be “real” dimensions at all.  Maybe our brains take in the high dimensionality described by String Theory and collapse it, through electro-chemical processes similar to PCA, into the 4 dimensions we interact with.

It could be that the world we live in is far more complex than we perceive. We could be thinking in the blue Dimension X when really we exist in A and B.

Evil Robots – It’s all in the LEDs

By picking up machine learning and micro-controllers, I’ve found myself at a juncture… a natural spot where these two hobbies coincide.  Robots.

I can honestly say that I’ve never really thought much about Asimov or his Three Laws of Robotics until this point.  So… what are these three laws?  And what the hell do they do?

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.

They make perfect sense, right?  These laws make it clear that robots shouldn’t harm humans.  The thing is, though… a huge body of sci-fi exists which totally disregards the three laws.  Seriously.  I bet you could name at least three evil robots.  Why didn’t their creators program in the three laws?  Well, primarily, it’d make for a pretty boring plot… but I actually think that in real life we won’t see robots programmed with the three laws because, well… these laws are hard to code!  I mean, how do you even begin to go about coding them?  I have a more obvious and far easier way of preventing evil robots from taking over — DON’T USE RED LEDS!

Let’s take a look at some well known evil robots and let’s see if we can identify a common characteristic. Ok?

Probably the first evil AI that comes to mind is the HAL 9000.  This lovable computer from Arthur C. Clarke’s 2001 attempts to kill the entire crew of the spacecraft Discovery.  It would have succeeded, too, if it weren’t for a meddling kid named Dave Bowman.

I’m not a huge Kubrick fan, but I must admit that he did an excellent job of making HAL seem very creepy.  You’ll note that the camera mysteriously has a bright red LED glowing right smack in the middle.  I’m not really sure how a camera can operate with light being projected through its aperture, but meh.  It’s the future… oh wait… it should have happened 11 years ago… but that’s a tangent.  Bottom line: Evil AI.  Red LED.


The second robot AI I’d like to point out came into being about a decade after 2001: A Space Odyssey: the Cylon.  You’ll notice the scanning visual thingy on this robot?  Yes.  Red LED.  It marks an entire race of sentient AIs whose entire purpose is to eradicate humanity from the cosmos.  Okay, I know what you’re thinking… didn’t KITT also have a red LED sensor?  KITT wasn’t evil!  Hah.  The original version of that AI was KARR, which WAS evil.  KITT is just pretending to be good to piss off its evil twin.

Skipping forward to the 80s, we come across The Terminator.  This robot is doubly evil… it’s got not one but TWO red LEDs.  The Terminator is the progeny of an AI that tricked the superpowers of Earth into engaging in nuclear war.  That’s not just evil… it’s totally passive-aggressive too.  I’m pretty sure that SkyNET had red LEDs all over it.

Need more convincing? Okay. Let’s go.

Here are the “Squiddies” from The Matrix.  These AIs are all about rending humans limb from limb; they are basically ruthless killing machines.  Note the sheer number of red lights.  I rest my case.


So like, what the hell?  Why use red LEDs at all?  It probably has to do with power.  I mean, power corrupts, right?  Let’s take a look at one more example.  This should hammer home the corruptive power of the red LED.

Otto Octavius, a mild-mannered scientist, creates an 8-limbed exoskeleton that’ll revolutionize the world.  He uses red LEDs for their power… but cancels it out by building a rather fragile-looking anti-evil circuit into the suit.  Learn from Otto’s mistake.  If you make yourself some kind of futuristic, armored, super-powered exoskeleton?  DON’T MAKE THE ANTI-EVIL CIRCUIT THE MOST FRAGILE PART!

So okay.  What happens if we don’t use red LEDs?  Are there any examples of powerful robots who don’t have them?  Maybe we *NEED* to use red LEDs… I’ll leave you with one final image.  Come to your own conclusions.


Approximating complex functions using Linear Regression

So far I’ve only been validating my Linear Regression algorithm using actual linear data.

Something that was pointed out in Andrew Ng‘s class: though linear regression is inherently… well… linear, the dataset that you try to learn against need not be scattered around a straight line.  It all depends on what features you use.

I was curious to see what kind of functions I could have Skynet approximate… so I chose two: e^x and sin(x).  You can see in the picture above that I generated a widely scattered dataset around a small section of e^x.  The blue line running through the center represents the values generated by my hypothesis.  The features used in this case were created by raising x to higher powers (in this case, up to x^15)… in other words, my features were x, x^2, x^3, … x^15.
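The feature trick itself fits in a few lines.  The “linear” in linear regression refers to the parameters theta, not to x, so feeding the model powers of x yields a curved hypothesis.  The helper below is illustrative rather than Skynet’s actual code; the power 15 just matches the experiment above.

```java
// Expanding a single input x into polynomial features [x, x^2, ..., x^p]
// lets plain linear regression fit a curve: the hypothesis is nonlinear
// in x but still linear in the learned parameters theta.
public class PolyFeatures {
    static double[] expand(double x, int maxPower) {
        double[] features = new double[maxPower];
        double term = 1;
        for (int i = 0; i < maxPower; i++) {
            term *= x;           // x^(i+1)
            features[i] = term;
        }
        return features;
    }

    // The hypothesis is a dot product of theta with the expanded features,
    // with theta[0] acting as the intercept term.
    static double hypothesis(double[] theta, double x) {
        double[] f = expand(x, theta.length - 1);
        double h = theta[0];
        for (int i = 0; i < f.length; i++) h += theta[i + 1] * f[i];
        return h;
    }

    public static void main(String[] args) {
        // theta = [1, 0, 1] encodes h(x) = 1 + x^2.
        System.out.println(hypothesis(new double[]{1, 0, 1}, 3)); // prints 10.0
    }
}
```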

You can see that the hypothesis here is definitely NOT linear.  Pretty impressive.

Here’s a much tougher function to approximate… however, the 16 higher-order features came pretty close to hitting sin(x).  I bet if I included factorial terms in the feature set, linear regression would have figured out the Taylor series that makes up sin(x).

I played around with the regularization term, lambda, in each of these plots, but surprisingly, it didn’t affect the hypothesis that much.  It might be because the test data I generate is not terribly ambiguous, so even regularized, the cost functions don’t change shape very much… they’re just translated upward slightly.

Added Logistic Regression to Skynet

Skynet has continued to progress.  I’ve now validated my Logistic Regression code.

To the left you can see some label data plotted.  The blue is labeled true and red false.

This iteration of the code improves the Sketchpad class significantly.  Arrow keys now let you move forward and backward between the different plots, and the name of the actively displayed graph appears in the upper-left corner.  Previously, the Sketchpad only drew circles, but I’ve added the ability for it to draw squares and Xs (took a page from Octave).

Okay, so how did the logistic regression do?  You can see the hypothesis plotted to the right.  It comes pretty darn close, with 98.7% correctly labeled at a modest iteration count of 1.5k.

Maybe because a large portion of the code is shared, I found that once the underlying math was solid, Logistic Regression just worked.  If I crank the iterations up 10x, 99.9% of my data is correctly labeled.  At 15x, I achieve 100%.

Obviously this is an extremely simple dataset.  It was generated by labeling anything to the right of a sloped line as true… so for more complex datasets, I’m sure 100x iterations or more will be necessary.  I definitely feel the need to crack open the Numerical Recipes book and find a more sophisticated function-minimizing algorithm.

At 100k iterations, the Gradient Descent process took noticeably longer to compute for Logistic Regression than for a comparable Linear Regression counterpart.  I’m guessing that this has to do with the more mathematically intense sigmoid function.  At some point, I’ll go through and profile the hell out of the code to see where my hotspots are.  My assumption is that my inefficient matrix math will be the leading offender… but complex functions like sigmoid or log are likely good low-hanging fruit for optimization.
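For reference, the core of batch-gradient-descent logistic regression fits on a page.  This is a self-contained sketch in the same no-third-party-libraries spirit as Skynet, not the actual Skynet code: the one-feature dataset, learning rate, and iteration count are all made up, with labels assigned by which side of x = 0 a point falls on.

```java
// Bare-bones logistic regression trained with batch gradient descent.
// theta[0] is the intercept; the gradient has the same err * x form as
// linear regression, with the sigmoid supplying the hypothesis.
public class Logistic {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // One gradient-descent update over the whole batch.
    static void step(double[] theta, double[][] xs, int[] ys, double lr) {
        double[] grad = new double[theta.length];
        for (int i = 0; i < xs.length; i++) {
            double z = theta[0];
            for (int j = 0; j < xs[i].length; j++) z += theta[j + 1] * xs[i][j];
            double err = sigmoid(z) - ys[i];
            grad[0] += err;
            for (int j = 0; j < xs[i].length; j++) grad[j + 1] += err * xs[i][j];
        }
        for (int j = 0; j < theta.length; j++) theta[j] -= lr * grad[j] / xs.length;
    }

    static int predict(double[] theta, double[] x) {
        double z = theta[0];
        for (int j = 0; j < x.length; j++) z += theta[j + 1] * x[j];
        return sigmoid(z) >= 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        // Label is 1 when the point lies to the right of x = 0.
        double[][] xs = {{-2}, {-1}, {-0.5}, {0.5}, {1}, {2}};
        int[] ys = {0, 0, 0, 1, 1, 1};
        double[] theta = new double[2];
        for (int it = 0; it < 1500; it++) step(theta, xs, ys, 0.5);
        for (int i = 0; i < xs.length; i++)
            System.out.println(predict(theta, xs[i])); // 0 0 0 1 1 1
    }
}
```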


Gradient Descent and Normalized Data

Skynet is coming along nicely. I’ve gotten the chance to build out a fairly comprehensive Linear Regression object model.

The actual learning part of Linear Regression centers around finding the minimum value of something called the cost function: a function which measures how inaccurately the algorithm guesses at the answers.  The method I’m using for finding that minimum is called Gradient Descent.  One thing that I noticed, though, was that Gradient Descent was taking FOREVER to converge to an optimal theta (theta being the set of parameters which leads to a minimal cost function)… especially if the learning rate was low.  I had no idea just how slow it could be.
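For a single-feature hypothesis h(x) = theta0 + theta1 * x, one gradient descent step is just a couple of averaged error sums.  This is a sketch with invented data and learning rate, not Skynet’s actual object model.

```java
// Gradient descent on the linear-regression cost function
// J(theta) = (1/2m) * sum((h(x) - y)^2), for a single feature x.
public class GradientDescent {
    // One update: theta := theta - lr * dJ/dtheta.
    static void step(double[] theta, double[] xs, double[] ys, double lr) {
        double g0 = 0, g1 = 0;
        int m = xs.length;
        for (int i = 0; i < m; i++) {
            double err = theta[0] + theta[1] * xs[i] - ys[i]; // h(x) - y
            g0 += err;
            g1 += err * xs[i];
        }
        theta[0] -= lr * g0 / m;
        theta[1] -= lr * g1 / m;
    }

    public static void main(String[] args) {
        double[] xs = {0, 1, 2, 3, 4};
        double[] ys = {1, 3, 5, 7, 9}; // exactly y = 2x + 1, no noise
        double[] theta = new double[2];
        for (int it = 0; it < 5000; it++) step(theta, xs, ys, 0.05);
        System.out.printf("%.3f %.3f%n", theta[0], theta[1]); // ≈ 1.000 2.000
    }
}
```

Even on this tiny noiseless dataset, a low learning rate means thousands of iterations before theta settles, which matches the slowness observed above.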

On the right, you can see a plot of my hypothesis.  The hypothesis is the function that my AI uses to guess against a synthetic data set.  The data set is very linear, and really, an ideal hypothesis would run right down the center of the line.

In this case, Gradient Descent did not converge in time and it hit its maximum iterations before it could reach a reasonable value of theta.

Normally my options would be to increase the learning rate or add more iterations.  I wanted to see, however, what data normalization would do.

Normalizing data takes each data point, subtracts the mean of the data set, and divides by the standard deviation (you can optionally divide by the range instead).  This gives you a dataset that’s centered around y=0 and has a much smaller range of y values.  So how does this change whether or not Gradient Descent converges?  Let’s see…
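That transform is only a few lines.  The one design point worth baking in (a sketch, in my own naming) is keeping mu and sigma around, since anything you later run the learned thetas against must be normalized with the same values.

```java
// Z-score normalization: subtract the dataset mean, divide by the
// standard deviation. The saved mu and sigma must be reused on any
// future data fed to a model trained on normalized values.
public class Normalizer {
    final double mu, sigma;

    Normalizer(double[] data) {
        double sum = 0;
        for (double d : data) sum += d;
        mu = sum / data.length;
        double sq = 0;
        for (double d : data) sq += (d - mu) * (d - mu);
        sigma = Math.sqrt(sq / data.length);
    }

    double normalize(double x) { return (x - mu) / sigma; }

    double[] normalizeAll(double[] data) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) out[i] = normalize(data[i]);
        return out;
    }

    public static void main(String[] args) {
        Normalizer n = new Normalizer(new double[]{2, 4, 6, 8});
        // Centered on the mean (5), scaled by the standard deviation.
        System.out.println(java.util.Arrays.toString(
            n.normalizeAll(new double[]{2, 4, 6, 8})));
    }
}
```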

Here, you see the convergence rates of Gradient Descent.  The red values represent un-normalized data and the blue represents the same dataset, but normalized.

I should have cut off the first few data points because they’re total outliers, but you can see that the red line goes all the way off to the right, which means it hit its maximum number of iterations.  The blue line, however, stops short: it converged before maxing out its iterations.

How does the normalized hypothesis look plotted against the data? Take a look!

You can see that the hypothesis for the normalized data looks far more correct.

One thing to note before you dive into normalized data: whatever dataset you use your learned thetas on must also be normalized… otherwise you’ll be feeding your Linear Regression algorithm apples when it learned on oranges.  That’s very likely not going to produce the desired results.

Applying Machine Learning

I’ve successfully completed Andrew Ng’s Machine Learning class. I feel pretty spiffy that I can now say that I’ve implemented a Neural Network… though right now it’s just in Octave.

In his concluding video, Andrew asked that we go forth into the world and do cool things with machine learning… and I plan on doing exactly that.  I can think of several interesting things to build in the context of my work at Topsy Labs… but it’s unlikely that any work stuff will filter out in a way the public can see directly… so I plan on creating a few personal projects around machine learning.

The first thing I want to do is go through the course material and actually implement most of it in Java.  For that purpose, I’ve checked the start of said project into GitHub as part of my incubator repo.  I’ve affectionately named the sub-project “SkyNET” after the largely misunderstood AI from the Terminator franchise.  The code itself won’t have specific spoilers from the exercises… that’d be bad form… but it will have brute-force implementations of the algorithms taught in the course.

I think that by actually implementing things from scratch, I’ll get a better handle on how and why some of this stuff works.  For this initial version of Skynet, I am not going to allow myself the luxury of third-party libraries.  The point isn’t to create a well-optimized machine learning system… it’s a learning exercise.  Part two of Skynet will be to re-implement the whole thing in C++, this time incorporating a bunch of well-optimized libraries like Boost.

For those of you who want to trawl through my Java code, here’s the direct link to Skynet.