When is Math.abs(x) < 0?

I was monitoring a long running Hadoop job yesterday at work when I began noticing one of the tasks which make up that job began failing for an odd reason: Illegal Partition.

For those of you not familiar with Hadoop jobs, part of it involves splitting up massive amounts of data into smaller, more manageable pieces; or partitioning the data.  The goal of partitioning is to get N roughly equal chunks where N is generally pretty big.

The specific illegal partition value that was coming up was negative. “But that’s not possible!” I thought to myself! The expression I was using to choose partitions used Math.abs(x), or the Absolute value of x.  Specifically:

partition = Math.abs(x) % N

Ok, now at a glance, you tell me how that could be negative… It turns out though, that for at least a critical value of x, Math.abs(x) will return negative: Integer.MIN_VALUE.

As it turns out, Integer.MIN_VALUE is the smallest value that a 32-bit number can hold in java: -2,147,483,648. and if you try to Math.abs(Integer.MIN_VALUE), the result is STILL Integer.MIN_VALUE.  Is this a bug? It turns out that it’s not.  If you read the javadoc for Math.abs(), it points out that:

“Note that if the argument is equal to the value of Long.MIN_VALUE, the most negative representable long value, the result is that same value, which is negative.”

Now, exactly why is that the case?  Well, for you to get the absolute value of Integer.MIN_VALUE, you’ve got to be able to represent the result as a 32-bit integer too… which is impossible, as the largest integer you could possibly represent is 2,147,483,647 which is one smaller than we need it to be.  Maybe this was coded so that the Integer.MIN_VALUE would be preserved?  I decided to look at the source.

So … here’s the implementation:

public static int abs(int a) {
    return (a < 0) ? -a : a;

Okay… that’s a much simpler than what I thought it’d be, however it does point out that it’s not just Math.abs(x) that has this problem… it’s any negation of Integer.MIN_VALUE.  Anotherwords, the following must be a true expression:

Integer.MIN_VALUE == -Integer.MIN_VALUE

Give it a shot if you don’t believe me, fire up eclipse and try it out!  In fact, eclipse will highlight the expression with a warning that says “Comparison of identical expressions”.

Okay… I’m not sure how i feel about this.  so WWCD? (What would C do?) … so I gave it a shot… Here’s a C program that exercises this exact question:

int main(int argc, char *argv[]) {
    int x = -2147483648;
    int y = -2147483647;
    printf("X: %d %d\n", x, abs(x));
    printf("Y: %d %d\n", y, abs(y));

What do you think the answer is?  It actually surprised me a little:

X: -2147483648 -2147483648
Y: -2147483647 2147483647

It seems like C behaves the same way.  Apparently this isn’t just a Java problem.  Bottom line? watch and test around -2,147,483,648. It’s a critical value.

Evil Robots – It’s all in the LEDs

By picking up machine learning and micro-controllers, I’ve found myself at a juncture.  A natural spot where these two hobbies naturally coincide.  Robots.

I can honestly say that I’ve never really thought much about Asimov or his three laws of Robotics until this point.  So… what are these three laws? and what the hell do they do?

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.

They make perfect sense right? These laws make it clear that robots shouldn’t harm humans.  The thing is, though… that a huge body of sci-fi exists which totally disregard putting the three laws.  Seriously.  I bet you could name at least three evil robots.  Why didn’t their creators program in the 3 laws?  Well, primarily, it’d make for a pretty boring plot… but I actually think that in real life, we won’t see robots programmed with the three laws because well… these laws are hard to code! I mean, how do you even begin to go about coding them?  I have a more obvious and far easier way of preventing evil robots from taking over — DON’T USE RED LEDS!

Let’s take a look at some well known evil robots and let’s see if we can identify a common characteristic. Ok?

Probably the first evil AI that comes to mind is the HAL 9000.  This lovable computer from Arthur C. Clarke’s 2001 attempts to kill the entire crew of the spacecraft, Discovery.  It would have succeeded too, if it wasn’t for a meddling kid named Dave Bowman.

I’m not a huge Kubrick fan, but I must admit that he did an excellent job of making HAL seem very creepy.  You’ll note that the camera mysteriously has a bright red LED glowing right smack in the middle.  I’m not really sure how a camera can operate with light bring projected through its aperture, but meh. It’s the future… oh wait… it should have happened 11 years ago… but that’s a tangent. Bottom line. Evil AI. Red LED.


The second Robot AI I’d like to point out came into being about a decade after 2001: A Space Odyssey, the Cylon.  You will notice the scanning visual thingy on this robot? Yes. Red LED. It marks an entire race of sentient AIs whose entire purpose is to eradicate humanity from the cosmos.  Okay. I know what you’re thinking… didn’t KITT also have a Red LED sensor? KITT wasn’t evil! Hah.  The original version of that AI was KARR which WAS evil. KITT is just pretending to be good to piss off it’s evil twin.

Skipping forward to the 80s, we come across The Terminator. This robot is doubly evil.  It’s totally got not one but TWO red LEDs.  The Terminator is the progeny of an AI who tricked the superpowers of earth into engaging in nuclear war.  That’s not just evil… it’s totally passive aggressive too.  I’m pretty sure that SkyNET had red LEDs all over it.

Need more convincing? Okay. Let’s go.

Here are the “Squiddies” from the Matrix.  These AIs are all about rending humans limb from limb. They are basically ruthless killing machines.  Note the sheer number of red lights. I rest my case.


So like, what the hell? Why use Red LEDs at all?  It probably has to do with power. I mean, power corrupts right?  Let’s take a look at one more example.  This should hammer home the corruptive power of the red LED.

Otto Octavius, a mild mannered scientist creates an 8-limbed exoskeleton that’ll revolutionize the world.  He uses red LEDs for their power… but cancels it out by building a rather fragile looking anti-evil circuit into the suit.  Learn from Otto’s mistake. If you make yourself some kinda futuristic, armored, super-powered exoskeleton? DON’T MAKE THE ANTI-EVIL CIRCUIT THE MOST FRAGILE PART!

So okay.  What happens if we don’t use red LEDs? Are there any example of powerful robots who don’t have them? Maybe we *NEED* to use red LEDs… I’ll leave you with one final image.  You come to your own conclusions.


Approximating complex functions using Linear Regression

So far I’ve only been validating my Linear Regression algorithm using actual linear data.

Something that was pointed out in Andrew Ng‘s class was.. though linear regression is inherently… well… linear, the dataset that you try to learn against need not be data scattered around a straight line.  It all depends on what features you use.

I was curious to see what kind of functions I could have skynet approximate… so I chose two.  e^x and sine(x).  You can see in the picture above, I generated a largely scattered dataset around a small section of e^x. The blue line running through the center are the values generated by my hypothesis.  The features used in this case were created by raising x to higher powers (in this case, up to x^15)… anotherwords, my features were: x, x^2, x^3 … x^15.

You can see that the hypothesis here is definitely NOT linear.  Pretty impressive.

Here’s a much tougher function to approximate… however the 16 higher order features came pretty close to htiting sin(x).  I bet if I included factorial terms in the feature set, linear regression would have figured out the taylor series that makes up sin(x).

I played around with the regularization term, lambda in each of these plots, but surprisingly, it didn’t affect the hypothesis that much.  It might be because the test data that I generate is not terribly ambiguous, so even regularized, the cost functions don’t change shape very much… they’re just translated upward slightly.

Added Logistic Regression to Skynet

Skynet has continued to progress.  I’ve now validated my Logistic Regression code.

To the left you can see some label data plotted.  The blue is labeled true and red false.

This iteration of code improves the Sketchpad class significantly.  Arrow keys now allow you to move forward and backward between the different plots and the name of the actively displayed graph is displayed in the upper left corner.  Previously, the Sketchpad only drew circles, but I’ve added the ability for it to draw squares and Xs (took a page from Octave).

Okay, so how did the logistic regression do?  You can see to the right the hypothesis plotted.  It comes pretty darn close with 98.7% correctly labeled with a modest iteration count of 1.5k

Maybe it was because a large portion of the code is shared, I found that once the underlying math was solid, Logistic Regression just worked.   If I crank up the iterations to 10x, I get a 99.9% of my data correctly labeled.  At 15x, I achieve 100%.

Obviously this is an extremely simple dataset.  It was generated by labeling anything to the right of a slope to be true… so for more complex datasets, I’m sure 100x iterations or more will be necessary.  I definitely feel the need to crack open the Numerical Recipes book and find more sophisticated function minimizing algorithm.

At 100k iterations, the Gradient Descent process took a noticeably longer time to compute for Logistic Regression than a comparable Linear Regression counterpart.  I’m guessing that this has to do with a more mathematically intense Sigmoid function.   At some point, I’ll go through and profile the hell out of the code to see where my hotspots are.   My assumption is that my inefficient Matrix math will be the leading offender.. but complex math like Sigmoid or Log are likely good low hanging fruit for optimization.


8×8 LED Matrix animation

It’s been a little while since I messed with my Arduino… so here we go.

This time around, I hooked up with an 8×8 LED matrix. Essentially, it’s an array of 64 LEDs addressable via 16 pins.

But wait, you ask… how can you address 64 pins with only 16 pins?? That’s an excellent question!  The answer is… not very well.  It turns out that you can only independently address one LED at a time with this system… well, really only one row at a time… otherwise you get unexpected LEDs lighting up.

The answer to this problem is essentially rastering.  Essentially this means that you display one row at a time.  You just make sure that the rate at which you display rows happens fast enough that the observer can’t see that you’re only ever have a single row illuminated.

I chose a really boring animation here.  It’s just an inverting rectangular region inverting itself.  This project was taken almost verbatim from project 19 in Michael McRoberts’s Beginning Arduino.

To the right, you see an animation of my LED Matrix. Kinda distracting, huh… sorry.

A bit later, I’ll use this LED matrix as a means of outputting an accelerometer.  I’d like to think that I can come up with some cool looking UI in 64 LEDs.

For the curious, here’s the code listing for this project:

#include <TimerOne.h>

const int gLatchPin = 11;
const int gClockPin = 13;
const int gDataPin = 12;

const int pLatchPin = 8;
const int pClockPin = 10;
const int pDataPin = 9;

int n = 0;

int icons[2][8] = {

void setup() {
    pinMode(pLatchPin, OUTPUT); 
    pinMode(pClockPin, OUTPUT); 
    pinMode(pDataPin, OUTPUT); 

    pinMode(gLatchPin, OUTPUT); 
    pinMode(gClockPin, OUTPUT); 
    pinMode(gDataPin, OUTPUT);

    registerWrite(gLatchPin, gClockPin, gDataPin, 0B00000000);

void matrixUpdate() {
    for(int i = 0; i < 8; i++) {
         registerWrite(pLatchPin, pClockPin, pDataPin, 1 << i);
         registerWrite(gLatchPin, gClockPin, gDataPin, icons[n][i]);

void loop() {
    n = 1 - n;

void registerWrite(int latchPin, int clockPin, int dataPin, byte data) {
     digitalWrite(latchPin, LOW);
     shiftOut(dataPin, clockPin, MSBFIRST, data);
     digitalWrite(latchPin, HIGH);

You can see from the code listing, (and probably from the image as well) that I ended up using two 8-bit latches.  I could have used less pins and cascaded the latches into a single 16-pin logical latch… but I didn’t out of lazyness.

The other thing of note in this code listing is the TimeOne library.  I had to grab it from the google code project. I was pretty surprised that you could schedule an interrupt driven timer at microsecond granularity.  Pretty amazing.

Getting back to basics – blender rtfm

A while back, I picked up the book Introducing Character Animation with Blender from blender3d.org.  I primarily bought it to support the blender foundation because I love using the software so much… but I finally got around to cracking it open.

Most of my blender knowledge comes from watching youtube videos.  This approach is good because it provided me with a lot of practical modeling skills, but it doesn’t delve much into the intention of the features built into blender.  I think that actually going through and reading the manual will be useful for me at this point.

I think I’ll combine this getting back to basics with the actual character design of “The Weirdo” for my video game project… kill two birds with one stone so to speak.


Gradient Descent and Normalized Data

Skynet is coming along nicely. I’ve gotten the chance to build out a fairly comprehensive Linear Regression object model.

The actual learning part of Linear Regression centers around finding the minimum value of something called the Cost Function; a function which measures how inaccurate the algorithm guesses at the answers.  The method that I’m using for finding the minimum value for the cost function is called Gradient Descent.  One thing that I noticed, though, was that Gradient Descent was taking FOREVER to converge to an optimal theta… (theta is the set of parameters which lead to a minimal cost function)  especially if the learning rate was low.  I had no idea just how slow it could be.

On the right, you can see a plot of my Hypothesis.  The hypothesis is the function that my AI is using to guess against a synthetic data set. The data set is very linear and really, an ideal hypothesis would run right down the center of the line.

In this case, Gradient Descent did not converge in time and it hit its maximum iterations before it could reach a reasonable value of theta.

Normally my options are I could increase my learning rate or add more iterations.   I wanted to see, however what using data normalization would do.

Normalizing data takes each data point, subtracts the mean of the data set and divides by the standard deviation (you can optionally divide by the range).   This gives you a dataset that’s centered around y=0 and has a much smaller range of y values.   So how does this change whether or not Gradient Descent converges? Let’s see…

Here, you see the convergence rates of Gradient Descent.  The red values represent un-normalized data and the blue represents the same dataset, but normalized.

I should have cut off the first few data points because they’re totally outliers, but you can see that the red line goes all the way off to the right which means that it hit its maximum number of iterations.  The blue line, however stops short.  It converged before maxing out its iterations.

How does the normalized hypothesis look plotted against the data? Take a look!

You can see that the hypothesis for the normalized data looks far more correct.

One thing to note before you dive into normalized data… whatever dataset you use your learned thetas on must also be normalized… otherwise you’ll be feeding your Linear Regression algorithm apples when it learned on oranges.  That’s very likely not going to produce desired results.

Applying Machine Learning

I’ve successfully completed Andrew Ng’s Machine Learning class. I feel pretty spiffy that I can now say that I’ve implemented a Neural Network… though right now it’s just in Octave.

In his concluding video, Andrew asked that we go forth into the world and do cool things with machine learning… and I plan on doing exactly that. I can think of several interesting things that I can build in the context of my work at Topsy Labs… but it’s unlikely that any work context stuff will filter out in a way that the public can see directly.. so I plan on creating a few personal projects around machine learning.

The first thing that I want to do is go through the course material and actually implement most of it in Java.  For that purpose, I’ve checked in the start of said project into github as part of my incubator repo. I’ve affectionately named the sub-project “SkyNET” after the largely misunderstood AI from the Terminator Franchise.  The code itself won’t have specific spoilers from the exercises… that’d be bad form… but they will have brute force implementations of the algorithms taught in the course.

I think that actually implementing things from scratch, I’ll get a better handle of how and why some of this stuff works.  For this initial version of Skynet, I am not going to allow myself the luxury of third party libraries.  The point isn’t to create a well optimized machine learning system… It’s a learning exercise.  Part two of Skynet will be to re-implement the whole thing again in C++.  This time I’ll incorporate a bunch of well optimized libraries like Boost.

Those of you who want to troll through my Java code… here’s the direct link to Skynet.

Modeling Die Rolls

Modeling out die rolls seems pretty straight forward.  Let’s say that you wanted to figure out the distribution of sums. The brute force method for doing this looks like the following chunk of perl code:

sub dist {
    my ($count, $sides) = @_;
    my $state = [];
    for (my $i = 0; $i < $count; $i++) {
        $state->[$i] = 1;

    my $max = $count * $sides;
    my $result = [];
    while(1) {
       my $sum = 0;
       for (my $i = 0; $i < $count; $i++) {
          $sum += $state->[$i];

       last if $sum >= $max;
       for (my $i = 0; $i < $count; $i++) { 
           last if $state->[$i] [$i] = 1;
   return $result;

This seems to work well for distributions of same sided dice. What if I wanted to combine multiple types of dice? Say, 4x six sided dice and 8 4-sided dice… well, the brute force method starts to get pretty expensive.

I was reading this lecture on normal distributions for sums and came to the solution to my problem.  Instead of iterating over all sums for all dice, I could probably just add the normal curves for the die rolls.

How do you compute a normal distribution? It’s called a Gausian… check the Wikipedia page!  The code for a gausian given two params, mu and sigma look like so:

sub gausian {
    my ($x, $mu, $sigma) = @_;

    my $exp = - 1/2 * (($x - $mu) / $sigma) ** 2;
    return 1/($sigma * sqrt(2 * $pi)) * exp($exp);

Now I just need to figure out how to scale mu and sigma (mean, std deviation) to number of dice and the sides per die.  The lecture above seems to model 6 sided dice well with a mean of 3.5 x number of dice and 1.7 * sqrt(num of dice) but I need to expand that out to other side counts for this to work.

Simple IC Blender Model

It’s been a little while since I did stuff in Blender.  Here, I made a 16-pin IC.  I tried to make the model fairly generic so that I can easily do larger or smaller chips. If you want to check out the model, I have it checked into github: