Our quest to build artificially intelligent robots faces five challenges. The first, which we covered yesterday, was seeing. But let's say you solve that problem, and a robot can recognize everything in the refrigerator. We can't build anything like that now, but let's just imagine we can. The robot still wouldn't understand anything, because it cannot contextualize.
If you were driving through town and saw a soccer ball bouncing into the street, a child running out to get it, and a woman frantically waving the child back, you would instantly know exactly what was going on. To a computer, though, that's just a bunch of pixels changing color. Really, it's just a bunch of 1s and 0s changing.
Just think for a minute how easy it is for a human to figure out what's going on in a photo. You could say: that's a conga line, this one is people hiding before a surprise party, this is a prom photo taken by a parent, and this one is a piano recital, a school play, a christening, and so forth. Every one of those is easy for us because we have the cultural context to decipher it.
Now in theory, you can train a computer to do all of that. If you show it enough conga lines, it will get really good at spotting conga lines. However, that just brings us to our third challenge.
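Before moving on, the "show it enough examples" recipe above can be sketched in miniature. This is a toy illustration, not a real vision system: the "images" are synthetic 64-pixel vectors where a made-up "conga line" pattern is a bright diagonal band, and the "training" is just averaging the labeled examples into a template. Real conga-line spotting would involve deep networks trained on actual photos; the point here is only that pattern-spotting improves with examples while understanding never enters the picture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "images": 64-pixel vectors (an 8x8 grid, flattened).
# "Conga line" images get a brighter diagonal band; everything else
# is plain noise. (Toy data, purely illustrative.)
def make_image(is_conga: bool) -> np.ndarray:
    img = rng.normal(0.0, 1.0, 64)
    if is_conga:
        img[::9] += 3.0  # the diagonal of the 8x8 grid
    return img

train_labels = rng.integers(0, 2, 500).astype(bool)
train_images = np.stack([make_image(y) for y in train_labels])

# "Training": average the conga examples into a template. The more
# examples we show it, the cleaner the template gets.
template = train_images[train_labels].mean(axis=0)

def looks_like_conga(img: np.ndarray) -> bool:
    # Score a new image by how strongly it matches the template.
    return img @ template > 0.5 * (template @ template)

# The model now spots the one pattern it was shown -- nothing more.
test_labels = rng.integers(0, 2, 200).astype(bool)
test_images = np.stack([make_image(y) for y in test_labels])
accuracy = np.mean([looks_like_conga(x) == y
                    for x, y in zip(test_images, test_labels)])
print(f"accuracy: {accuracy:.0%}")
```

Notice what the sketch does not do: the "model" has no idea what a conga line is, why people form one, or what any of it means. It just matches pixels, which is exactly the gap between spotting and understanding.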