Summary:

One of the big themes at our Structure Data conference in March is the advent of new techniques to make sense of new data sources. One of the most promising is video, which has value well beyond capturing crimes and making us laugh on YouTube.

Photo: Prism Skylabs

For most of its time as an IT industry buzzword, big data has been focused on numbers and letters. Sales numbers, medical results, weather, sensor readings, tweets, news articles — all very different, but also all relatively low-hanging fruit. Now, however, it looks like video is emerging as the next great source for companies to learn about consumers, and for everyone to learn about the world around them.

Thanks to surveillance cameras, GoPros, Dropcams, cell phones and even old-fashioned camcorders, we’re able to record video at unprecedented scale. YouTube sees 100 hours of new content added every minute. But video has been something of an informational wasteland. There’s plenty of information embedded in all those frames, but without accurate tags or someone willing to watch all that footage, it might as well have been uploaded into a black hole.

Who’s in them, what’s going on in them and where are they shot? Who knows.

Taking computer vision to the next level

Lately, though, techniques such as deep learning and other varieties of machine learning have led to impressive advances in areas such as computer vision, speech recognition and language understanding. The companies doing this research, largely at places like Google, Microsoft and now Yahoo, are already using the technology in production on things like voice commands on gaming consoles and cell phones, and on recognizing images in online galleries in order to label and categorize them.

It’s not too big a step to turn these techniques toward video. Researchers at the University of Texas are already using object recognition to create short summaries of long videos so people can know what they’re about without having to rely on titles alone. According to AlchemyAPI Founder and CEO Elliot Turner (who’ll be speaking at Structure Data in March about the promise of delivering artificial intelligence capabilities via API), video is in some ways actually easier to work with than images because the temporal nature of the frames adds context that can help self-learning systems understand what’s happening.
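It’s not hard to sketch the shape of such a summarization pipeline. Here is a minimal sketch, assuming OpenCV is installed: keep only frames whose color histogram differs enough from the last kept frame. (The UT research relies on learned object detectors; this histogram heuristic is just a stand-in to illustrate the keyframe idea.)

```python
# Hypothetical sketch: keyframe-style video summarization.
# The 0.4 threshold is illustrative; real systems use learned object detectors.
import cv2

def summarize(video_path, threshold=0.4):
    """Keep frames whose color histogram differs enough from the last keyframe."""
    cap = cv2.VideoCapture(video_path)
    keyframes, last_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        # Bhattacharyya distance: 0 means identical, 1 means very different
        if last_hist is None or cv2.compareHist(
                last_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            keyframes.append(frame)
            last_hist = hist
    cap.release()
    return keyframes
```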

From the UT research, showing how algorithms calculate the key objects in each frame. Source: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

But video data has even more utility than just helping web companies like Google or Facebook understand what’s happening in YouTube or Instagram videos. It’s also a window into our physical world like no other type of data before.

Retail is ground zero for video analytics

Retailers, and the companies targeting their business, have been particularly quick to catch on to this. Already, they’re using video analysis to figure out when stores are busiest and where people are walking, stopping and looking. Some are even using eye-level cameras to identify which items people are looking at on fully stocked shelves. Facial recognition software is helping stores assess shoppers’ age, sex and race in order to target ads and provide accurate data about consumer demographics.
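None of these vendors publish their pipelines, but the basic “where do people walk and linger” measurement can be approximated with off-the-shelf tools. A minimal sketch, assuming OpenCV and numpy: background subtraction marks the moving foreground (shoppers), which is accumulated into a per-pixel heatmap of the camera’s view.

```python
# Hypothetical sketch: a foot-traffic heatmap from a fixed store camera.
# Real systems add person tracking, camera calibration and floor-plan mapping.
import cv2
import numpy as np

def traffic_heatmap(video_path):
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
    heatmap = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)      # foreground = moving shoppers
        if heatmap is None:
            heatmap = np.zeros(mask.shape, dtype=np.float64)
        heatmap += (mask > 0)               # accumulate presence per pixel
    cap.release()
    # Scale to 0-255 for display; bright areas = heavy traffic or long dwell
    return cv2.normalize(heatmap, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```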

Steve Russell, founder and CEO of a video analytics company called Prism Skylabs — and who’ll be discussing the role of video as a new source of business intelligence at Structure Data — said the goal is partially to give brick-and-mortar retailers the type of information that e-retailers already get about what people look at but don’t buy. Having that type of information can help retailers get a better sense of what inventory they should carry and even where they should put it in the store.

“Imagine if all Amazon knew is how much stuff they sold?” he asked during an interview back in November.

Steve Russell

It’s not just about tracking customers’ activity, though. Russell explained that Prism Skylabs’ technology actually uses advanced computer vision techniques (he calls them “super-resolution algorithms”) to take people out of the picture and give its users a clear view of an empty store, even if they’re using low-resolution cameras whose footage often looks grainy without processing. This helps with privacy concerns, but it also gives retailers and their merchandisers real-time views into stores to make sure they’re clean, shelves are stocked and other protocols are being followed.
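Russell doesn’t detail Prism’s algorithms, but one textbook way to produce an “empty store” image from a fixed camera is a temporal median: over enough frames, each pixel shows the static background more often than any passing shopper, so a per-pixel median makes the people vanish. A minimal sketch of that idea (not necessarily Prism’s actual method), assuming OpenCV and numpy:

```python
# Hypothetical sketch: an "empty store" image via a per-pixel temporal median.
# A textbook background-estimation trick, not Prism Skylabs' actual method.
import cv2
import numpy as np

def empty_store(video_path, n_samples=50):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    # Sample frames spread across the whole video so shoppers keep moving
    for idx in np.linspace(0, total - 1, n_samples).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    # Each pixel's median over time is the static background; people vanish
    return np.median(np.stack(frames), axis=0).astype(np.uint8)
```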

“All of those are questions that a merchandiser will have to travel to a store with a clipboard to answer, and it’s incredibly expensive,” Russell said.

Source: Prism Skylabs

And given the low costs of cameras now, and the fact that services like Prism are delivered via the cloud, companies can get as granular as they want with their video analysis without having to worry about breaking the bank on cameras or servers and software to store and process the data. Video provides what Russell calls a “vast array of useful tidbits,” and it clearly has potential beyond the retail space. But, he said, “We don’t know of all the things we can do potentially with this very interesting type of data.”

One thing he does know, though: “The core problems of computer vision have largely been solved in the past few years.” Today, Russell added, if you have access to a training set of images and an established truth of what they are and mean, “You can train a computer to do amazing things.”
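That recipe is just standard supervised learning: images in, labels (the “established truth”) in, classifier out. A minimal sketch, using scikit-learn’s bundled digits dataset as a stand-in for a real labeled image set:

```python
# Hypothetical sketch: train an image classifier from a labeled training set.
# Uses scikit-learn's small digits dataset as a stand-in for real imagery.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

pixels, labels = load_digits(return_X_y=True)   # images + "established truth"
X_train, X_test, y_train, y_test = train_test_split(
    pixels, labels, test_size=0.25, random_state=0)

clf = SVC(gamma=0.001).fit(X_train, y_train)    # learn from labeled examples
print(f"accuracy on unseen images: {clf.score(X_test, y_test):.2f}")
```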

  1. I can’t wait to see where video processing will be in a couple years! The applications are mind-boggling.

    Here is some research I’ve been following that shows what people are working on (in addition to the examples you list):

    1. There’s the work of Cees Snoek and his group, on representing videos and images by sentences:

    http://www.ceessnoek.info/index.php/mm13-video2sentence-and-vice-versa/

    http://www.ceessnoek.info/index.php/mm13-querying-for-video-events-by-semantic-signatures-from-few-examples/

    2. Richard Socher et al have something similar:

    http://nlp.stanford.edu/~socherr/SocherLeManningNg_nipsDeepWorkshop2013.pdf

    3. And here is a paper that uses methods quite different from those above:

    http://arxiv.org/abs/1308.6628

    There are at least a dozen more approaches than the ones I have listed here.

    4. Have a look at the first 5 minutes of this talk by Ruslan Salakhutdinov:

    http://techtalks.tv/talks/recent-applications-of-deep-boltzmann-machines/58082/

    Not only can neural nets be used to classify images and videos, but they can even GENERATE them, based on a text description — imagine inputting “Cow runs up hill,” and then seeing a short video clip of a Holstein running up a grassy knoll, a kind of “machine dream”. Maybe that will be possible in the near future.

    5. Using something called a “denotational graph”, images and their annotations can be used to help with NLP tasks, like entailment:

    http://nlp.cs.illinois.edu/HockenmaierGroup/Papers/DenotationGraph.pdf

    And, of course, there is the NEIL project out of Carnegie Mellon, which is mining images for certain types of common-sense knowledge; videos should contain even more common-sense knowledge that can be mined with the right algorithms.

    6. Emotion recognition of video using a symphony of neural net models:

    http://dl.acm.org/citation.cfm?id=2531745

    7. There’s also the work of Tomaso Poggio, who has a whole theory about how parts of the brain process images, based on applications of the Johnson-Lindenstrauss Theorem and learned convolutions/inner-products. He and his group have some really interesting results.

    1. Michael MacMillan Sunday, February 2, 2014

      These have been around for a while. What Prism Skylabs does, other than super-resolution, is not new; we have been implementing the same thing for retailers since 2005 (http://www.visualize.net), and there is already an app on Apple, Wordeo, doing text-to-video, so it is not just retailers delivering this type of application in the real world.

  2. Derrick Harris Friday, January 24, 2014

    Best. Comment. Ever. Seriously informative. Thank you.

  3. I would rather see them use it to serve ads for the videos we watch. We need to get to free media, and smart product placement combined with this would make that a lot easier. So IMO YouTube and Netflix should be at the forefront of this. I guess for Google it’s more than just that; offering information about everything in a video would be their business. It’s also something we need from glasses, so pretty much all the OS makers should be chasing it.

  4. Very interesting article. My takeaway is that video will be increasingly added to the mix of multimodal Big Data to solve some interesting problems.

    One thing I would take issue with is the ending point about the core problems of computer vision being solved with training sets of images. Simply focusing on one medium/modality at a time (in spite of continuing improvements in computational techniques) has brought little value. The reason is in the article itself: “video is in some ways actually easier to work with than images because the temporal nature of the frames adds context that can help self-learning systems understand what’s happening.”

    Essentially, coreferencing and synergistically using multiple modalities (just as the human brain processes multimodal inputs from multiple senses for highly effective situational awareness, with the ability to focus deeper insight on issues of higher interest/importance), while utilizing prior knowledge (just as humans apply past knowledge and experience), is what will bring the biggest dividends.

    Just consider the value of exploiting locational information. This can help identify and leverage:

    (a) an a priori built model for partial context (e.g., if you know the clip relates to a kitchen, a model of a kitchen could tell you which appliances are likely to be present, and their probable proximity/locations, to improve recognition; that can also allow tapping into a database of equipment, and perhaps pretrained classifiers, that help identify them better);

    (b) relevant background knowledge (what entities are known to be at or near that location? what have citizen and machine sensors identified as present at or near that location?) from open data (for example, but not limited to, Wikipedia) to provide more context, which can then lead to more focused processing for better extraction and situational awareness.

    Furthermore, if objects/events of application/human interest (e.g., the gathering of a crowd, the close proximity of police and protesters, the existence of a fire amongst a crowd) are predefined, then analysis can be made even more effective.

    In summary, when video is added to a number of other modalities plus prior/background knowledge, it will become a very effective new form of information (as opposed to building applications limited to just more sophisticated exploitation of video/images/pixels).
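
    A minimal sketch of the location-prior idea in (a), with entirely made-up numbers: raw recognition scores are re-weighted by how likely each object is given the known scene, so that context sharpens recognition.

    ```python
    # Hypothetical sketch: re-rank object-recognition scores with a location prior.
    # Both score tables are illustrative made-up numbers, not real model outputs.

    classifier_scores = {"toaster": 0.30, "printer": 0.35, "kettle": 0.35}  # raw vision scores
    kitchen_prior     = {"toaster": 0.45, "printer": 0.05, "kettle": 0.50}  # P(object | kitchen)

    # Bayesian-style fusion: posterior is proportional to likelihood times prior.
    fused = {obj: classifier_scores[obj] * kitchen_prior[obj] for obj in classifier_scores}
    total = sum(fused.values())
    fused = {obj: round(score / total, 3) for obj, score in fused.items()}

    print(fused)  # "printer" drops sharply once the kitchen context is applied
    ```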

  5. Venkitachalam v Sunday, January 26, 2014

    Derrick, great article. I agree that AV tools and video content analysis are going to be major tools for data analysis in the future. In the beginning, IP CCTV systems got attention as a security measure; now they have become a business-improvement solution, and video data is used for business intelligence. The video analysis software has an important role in getting the best results: its scalability, its ability to integrate with third-party video management systems and its ability to generate consolidated reports are key factors. As of today, the technology is not used to its full potential. It is the next big thing.

