
Summary:

Yahoo has released a massive dataset for researchers to experiment on. The dataset includes URLs for nearly 100 million images and 700,000 videos, as well as their metadata. Soon, a larger supercomputer-processed dataset that includes audio and visual features will be available.

In case you missed this like I did (I blame a Structure hangover), Yahoo last week released a huge dataset of Flickr images and videos. Called the Yahoo Flickr Creative Commons 100 Million, it contains 99.3 million photos, 700,000 videos and the associated metadata (title, camera type, description, tags) for each. About 49 million of the photos are geotagged and, Yahoo says, comments, likes and social data for each are available via the Flickr API.
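
The social data isn't bundled in the download itself; per Yahoo, it comes from the Flickr API. For a rough sense of what that looks like, here is a minimal Python sketch that pulls a photo's comments using Flickr's documented `flickr.photos.comments.getList` method. The API key is a placeholder, and the photo ID is just the image credited in this post, used as an example:

```python
# A minimal sketch, assuming a valid Flickr API key. The dataset ships URLs
# and metadata only; social data comes from API calls like this one.
import requests

API_KEY = "YOUR_FLICKR_API_KEY"  # placeholder -- request a key from Flickr
PHOTO_ID = "14444554781"         # the photo credited in this post, as an example

def flickr_call(method, **params):
    """Issue a Flickr REST call and return the parsed JSON response."""
    params.update(method=method, api_key=API_KEY,
                  format="json", nojsoncallback=1)
    resp = requests.get("https://api.flickr.com/services/rest/",
                        params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

# flickr.photos.comments.getList is Flickr's documented method for comments.
data = flickr_call("flickr.photos.comments.getList", photo_id=PHOTO_ID)
for comment in data.get("comments", {}).get("comment", []):
    print(comment["authorname"], "->", comment["_content"])
```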

Needless to say, this is a pretty impressive resource for anyone wanting to analyze images for the sake of learning something or just to train some new computer vision algorithms. We have been covering the rise of new artificial intelligence algorithms and techniques for years, most of which have benefited from access to huge amounts of online images, video and related content from which to derive context. Often, though, researchers or companies not in possession of the content (that is, pretty much everyone but Google, Facebook, Microsoft and Yahoo) have had to scrape or otherwise gather this data manually.

That being said, Google and Yahoo, in particular, have been pretty good about releasing various large datasets, usually textual data useful for training natural-language processing models.

Just a taste of what the dataset looks like. Source: Flickr user David Shamma (https://www.flickr.com/photos/ayman/14444554781)

To test out just one possible function of the new image dataset, Yahoo is hosting a contest to build the system best capable of identifying where a photo or video was taken without using geographic coordinates. The training set for the contest includes 5 million photos and 25,000 videos.
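
Yahoo hasn't published a reference approach for the contest, but a naive baseline for this kind of placing task would lean on textual metadata: average the coordinates of geotagged training photos that share a tag with the query. The sketch below is purely illustrative; the field names and data layout are assumptions, not the contest's actual format.

```python
# A toy tag-based geolocation baseline. This is NOT the contest's method,
# just an illustration of the task: guess coordinates without geo metadata.
from collections import defaultdict

def build_tag_index(training_photos):
    """Map each tag to the (lat, lon) pairs of geotagged photos carrying it."""
    index = defaultdict(list)
    for photo in training_photos:
        for tag in photo["tags"]:
            index[tag].append((photo["lat"], photo["lon"]))
    return index

def guess_location(tags, index):
    """Average the coordinates associated with a photo's known tags."""
    coords = [c for tag in tags for c in index.get(tag, [])]
    if not coords:
        return None  # no overlapping tags; a real system needs a fallback
    lat = sum(c[0] for c in coords) / len(coords)
    lon = sum(c[1] for c in coords) / len(coords)
    return lat, lon

# Hypothetical usage with made-up training examples:
train = [
    {"tags": ["eiffel", "paris"], "lat": 48.858, "lon": 2.294},
    {"tags": ["paris", "louvre"], "lat": 48.861, "lon": 2.336},
]
index = build_tag_index(train)
print(guess_location(["paris"], index))  # -> roughly (48.86, 2.32)
```

A real entry would obviously need visual features and a smarter aggregation than a flat average (which misbehaves near the antimeridian), but the toy version conveys why the geotagged half of the dataset makes useful training data.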

Yahoo is also partnering with the International Computer Science Institute (at the University of California, Berkeley) and Lawrence Livermore National Laboratory to process the data on a specialized supercomputer — the Cray Catalyst machine designed for data-intensive computing — and extract various audio and visual features. That dataset, which Yahoo claims is north of 50 terabytes (the original 100-million-photos data is only about 12 gigabytes), and tools for analyzing it will be available on Amazon Web Services later this summer.
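
The article doesn't say which audio and visual features the Catalyst run will extract, so purely as an illustration, here is one of the simplest visual features you could compute per image: a coarse RGB color histogram. This uses Pillow, and the binning scheme is an arbitrary choice, not anything from Yahoo's pipeline.

```python
# Illustrative only: a coarse RGB color histogram as a stand-in for the
# unspecified visual features Yahoo plans to extract at scale.
from PIL import Image  # Pillow

def color_histogram(path, bins_per_channel=4):
    """Return a normalized RGB histogram with bins_per_channel**3 buckets."""
    img = Image.open(path).convert("RGB").resize((64, 64))
    step = 256 // bins_per_channel
    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in img.getdata():
        idx = ((r // step) * bins_per_channel
               + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1
    total = sum(hist)
    return [h / total for h in hist]

# Usage (the path is hypothetical):
# features = color_histogram("photo.jpg")
```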

Image courtesy of Flickr user David Shamma.

  1. QUOTE: “original 100-million-photos data is only about 12 gigabytes”

    That is exactly 128 bytes per photo. The photos have probably been shrunk to 10px × 10px.

    1. Derrick Harris Friday, July 4, 2014

      Alternatively, it might be because the dataset contains URLs to the images.

      1. “The dataset includes URLs for nearly 100 million images and 700,000 videos, as well as their metadata.”

  2. As someone who found the locations of many Japan tsunami-related videos uploaded to YouTube (by using Google Maps), I would say good luck with that rather ambitious enterprise of geolocating photos and videos. Why can’t Yahoo do it themselves?
