I have written quite a bit about GDELT (the Global Database of Events, Languages and Tone) over the past year, because I think it’s a great example of the type of ambitious project only made possible by the advent of cloud computing and big data systems. In a nutshell, it’s database of more than 250 million socioeconomic and geopolitical events and their metadata dating back to 1979, all stored (now) in Google’s cloud and available to analyze for free via Google BigQuery or custom-built applications.
On Thursday, version 2.0 of GDELT was unveiled, complete with a slew of new features — faster updates, sentiment analysis, images, a more-expansive knowledge graph and, most importantly, real-time translation across 65 different languages. That’s 98.4 percent of the non-English content GDELT monitors. Because you can’t really have a global database, or expect to get a full picture of what’s happening around the world, if you’re limited to English language sources or exceedingly long turnaround times for translated content.
For a quick recap of GDELT, you can read the story linked to above, as well as our coverage of project creator Kalev Leetaru’s analyses of the Arab Spring and Ukrainian crisis and the Ebola outbreak. For a deeper understanding of the project and its creator –who also helped measure the “Twitter heartbeat” and uploaded millions of images from the Internet Archive’s digital book collection to Flickr — check our Structure Show podcast interview with Leetaru from August (embedded below). He’ll also be presenting on GDELT and his future plans at our Structure Data conference next month.
Leetaru explains GDELT 2.0’s translation system in some detail in a blog post, but even at a high level the methods it uses to achieve near real-time speed are interesting. It works sort of like buffering does on Netflix:
“GDELT’s translation system must be able to provide at least basic translation of 100% of monitored material every 15 minutes, coping with sudden massive surges in volume without ever requiring more time than the 15 minute window. This ‘streaming’ translation is very similar to streaming compression, in which the system must dynamically modulate the quality of its output to meet time constraints: during periods with relatively little content, maximal translation accuracy can be achieved, with accuracy linearly degraded as needed to cope with increases in volume in order to ensure that translation always finishes within the 15 minute window. In this way GDELT operates more similarly to an interpreter than a translator. This has not been a focal point of current machine translation research and required a highly iterative processing pipeline that breaks the translation process into quality stages and prioritizes the highest quality material, accepting that lower-quality material may have a lower-quality translation to stay within within the available time window.”
In addition, Leetaru wrote:
“Machine translation systems . . . do not ordinarily have knowledge of the user or use case their translation is intended for and thus can only produce a single ‘best’ translation that is a reasonable approximation of the source material for general use. . . . Using the equivalent of a dynamic language model, GDELT essentially iterates over all possible translations of a given sentence, weighting them both by traditional linguistic fidelity scores and by a secondary set of scores that evaluate how well each possible translation aligns with the specific language needed by GDELT’s Event and GKG systems.”
It will be interesting to see how and if usage of GDELT picks up with the broader, and richer, scope of content it now covers. With an increasingly complex international situation that runs the gamut from the climate change to terrorism, it seems like world leaders, policy experts and even business leaders could use all the information they can get about what’s connected to what, who’s connected to whom and how this all might play out.