
Summary:

The relevancy-defining, edge-weighting algorithms of Google’s Knowledge Graph, Facebook’s Open Graph and Gravity’s Interest Ontology are closely guarded company secrets. Imagine if that data were available to everyone — it would be as disruptive as Amazon Web Services. The internet would be a better place.


Everyone is always asking me how big our ontology is. How many nodes are in your ontology? How many edges do you have? Or the most common — how many terabytes of data do you have in your ontology?

We live in a world where over a decade of attempted human curation of a semantic web has borne very little fruit. It should be quite clear to everyone at this point that this is a job only machines can handle. Yet we are still asking the wrong questions and building the wrong datasets.

Understanding NLP

The exponential growth of data created on the web has naturally led to a desire to categorize that data. Facts, relationships, entities — that is how those of us who work in the semantic world refer to the structuring of data. It’s pretty simple actually. Because we are humans, it happens so quickly in our subconscious minds that it’s incredibly easy to take it for granted if you don’t work on teaching machines to do it.

It’s also not a new field; deconstructing human language into structured data (natural language processing) has been around for almost 40 years. NLP can take the sentence “Jim is writing an article about why people ask the wrong questions about ontologies” and structure it into:

[Image: the sentence tagged with part-of-speech labels]

NNP = Proper noun, singular
VBZ = Verb, 3rd person singular present
VBG = Verb, gerund or present participle
DT = Determiner
NN = Noun, singular
IN = Preposition
WRB = Wh-adverb
NNS = Noun, plural
VBP = Verb, non-3rd person singular present
JJ = Adjective

That’s pretty impressive — a machine just did that. I bet you couldn’t label all of those (maybe your high school English teacher could). But you can understand what the sentence means less than a hundred milliseconds after reading it, and that’s what really matters. The machine has no understanding of the information the sentence conveys. Its job is to decompose unstructured language into structured data that another system might be able to understand.
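
To make this concrete, here is a minimal sketch of the same kind of tagging using the open-source NLTK library in Python. The article does not name the engine it used, so NLTK is only a stand-in for illustration:

```python
# Part-of-speech tagging with NLTK (an illustrative stand-in; the tagger
# used for the example above is not specified in the article).
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

sentence = ("Jim is writing an article about why people "
            "ask the wrong questions about ontologies")

tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# [('Jim', 'NNP'), ('is', 'VBZ'), ('writing', 'VBG'), ('an', 'DT'),
#  ('article', 'NN'), ('about', 'IN'), ('why', 'WRB'), ...]
```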

That’s where semantics come in. Semantics try to understand the relationships between things (we call them entities, or nodes, if you really want to go down the rabbit hole).

Jim [PERSON] -> writes [ACTION] -> sentence [THING]. Seems like something a child could do, right? The human brain is amazing.

Semantic analysis isn’t easy

Try this one: “I paddled out today, and dude, I look like a lobster.”

What does that mean? We know someone is talking about himself because of the leading personal pronoun. NLP won’t help us with the rest, but with “today” most good entity extraction engines can tell us we’ve got a time period (maybe even future intent — exciting!). We can use publicly available ontology data from Freebase, Wikipedia or DBpedia (or many others) to determine paddle [disambiguates to CANOEING], dude [PERSON, TYPE OF GENDER] and lobster [COMMERCIAL CRUSTACEANS].

So we’ve got:

[PERSONAL REFERENCING]
[CURRENT TIME PERIOD]
[CANOEING]
[GENDER MALE]
[COMMERCIAL CRUSTACEANS]
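
In code, this first pass is little more than a lookup against an ontology. Here is a toy sketch; the surface-form-to-concept table is invented for illustration, whereas a real system would resolve and disambiguate terms against Freebase or DBpedia:

```python
# Toy dictionary-based concept extraction. The lookup table is invented;
# real systems disambiguate against a full ontology (DBpedia, Freebase).
CONCEPTS = {
    "i": "PERSONAL REFERENCING",
    "today": "CURRENT TIME PERIOD",
    "paddled": "CANOEING",
    "dude": "GENDER MALE",
    "lobster": "COMMERCIAL CRUSTACEANS",
}

def extract(text):
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    hits = [CONCEPTS[t] for t in tokens if t in CONCEPTS]
    return list(dict.fromkeys(hits))  # dedupe while preserving order

print(extract("I paddled out today, and dude, I look like a lobster."))
# ['PERSONAL REFERENCING', 'CANOEING', 'CURRENT TIME PERIOD',
#  'GENDER MALE', 'COMMERCIAL CRUSTACEANS']
```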

This is like an ad server’s dream! Whoever tweeted this needs to be pummeled and retargeted with Red Lobster ads for months. I have actually set up sites with this sentence and tasked many IAB-focused ad systems with recognizing it — and it’s all Red Lobster all the time. I’ve enjoyed many half-off cheese biscuits in the last 12 months (R&D sometimes bears not only fruit but also cheesy biscuits).

But I wasn’t talking about canoeing or lobsters. When I’m not working I surf and, unfortunately, occasionally I do get sunburned — sometimes to the point of being told I look like a lobster. That’s what I was conveying in my tweet. Why is it so easy for us to understand but so hard for a machine to understand?

But maybe this is just a funny edge case. You can confuse any computer system if you try hard enough, right?

Unfortunately, this isn’t an edge case. Before Twitter, lexicons (distinct vocabularies and colloquial terms) were considered specific to different languages or particular industries. This is no longer true: 140 characters has not just changed people’s tweets, it has changed how people talk on the web. More and more information is being communicated in smaller and smaller amounts of language, and this trend is only going to continue. #exponential

So why is there not a semantic web? Why can’t we solve this yet? Why can’t computers understand that “I’m a lobster” means you are sunburned and not that you want cheesy bread?

Not just connections, but connections that matter

I believe the reason that there are not hundreds of companies exploiting machine learning techniques to generate a truly semantic web is the lack of weighted edges in publicly available ontologies. “Lobster” and “sunscreen” are seven hops away from each other in DBpedia — way too many to draw any correlation between the two. (Any article in Wikipedia can be connected to any other article within about 14 hops, and that’s the extreme. Meanwhile, completely unrelated concepts are often just a few hops from each other.) But by analyzing massive amounts of both written and spoken English text from articles, books, Twitter and television, it is possible for a machine to automatically draw a correlation and create a weighted edge that effectively short-circuits the seven hops otherwise necessary.
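
One well-known way to derive such a weighted edge is pointwise mutual information over co-occurrence counts. The sketch below uses a five-line invented corpus to keep it readable; the claim above is that the same statistic, computed at web scale, surfaces the connection between lobster and sunburn:

```python
# Deriving edge weights from co-occurrence via pointwise mutual
# information (PMI). The corpus is invented for illustration; in
# practice you would count over massive amounts of written/spoken text.
import math
from collections import Counter

corpus = [
    "paddled out today and now i look like a lobster",
    "lobster red sunburn because i forgot sunscreen",
    "sunscreen sunburn lobster skin after surfing",
    "lobster dinner with butter",
    "pasta dinner with wine",
]

docs = [set(doc.split()) for doc in corpus]
n = len(docs)
word_count = Counter(w for doc in docs for w in doc)
pair_count = Counter((a, b) for doc in docs
                     for a in doc for b in doc if a < b)

def pmi(x, y):
    """PMI of two words over per-document co-occurrence."""
    x, y = sorted((x, y))
    if pair_count[(x, y)] == 0:
        return float("-inf")
    p_xy = pair_count[(x, y)] / n
    return math.log(p_xy / ((word_count[x] / n) * (word_count[y] / n)))

print(round(pmi("lobster", "sunscreen"), 2))  # 0.22 -> positive: add an edge
print(round(pmi("lobster", "dinner"), 2))     # -0.47 -> negative: no shortcut
```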

Many organizations are dumping massive numbers of unweighted facts into our repositories of total human knowledge, naively attempting to categorize everything without realizing that those repositories need to mimic how humans use knowledge.

For example: As of today, Kobe Bryant is categorized under 28 categories in Wikipedia, each of them with the same level of attachment (one hop in a breadth- or depth-first traversal).

[Image: the 28 Wikipedia categories attached to Kobe Bryant]

But when you are at a coffee shop and overhear the person next to you mention Kobe Bryant, what are you able to infer they are talking about? “Basketball” or “American Roman Catholics”? How can the human brain infer that so quickly yet machines get so confused? It is not due to lack of technical processing power, Moore’s law slowing down or the thickness of our silicon wafers — it’s because of the data.

This is a small example of what someone who works with graph theory would come up with if he or she were to run a standard few-hop depth-first traversal from Kobe Bryant on Wikipedia and attempt to coalesce around a common category:

[Image: a few-hop unweighted traversal from Kobe Bryant on Wikipedia]

So when someone tweets about Kobe Bryant, are they talking about people born in 1978, Pennsylvania, Food & Drink, or Canadian Inventions? This is a common example of how confused a machine can become when the distance of unweighted edges between nodes is used as a scoring mechanism for relevancy.

But what happens if we weight our edges? Run the same Wikipedia nodes through a traversal algorithm that accounts for path costs rather than hop counts, and we get the following:

[Image: the same traversal re-scored with weighted edges]

Our machine is starting to think like a human.

Algorithms and processors aren’t enough

Weighted path traversals are not new. Dijkstra’s algorithm was invented in 1956 (the answer has been around for a long time), but the processing power and memory necessary to leverage a traversal algorithm like Dijkstra’s — and score path costs and not just distance between nodes across massive data sets — has only in the last few years become available to the average startup. That’s a huge win for all of us, but the data and ontologies to actually do it are still not publicly available.
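
For the curious, here is what that looks like in a few lines. The graph below is a toy slice of the Kobe Bryant neighborhood with invented weights (lower cost means stronger relevance); it is not Gravity’s or Wikipedia’s actual data:

```python
# Dijkstra's algorithm over a toy weighted concept graph. Weights are
# invented for illustration: lower path cost = more relevant.
import heapq

graph = {
    "Kobe Bryant": {"Basketball": 0.1, "Gatorade": 0.9, "1978 births": 0.8},
    "Basketball": {"Sport": 0.2},
    "Gatorade": {"Food & Drink": 0.3},
    "1978 births": {}, "Sport": {}, "Food & Drink": {},
}

def dijkstra(source):
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

print(sorted(dijkstra("Kobe Bryant").items(), key=lambda kv: kv[1]))
# Basketball (0.1) and Sport (0.3) now rank far ahead of Food & Drink
# (1.2), even though every node is just one or two unweighted hops away.
```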

I propose as an industry we begin to focus more on relevancy and less on factual accuracy. The above flawed traversal is actually 100 percent factually accurate. Kobe Bryant was born in 1978, he is (or was) sponsored by Gatorade, Gatorade is a drink and basketball was invented by a Canadian. But even now that you know all those facts, when you hear someone talk about Kobe Bryant tomorrow you will still know they are talking about basketball.

The only way we will actually get to a truly semantic web is when machines are able to think (or, more accurately, perform) as we do. The processing power, technology and algorithms to do that exist today. Unfortunately, said power is unleashed on inherently flawed datasets, and that is why we still see Red Lobster ads on sunscreen pages. We need to become much less focused on adding facts to Freebase, DBpedia and the other publicly available ontologies, and much more focused on weighting the edges between the facts that we are adding.

That is how we create a truly semantic web. The answer lies in the data, but not in the data available on a web page or in a set of thousands of web pages available to be recommended by a particular algorithm. Information retrieval and categorization techniques such as LSI, PLSI and LDA are only aware of the context of the information in the datasets fed to them. These base algorithms (LDA, especially) are incredibly useful, but without the context of a global human knowledge base, you cannot build an interest graph, and you cannot build a semantic web.
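
A two-minute experiment makes that corpus-boundedness visible. The sketch below fits an LDA model (via scikit-learn, chosen here as an assumption; the article names no toolkit) on a tiny invented snowboarding corpus. Whatever topics it learns, “surfing” can never appear in them, because it is outside the training data:

```python
# LDA only sees the corpus it is fed; related concepts outside the
# corpus (e.g. surfing) simply do not exist for it. Invented documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "fresh powder snowboard carving on the halfpipe all morning",
    "snowboard bindings and boots tuned for deep powder runs",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=1, random_state=0).fit(X)

terms = vec.get_feature_names_out()
top = lda.components_[0].argsort()[::-1][:5]
print([terms[i] for i in top])  # only snowboarding vocabulary comes back
```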

Ontologies become absolutely necessary as we attempt to solve this problem. Feed any of the above algorithms 10 articles a particular person read about snowboarding and they will successfully recommend other snowboarding articles, but they are unaware that snowboarding and surfing are two sports that go hand in hand. People who enjoy one usually enjoy the other. An ontology with weighted edges is necessary to make that relevant yet tangential connection, which is a crucial step to avoid the dreaded “filter bubble.”
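
A weighted ontology turns that tangential hop into a one-line graph lookup. A minimal sketch, assuming an interest graph with invented affinity weights between 0 and 1:

```python
# Expanding a user's interests one weighted hop across an interest
# graph. The nodes and affinity weights are invented for illustration.
interest_graph = {
    "Snowboarding": {"Surfing": 0.8, "Skiing": 0.7, "Winter": 0.2},
    "Surfing": {"Snowboarding": 0.8, "Ocean": 0.4},
}

def expand_interests(seed, threshold=0.5):
    """Return the seed interest plus neighbors above the affinity threshold."""
    neighbors = interest_graph.get(seed, {})
    related = [node for node, w in neighbors.items() if w >= threshold]
    return [seed] + related

print(expand_interests("Snowboarding"))
# ['Snowboarding', 'Surfing', 'Skiing'] -> a snowboarding reader also gets
# surfing content, instead of ever-narrower snowboarding recommendations.
```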

[Photo: Benedetto (center) at Structure: Data 2012.]

A Wikipedia for weighted edges

So to all of my semantic colleagues out there, maybe we should shift our thinking and begin to use a different yardstick to measure the quality of our knowledge repositories. For 99 percent of all use cases we have enough nodes. We have successfully deposited the majority of places, events, people, thoughts, and most other tangible and intangible things in the world into our data stores — and a good portion of the population has access to all of it from the smartphone in their pocket. That is an incredible feat. But it’s only half of the equation. We have yet to map the data into a format that mimics how the human mind thinks.

The way to do this is to begin weighting the edges that interconnect the nodes and facts we are adding every day. It requires us to raise the bar from factually accurate to actually relevant. Kobe Bryant -> Philadelphia is factually accurate, but Kobe Bryant -> Basketball is actually relevant. Today’s ontologies make no distinction between those two facts, and without that distinction a machine will never be able to create the semantic web we have been working towards for almost a decade.

Every fact in Wikipedia was added by a human. Weighting all of the edges between those facts sounds like a monumental task. But crowdsourcing the creation of a central repository of all human knowledge sounded impossible a little over a decade ago, and we’ve done a very good job of that.

It wasn’t too long ago that running an elastic cloud infrastructure was something that was available to only the largest internet companies in the world. Amazon changed that. Now, one smart engineer can turn an idea into a company for $50 a month. But there is still a large divide between one smart engineer and companies that can use machines to perform web-scale semantic analysis of content.

The relevancy-defining, edge-weighting algorithms of Google’s Knowledge Graph, Facebook’s Open Graph and Gravity’s Interest Ontology are closely guarded company secrets. Imagine if that data were available to everyone — it would be as disruptive as Amazon Web Services. The internet would be a better place.

At Gravity, we have combined many publicly available ontologies with our own internally generated facts and weights to create a large interest-based undirected graph that leverages many forms of edge weighting to solve the above problem. For many years, we built and protected this as a company secret. In the last year we have realized that our mission — building a web-scale personalization platform — takes a lot more than an ontology with weighted edges. It’s an iceberg problem that looks simple when you are designing collaborative filtering for an app or yield-optimizing by user for your site, but our mission is a platform for the entire web.

A relevancy-based ontology with weighted edges is absolutely necessary, but it is just the beginning. That’s why we are also formalizing a plan to develop an open, centralized place to allow human and machine curation of ontology edge weights for the community. We plan on contributing a significant amount of our data to get the project started. More on that to come.

Until then, as a community, I believe we should begin to focus more on relevancy and relationships, and less on the continued addition of facts to our publicly available semantic resources.

Jim Benedetto is co-founder and CTO of Gravity.

  1. I know nothing about the subject, but it seems to me that in the case of the “I paddled out today, and dude, I look like a lobster.” example, a better NLP would be good enough if it categorized “like” as a subordinating conjunction.

  2. =) i do love this dialogue, and the problem is so well exemplified in your example: “Kobe Byrant -> Philadelphia is factually accurate, but Kobe Bryant -> Basketball is actually relevant.”

    “Really?!!!” says his mother. “As if the birthplace of a human is less relevant than just one of his careers,” supports The Historian.

    The ethics of relevancy is so susceptible to corruption on the most simple levels. Simply attributing more relevance to one of these statements over the other is where the ball dropping begins. Which is why semantics, well, sigh, we are a world at war — they are simply tools of a small-minded world.

    Besides pissing off his mother or others who are more human-based rather than sports-based, I could take this a step further and see this type of attribution of relevance as racist (or at least being perceived as racist) on a very simple level: did you just try to make a racist distinction? No, probably not. Yet, I know well the segment of the population for which it would be considered racist to put his game before his birthplace, as if that is all black people are good for/known for/celebrated for/given relevance for, etc.

    I completely agree with your statement that we should stop and consider what we are feeding into these types of graphing systems, as the more we empower them, the more they will become the operational reality. But perhaps it’s too late to put the genie back in the bottle, and we should just work on the ethical base that is acceptable moving forward.

    Sigh, a human rights nightmare to be certain. Thank you for paying attention to it. As for the “secrecy”: that should be transparent. We should be able to see the psychology of the designs.

    1. Nikohl –

      I completely agree and you have obviously thought a lot about this space.

      You’ve pointed out some difficult but very valid edge cases. And there are millions more. How can we teach machines to think like humans when we can’t even get humans to agree on anything :)

      In the early days of the company at Gravity we had many of these debates. Is the edge-weighting algo that traverses “Fishing” through the graph to “Sport” more or less accurate than the one that takes it up to “Hobby”? We asked 50 people and got nowhere – the answer depends on whether you fish.

      We had to have a standardized way to quantify the stability of our graph, so after a couple of days of intense debates between those of us who fished and those of us who didn’t, we designed what we internally call “The Coffee Shop Scenario”.

      Sitting in a coffee shop, you overhear just a phrase or a sentence from the table next to you. Someone then comes up to you and asks, “What were they talking about?” You will get different answers, but in the vast majority of cases/words/phrases a statistically significant majority will agree. We base the success or failure of our algorithms on determining what the majority of people in the coffee shop would say.

      There are flaws to this approach because sometimes there isn’t one answer (fishing is one of many examples – Sport vs. Hobby depends on whether or not you enjoy fishing), but that is the beauty of an undirected graph – there doesn’t have to be one correct answer. We just need a lot fewer wrong ones.

      Jim
      Gravity.com
      @jimbenedetto

      1. Random graph would be my option

      2. Because once you start the coffee shop dialog it becomes a weighted discussion based on ___. Eventually, there’s a quantitative answer to a qualitative question. Nice to know others are thinking too.

  3. Reblogged this on Grandiose Data Delusions and commented:
    This is a great discussion of natural language processing and semantic search.

    It is well worth the time to read and then re-read!

  4. Great article

  5. I’ve been working on semantic technologies for two decades, and a few years ago I did some research on relevance, in particular on relevance decay over time.
    I came to the conclusion that relevance (in this context) is not something absolute associated with a piece of information.
    In most cases relevance changes with the context of usage.
    In the lobster example the context would be something like “outdoor sports”. This context should strengthen the links from sunburn to red (skin), and from there to lobster (red color).
    Probably “look like a lobster” should be a concept in the network.

    Thus, adding weights to links between concepts: yes, but more than one (and possibly many).

  6. Dario Boriani, Ph.D., CSSBB Monday, November 25, 2013

    Interesting article. It would seem that ‘it’ is not about the data either, or not exclusively. It is, at least in part, about the knowledge and background of the eventual users of the data (broader than usage context), and about the different interpretations of the same data that may occur not only among different users but also for the same user over time. Unless one can target groups of users with known, static characteristics over a brief time frame, leading to a potentially narrower spectrum of interpretations, a participatory role for users in actually defining tailored relevance — as opposed to being passive recipients of pre-established ones — might be called for, as well as a role for learning systems that dynamically score both interpretation successes and failures and tune their hypotheses about interpretation. After all, how could relevance truly be established separately from, and before, actual usage? Thanks again.

  7. The problem, even with a properly edged semantic web, is that it doesn’t grow or adapt as knowledge grows: it’s like a copy of your grandparents’ 1950s Webster’s dictionary. To take the racist thread (although without diving into the issue itself), someone’s grandparents might think “Oriental” is a perfectly acceptable term for SE Asian identity. They’re old and they’re not going to change their views on the subject because, well, they’re old. Anytime they search for “Oriental” it would likely be broadly Sino-Japanese, because that was the focus of the world during their lifetimes. They don’t intend any harm, and really there isn’t any, per se, in a search for furniture, culture, etc. (Speaking about or to someone is another category.) Regardless of whether it was an acceptable term, that’s the information they’re looking for.

    Now a 15-year-old comes along and looks up the term “Oriental,” and they would expect to get a treatise on racial stereotyping, the history of the civil rights movement, Japanese internment and probably some other relevant information. Why? Because “Oriental” would not (or probably should not) be a term used in any other context for a 15-year-old.

    And the generation between these two extremes might go either way depending on politics, past travel, and geographic upbringing.

    The computer question is, which “edge” is more relevant? The answer is, it depends on the person. And unfortunately, at the point where the computer knows enough about you to get that search correct, most people will be supremely uncomfortable with the size of their online dossier.

  8. I could not help but think of the Google “PageRank” algorithm when reading this – surely you can use the number and quality of links to an entity to determine broad statistical relevance? (Maybe this is the top secret weighting algorithm of Google’s Knowledge Graph?)

    I agree with the other comments along the lines of “relevancy is in the eye of the beholder”. The relevancy I would place on certain connections has more to do with where I went to school, my social network and broad life experiences… a very difficult thing for people to know about, understand and apply, never mind computers. Maybe my relevance scores reflect how I would like to be more than how I am…

    Great article though, thanks!

