
Summary:

The ability to distribute real-time information through social networks like Twitter is a powerful thing, but a new study points out that one of the downsides of this phenomenon is the fact that much of the content that gets linked to eventually disappears.


One of the characteristics of the modern media age — at least for anyone who uses the web and social media a lot — is that we are surrounded by vast clouds of rapidly changing information, whether it’s blog posts or news stories or Twitter and Facebook updates. That’s great if you like real-time content, but there is a not-so-hidden flaw — namely, that you can’t step into the same stream twice, as Heraclitus put it. In other words, much of that information may (and probably will) disappear as new information replaces it, and small pieces of history wind up getting lost. According to a recent study, which looked at links shared through Twitter about news events like the Arab Spring revolutions in the Middle East, this could be turning into a substantial problem.

The study, which MIT’s Technology Review highlighted in a recent post by the Physics arXiv blog, was done by a pair of researchers in Virginia, Hany SalahEldeen and Michael Nelson. They took a number of recent major news events over the past three years — including the Egyptian revolution, Michael Jackson’s death, the elections and related protests in Iran and the outbreak of the H1N1 virus — and tracked the links that were shared on Twitter about each. Following the links to their ultimate source showed that an alarming number of them had simply vanished.

After two and a half years, 30 percent had disappeared

In fact, the researchers said that within a year of these events, an average of 11 percent of the material that was linked to had disappeared completely (and another 20 percent had been archived), and after two-and-a-half years, close to 30 percent had been lost altogether and 41 percent had been archived. Based on this rate of information decay, the authors predicted that more than 10 percent of the information about a major news event will likely be gone within a year, and the remainder will continue to vanish at the rate of 0.02 percent per day.
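To get a feel for what that prediction implies, here is a rough back-of-the-envelope sketch in Python. It is not the researchers' own model: the function name, the 10 percent floor for the first year, the straight-line ramp inside that first year and the reading of "0.02 percent per day" as percentage points of the original material are all assumptions made for illustration.

```python
def estimated_loss_percent(days_since_event: int,
                           first_year_loss: float = 10.0,
                           daily_loss_after: float = 0.02) -> float:
    """Rough share (in percent) of linked material lost after a number of days.

    Assumptions, not taken from the study itself:
    - loss ramps up linearly to `first_year_loss` over the first 365 days;
    - after that, `daily_loss_after` percentage points vanish each day.
    """
    if days_since_event < 0:
        raise ValueError("days_since_event must be non-negative")
    if days_since_event <= 365:
        return first_year_loss * days_since_event / 365
    extra_days = days_since_event - 365
    # Cap at 100 percent: nothing more can be lost once everything is gone.
    return min(100.0, first_year_loss + daily_loss_after * extra_days)


# Print the estimate for a few horizons (in years).
for years in (1, 2, 2.5, 5):
    days = round(years * 365)
    print(f"{years} years: ~{estimated_loss_percent(days):.1f}% lost")
```

Under these assumptions the estimate at two and a half years comes out at about 21 percent, somewhat below the close to 30 percent the researchers actually observed, so the quoted daily rate is best read as a rough approximation rather than an exact fit.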

It’s not clear from the research why the missing information disappeared, but it’s likely that in many cases blogs have simply shut down or moved, or news stories have been archived by providers who charge for access (something that many newspapers and other media outlets do to generate revenue). But as the Technology Review post points out, this kind of information can be extremely valuable in tracking how historical events developed, such as the Arab Spring revolutions — which the researchers note was the original impetus for their study, since they were trying to collect as much data as possible for the one-year anniversary of the uprisings.

Other scientists, and particularly librarians, have also raised red flags in the past about the rate at which digital data is disappearing. The National Library of Scotland, for example, recently warned that key elements of Scottish digital life were vanishing into a “black hole,” and asked the government to fast-track legislation that would allow libraries to store copies of websites. Web pioneer Brewster Kahle is probably the best known digital archivist as a result of his Internet Archive project, which keeps copies of websites dating back to the early days of the web (Kahle also has a related project called the Open Library).

Getting access to social data is not easy

Although the Virginia researchers didn’t deal with it as part of their study, a related problem is that much of the content that gets distributed through Twitter — not just websites that are linked to in Twitter posts, but the content of the posts themselves — is difficult and/or expensive to get to. Twitter’s search is notoriously unreliable for anything older than about a week, and access to the complete archive of your tweets is only provided to those who can make a special case for needing it, such as Andy Carvin of National Public Radio (who is writing a book about the way he chronicled the Arab Spring revolutions).

As my colleague Eliza Kern noted in a recent post, an external service called Gnip now has access to the full archive of Twitter content, which it will provide to companies for a fee. And Twitter-based search and discovery engine Topsy also has an archive of most of the full “firehose” of tweets — although it focuses primarily on content that is retweeted a lot — and provides that to companies for analytical purposes. But neither can be linked to easily for research or historical archiving purposes. The Library of Congress also has an archive of Twitter’s content, but it isn’t easily accessible and it’s not clear whether new content is being added or not.

Twitter has talked about providing a service that would let users download their tweets at some point, but it hasn’t said when such a thing would be available — and even if users did create their own archive in this way (or by using tools like Thinkup from former Lifehacker editor Gina Trapani) it would be difficult to link those in a way that would provide the kind of connected historical information the Virginia study is describing. And it’s not just Twitter: there is no easy way to get access to an archive of Facebook posts either, although users in Europe can request access to their own archive as a result of a legal ruling there.

For better or worse, much of the content flowing around us seems to be just as insubstantial as the clouds that it is hosted in, and the existing tools we have for trying to capture and make sense of it simply aren’t up to the task. The long-term social effects of this digital amnesia remain to be seen.

Post and thumbnail images courtesy of Shutterstock user Ribah and Flickr user See-ming Lee

  1. We have an app called Mettya (www.mettya.com) that solves this problem and a few others. It allows you to pull photos/videos from across the social web through keywords. The content is actually stored in the database so you can come back to it anytime. http://youtu.be/IwlOEVYEKmM And a vimeo http://vimeo.com/48667055

  2. While that sounds like a lot of lost information, considering the amount of information being created, that is still far more saved than lost, by several orders of magnitude, compared with classical media just 50 years ago.

    More importantly, we need to distinguish between information and knowledge. The value of information is in every byte, while the value in knowledge is in the actual message. As knowledge is what truly allows for human development and societal advancement, this is what we should ensure does not get lost. In my opinion, the “weeding” out of some of the information actually aids in removing the haystack of information to find the needle of knowledge. Of course, some of the lost information may encompass a complete message of knowledge, but thinking in these terms will allow us to focus our energy on the actual goal, instead of the likely impossible goal of total recall.

    1. You are so right. The information has not disappeared. We live in an era where vastly more is recorded than ever before.

      Even today, most conversations are not recorded, but many more are.

      And the presumption here appears to be that a broken link means the data is gone? That’s just not how it works. Old links that lead to a “file not found” error often leave easy clues to where that data is. It could have been removed (a better word than “disappeared”), but quite likely the redirects changed, or a page file was relabeled and still exists on a drive somewhere. A few seconds’ search with the clues from the social media post and some common sense will find the source quite often.

  3. But do we need to keep everything? That’s the question. We don’t need to keep all of the day-to-day press so why keep every other little piece of data?

    1. We almost certainly don’t need to keep everything. The problem is that it’s virtually impossible to tell what historians will find significant in future – five, ten, twenty, fifty, a hundred years from now. Archaeologists can learn a lot from middens (rubbish heaps)! Sometimes the things we regard as insignificant now are very telling to future historians.

  4. I created an IFTTT recipe to archive my -most important- tweets.
    Regards,
    Michele (@steccami)

    1. Didn’t Twitter just shut that firehose down though?

      1. Kirsten Lambertsen Tuesday, September 25, 2012

        Yes, they did. That IFTTT recipe will soon be deleted.

  5. If only this was true. The problem now is that EVERYTHING is kept. Just cos you can’t see it doesn’t mean it’s not there. Give Google a shout. There’s far too much data and not enough intelligent humans to interpret it sensibly. We’re not meant to keep everything and one day, when the semantic web becomes a reality, it will come back to bite us.

  6. I can’t see how 30% of information being unavailable after two and a half years means that much of the content we use has been lost.

    I’d echo the comments below and go further and ask whether only 30% data loss is enough. I’d like a way of setting a decay date on data, especially on sites like Facebook and Twitter, to say don’t show this after x months.

    In a similar way should data be anonymised after a certain period of time?

  7. I agree this is an important problem. An individual can take some steps to deal with it prospectively: the excellent Pinboard is archiving my tweets for me as I write. Good luck with downloading your data from Facebook, though, which claims to offer this service; ItDidn’tWorkForMe™ when I tried it.

    The notion that past states of consciousness of the polity will only be available to business, and at a price, is pernicious and needs to be addressed. Hopefully the 2.0 equivalent of archive.org will arrive. But who’ll back that up? These single points of failure are both technically and politically unacceptable.

    Users will undoubtedly find such digital amnesia highly unsettling, with the effect of infuriated abandonment of service providers who do not address this seriously. Tech stockholders take note!

    I see a bright future for smaller service providers and data collectives for those who wish to take back some control of their data, but wish to avoid BigCorp solutions. This will have the advantage of introducing richer variety to the web ecosystem, and hopefully make it less vulnerable to failure.

  9. The underlying assumption behind the “alarm” over disappearance is that all information is/should be persistent (and by implication has constant value over time). It doesn’t – the world is built on decay (entropy…). There are multiple information-decaying forces at work. For example, most information exponentially diminishes in value over time (as do most people – they tend to eventually die). There are of course similar, opposite forces driving the preservation impulses: information (being remembered and sending messages over time) is the only practical way people have found to achieve a tiny bit of immortality. Just two of many interesting properties of information that we tend to overlook: http://goo.gl/n6pHW.

  10. The source post for this is alarmist and missing a key piece of information.

    Tweets aren’t disappearing. Access, however, is variable.

    These researchers didn’t research their subject very thoroughly.

