Summary:

The ability to distribute real-time information through social networks like Twitter is a powerful thing, but a new study points out that one of the downsides of this phenomenon is the fact that much of the content that gets linked to eventually disappears.

One of the characteristics of the modern media age — at least for anyone who uses the web and social media a lot — is that we are surrounded by vast clouds of rapidly changing information, whether it’s blog posts or news stories or Twitter and Facebook updates. That’s great if you like real-time content, but there is a not-so-hidden flaw — namely, that you can’t step into the same stream twice, as Heraclitus put it. In other words, much of that information may (and probably will) disappear as new information replaces it, and small pieces of history wind up getting lost. According to a recent study, which looked at links shared through Twitter about news events like the Arab Spring revolutions in the Middle East, this could be turning into a substantial problem.

The study, which MIT’s Technology Review highlighted in a recent post by the Physics arXiv blog, was done by a pair of researchers in Virginia, Hany SalahEldeen and Michael Nelson. They took a number of recent major news events over the past three years — including the Egyptian revolution, Michael Jackson’s death, the elections and related protests in Iran and the outbreak of the H1N1 virus — and tracked the links that were shared on Twitter about each. Following the links to their ultimate source showed that an alarming number of them had simply vanished.

After two and a half years, 30 percent had disappeared

In fact, the researchers said that within a year of these events, an average of 11 percent of the material that was linked to had disappeared completely (and another 20 percent had been archived), and after two and a half years, close to 30 percent had been lost altogether and 41 percent had been archived. Based on this rate of information decay, the authors predicted that more than 10 percent of the information about a major news event will likely be gone within a year, and the remainder will continue to vanish at the rate of 0.02 percent per day.
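To get a feel for what those numbers imply, here is a rough back-of-the-envelope sketch in Python. It is not the researchers' model: it assumes the 0.02 percent per day applies linearly to the original pool of links once the first year is over, and that the first-year loss accrues evenly. The function name `projected_loss` and both constants are illustrative choices, not something taken from the study.

```python
# Back-of-the-envelope sketch of the decay figures quoted above.
# Assumption: after year one, 0.02 percent of the ORIGINAL pool of links
# disappears each day (a linear reading of the quoted rate).

FIRST_YEAR_LOSS = 0.11  # share of linked content gone after one year (quoted figure)
DAILY_LOSS = 0.0002     # 0.02 percent per day after the first year (quoted figure)


def projected_loss(days: float) -> float:
    """Estimated share of linked content missing `days` after an event."""
    if days < 0:
        raise ValueError("days must be non-negative")
    if days <= 365:
        # Assumption: the first-year loss accrues evenly over the year.
        return FIRST_YEAR_LOSS * days / 365
    # Cap at 1.0 so the estimate never exceeds "everything is gone".
    return min(1.0, FIRST_YEAR_LOSS + DAILY_LOSS * (days - 365))


if __name__ == "__main__":
    for years in (1, 2.5, 5):
        print(f"{years} years: {projected_loss(years * 365):.1%}")
```

Under this linear reading, the estimate at two and a half years comes out at roughly 22 percent, which is lower than the close to 30 percent the researchers actually observed, so the quoted daily rate is best treated as a rough guide rather than an exact fit.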

It’s not clear from the research why the missing information disappeared, but it’s likely that in many cases blogs have simply shut down or moved, or news stories have been archived by providers who charge for access (something that many newspapers and other media outlets do to generate revenue). But as the Technology Review post points out, this kind of information can be extremely valuable in tracking how historical events developed, such as the Arab Spring revolutions — which the researchers note was the original impetus for their study, since they were trying to collect as much data as possible for the one-year anniversary of the uprisings.

Other scientists, and particularly librarians, have also raised red flags in the past about the rate at which digital data is disappearing. The National Library of Scotland, for example, recently warned that key elements of Scottish digital life were vanishing into a “black hole,” and asked the government to fast-track legislation that would allow libraries to store copies of websites. Web pioneer Brewster Kahle is probably the best-known digital archivist as a result of his Internet Archive project, which keeps copies of websites dating back to the early days of the web (Kahle also has a related project called the Open Library).

Getting access to social data is not easy

Although the Virginia researchers didn’t deal with it as part of their study, a related problem is that much of the content that gets distributed through Twitter — not just websites that are linked to in Twitter posts, but the content of the posts themselves — is difficult and/or expensive to get to. Twitter’s search is notoriously unreliable for anything older than about a week, and access to the complete archive of your tweets is only provided to those who can make a special case for needing it, such as Andy Carvin of National Public Radio (who is writing a book about the way he chronicled the Arab Spring revolutions).

As my colleague Eliza Kern noted in a recent post, an external service called Gnip now has access to the full archive of Twitter content, which it will provide to companies for a fee. And Twitter-based search and discovery engine Topsy also has an archive of most of the full “firehose” of tweets — although it focuses primarily on content that is retweeted a lot — and provides that to companies for analytical purposes. But neither can be linked to easily for research or historical archiving purposes. The Library of Congress also has an archive of Twitter’s content, but it isn’t easily accessible and it’s not clear whether new content is being added or not.

Twitter has talked about providing a service that would let users download their tweets at some point, but it hasn’t said when such a thing would be available — and even if users did create their own archive in this way (or by using tools like Thinkup from former Lifehacker editor Gina Trapani) it would be difficult to link those in a way that would provide the kind of connected historical information the Virginia study is describing. And it’s not just Twitter: there is no easy way to get access to an archive of Facebook posts either, although users in Europe can request access to their own archive as a result of a legal ruling there.

For better or worse, much of the content flowing around us seems to be just as insubstantial as the clouds that it is hosted in, and the existing tools we have for trying to capture and make sense of it simply aren’t up to the task. The long-term social effects of this digital amnesia remain to be seen.

Post and thumbnail images courtesy of Shutterstock user Ribah and Flickr user See-ming Lee

  1. We have an app called Mettya (www.mettya.com) that solves this problem and a few others. It allows you to pull photos/videos from across the social web through keywords. The content is actually stored in the database so you can come back to it anytime. http://youtu.be/IwlOEVYEKmM And a vimeo http://vimeo.com/48667055

  2. While that sounds like a lot of lost information, considering the amount of information being created, that is still a lot more saved than lost, by several orders of magnitude, as compared with classical media just 50 years ago.

    More importantly, we need to distinguish between information and knowledge. The value of information is in every byte, while the value in knowledge is in the actual message. As knowledge is what truly allows for human development and societal advancement, this is what we should ensure does not get lost. In my opinion, the “weeding” out of some of the information actually aids in removing the haystack of information to find the needle of knowledge. Of course, some of the lost information may encompass a complete message of knowledge, but thinking in these terms will allow us to focus our energy on the actual goal, instead of the likely impossible goal of total recall.

    1. You are so right. The information has not disappeared. We live in an era where vastly more is recorded than ever before.

      Even today, most conversations are not recorded, but many more are.

      And the presumption here appears to be that a broken link means the data is gone? That’s just not how it works. Old links that lead to a “file not found” error often leave easy clues to where that data is. It could have been removed (a better word than “disappeared”), but quite likely the redirects changed, or a page file was relabeled and still exists on a drive somewhere. A few seconds’ search with the clues from the social media post and some common sense will find the source quite often.

  3. But do we need to keep everything? That’s the question. We don’t need to keep all of the day-to-day press so why keep every other little piece of data?

    1. We almost certainly don’t need to keep everything. The problem is that it’s virtually impossible to tell what historians will find significant in future – five, ten, twenty, fifty, a hundred years from now. Archaeologists can learn a lot from middens (rubbish heaps)! Sometimes the things we regard as insignificant now are very telling to future historians.

  4. I created an IFTTT recipe to archive my -most important- tweets.
    Regards,
    Michele (@steccami)

    1. Didn’t Twitter just shut that firehose down though?

      1. Yes, they did. That IFTTT recipe will soon be deleted.

  5. If only this was true. The problem now is that EVERYTHING is kept. Just cos you can’t see it doesn’t mean it’s not there. Give Google a shout. There’s far too much data and not enough intelligent humans to interpret it sensibly. We’re not meant to keep everything and one day, when the semantic web becomes a reality, it will come back to bite us.

  6. I can’t see how 30% of information being unavailable after two and a half years means that much of the content we use has been lost.

    I’d echo the comments below and go further and ask whether only 30% data loss is enough. I’d like a way of setting a decay date on data, especially on sites like Facebook and Twitter, to say “don’t show this after x months.”

    In a similar way should data be anonymised after a certain period of time?

  7. I agree this is an important problem. An individual can take some steps to deal with it prospectively: the excellent pinboard is archiving my tweets for me as I write, though good luck with downloading your data from Facebook, which claims to offer this service, though ItDidn’tWorkForMe™ when I tried it.

    The notion that past states of consciousness of the polity will only be available to business, and at a price, is pernicious and needs to be addressed. Hopefully the 2.0 equivalent of archive.org will arrive. But who’ll back that up? These single points of failure are both technically and politically unacceptable.

    Users will undoubtedly find such digital amnesia highly unsettling, with the effect of infuriated abandonment of service providers who do not address this seriously. Tech stockholders take note!

    I see a bright future for smaller service providers and data collectives for those who wish to take back some control of their data, but wish to avoid BigCorp solutions. This will have the advantage of introducing richer variety to the web ecosystem, and hopefully make it less vulnerable to failure.

  9. The underlying assumption behind the “alarm” over disappearance is that all information is/should be persistent (and by implication has constant value over time). It doesn’t – the world is built on decay (entropy…). There are multiple information-decaying forces at work. For example, most information exponentially diminishes in value over time (as do most people – they tend to eventually die). There are of course similar, opposite forces driving the preservation impulses: information (being remembered and sending messages over time) is the only practical way people have found to achieve a tiny bit of immortality. Just two of many interesting properties of information that we tend to overlook: http://goo.gl/n6pHW.

  10. The source post for this is alarmist and missing a key piece of information.

    Tweets aren’t disappearing. Access, however, is variable.

    These researchers didn’t research their subject very thoroughly.

  11. vancouver gadgets Thursday, September 20, 2012

    I think this post is taking a bit of an alarmist viewpoint. To state that “information decay is eating away our history” is a little too extreme considering the amount of data we now accumulate. Do we really need to archive every single conversation we have on every possible network? I don’t think future historians will have a problem documenting this point in history; in fact, I believe they will find there was a glut of content. Perhaps we need robots now more than ever to filter through and archive what we have accumulated, as it’s no longer possible for a human or a group of humans to parse through all this data; there’s just too much.

    1. I think the anxiety that the article is trying to explicate (but doesn’t) is that we don’t have control of exactly what disappears and the false promise of “everything’s saved online” and true realization of “there’s too much out there” blinds us to how much is actually missing without anyone paying attention until it’s too late.

      Beyond “valuable tweets” or Facebook updates, 10% of news, academic research that is (increasingly) accessible only online, and real-time conversations and analysis are disappearing each year as well. How do we identify what we _want_ to have preserved? Or do we just wait to see what serendipity left us in 5 – 20 – 100 years? And hope it’s what we wanted?

  12. It’s a good thing. Cream rises to the top and all the crap disappears!

  13. I think the problem is not just confined to social media and the disappearance of some web content, but the fact that even if we archive some of the web content, we will probably have no mechanism to view it in the future. Try viewing archived copies of a site on archive.org, and you will often notice that a 2001 page is still viewable, but a 2011 page from the same site is badly broken. This is due to rapidly changing technologies. I recently wrote a blog post on this that some of you may find interesting: http://www.kunalsen.com/blog.html?entry=near-impossibility-of-archiving-web

  14. Should there not be a wikinews or wikiarchive out there?

  15. Is history found in history books, or really in letters, old catalogues, cereal boxes...?

  16. @Blake – the links are changing / disappearing, not just twitter content itself.
    @Douglas – I have downloaded my data from Google Takeout as well as Facebook.

    Gnip is fascinated by “the conscious thought of the world” being available since 2006,
    well at least some part of the digital world, science wants to create life, and says the
    possibilities are “endless”, sounds like a fantasy, what of this digital data will be useful
    in 100 years? will it say more than what we were seeing about ourselves?

    history is in the hands of the big few companies, whose hands are they in?

  17. Definitely a subject not talked about enough in mainstream media as we move off local into the cloud – with fragmentation and lack of control only accelerating. There were probably ~100-200 people, mostly academics, at the Personal Digital Archive conference at the Internet Archive the past 2 years. Brewster Kahle & Jason Scott (@textfiles) are taking on a huge task and I applaud them and support their efforts.
    We at @miLifemap launched into beta a personal storage, backup and archive app for your meaningful memories. We allow our members to upload photos, videos, and write diaries while backing up their tweets (until Twitter boots us), FB, Flickr & 4sq, all on an interactive, searchable timeline. It is quite powerful in terms of having a running archive and controlling your self-generated content, which in turn becomes your life story. You can set up an eBeneficiary, allowing for control to be maintained by a family member posthumously.
    We’re still in the early days, the wild west, and I look forward to the day when the user sits in the centre and controls their data and web experience.

  18. Conversations have always been ephemeral. Before the digital age nearly 100% of them were lost. We have substantially more information now than ever before, obviously. Is there a need to record for history 100% of everything ever composed by everyone? I think not.

    Before the Web, there was likely one version of any newspaper article that survived, the final edition. The earlier editions would likely not be kept in the paper’s morgue as they would have taken up too much room. Now we have a much greater chance of seeing stories as they grow because they are replicated by so many companies (the Factivas and LexisNexises of the world, for example), even if there’s not (yet) one place that has everything, nor a place where you can find everything for free.

  19. Just because we are now capable of storing tera-, exa- or zettabytes of data, should we really store all of it? Or should we look at storing only the really important parts? And how do we decide what is important now? We can argue for both sides. On one side, much of Twitter data is temporal and devoid of any information. It’s often just like our daily casual conversations. Even though the technology exists, we still do not record every moment of our lives. We choose judiciously which life events to record. On the other side, it is impossible to predict now how the data will be used in the future. So let us store it all. Technology will surely provide us enough computing power to process zettabytes in seconds.

  20. While you’re right that verbal conversations were lost, in the past people *wrote* things, like letters home from a soldier, lengthy essays reprinted in handouts or pamphlets, etc. These are what historians refer to as “primary sources,” the actual accounts of people who were actually there, which are used to piece together a (hopefully) accurate and nuanced history. Now that nearly all our interactions are digital, these primary sources are drying up, and that does create a problem for future historians.

  21. “Disappearing Web: Information decay is eating away our history”
    So, who’s to blame? Archivists, reposting, error corrections, historical revisionists?
    Not clear on the metrics/methods used to discover this trend but this is interesting and convincing reading.

  22. Hitachi hopes to bring a new data storage system to the market by 2015. The technique uses a laser to etch data onto four layers of quartz glass, which is extremely resistant to water and heat.
    http://qz.com/7918

  23. I wonder when this article itself will disappear.

  24. I hear what you are saying. To be honest, I like physical books, old books and new books. As one of my college history professors said, for information, past books are just as important as newly written ones. Take three on the same subject, read them and see what fits logically, so to speak. I held on to that specific idea. I won’t forget the importance of an old book and a new author. There’s wisdom to be found in a creative way! I like what you have said above: “That’s great if you like real-time content, but there is a not-so-hidden flaw: namely, that you can’t step into the same stream twice, as Heraclitus put it. In other words, much of that information may (and probably will) disappear as new information replaces it, and small pieces of history wind up getting lost. According to a recent study, which looked at links shared through Twitter about news events like the Arab Spring revolutions in the Middle East, this could be turning into a substantial problem.”

  25. Ariadne Etienne Tuesday, November 6, 2012

    I noticed this problem way back in 1995, which is why I generally download whatever information or even entire site I find useful, when I find it. I insist that the usefulness of the Internet as a tool is severely hampered by the fleeting nature of information stored upon it, and this will not be remedied until such time as public servers (nationalized, if you will) exist, and corporations are put in their place with regard to copyright law, censorship, and denial of service. If one takes a close look at human history, it’s the information which makes us human, it’s our culture, it’s our life blood, and today corporations and near-sighted neo-fascist governments are chipping away at that.
