
Summary:

Holding onto millions of pieces of archived content it still wanted to monetize, the Associated Press turned to MarkLogic, a NoSQL database designed for storing XML content. As publishers try to leverage years’ worth of archived, often untagged content, they’ll need new tools.


Holding onto millions of pieces of archived content it still wanted to monetize, the Associated Press turned to a NoSQL database. Specifically, it turned to MarkLogic, a non-relational database designed for storing and accessing lots of XML-based content — like the stuff the AP has lying around — and that has already earned itself quite a following among media companies.

What the AP wanted to do, VP of information management Amy Sweigert told me, is build an application that would let it search through its mountains of archived content so it could better analyze that information. Internally, the AP wants to better understand how much content it’s publishing on any given topic and in what formats (e.g., stories, photos, videos), but it also wants to deliver custom data sets to business-to-business customers based on whatever their needs might be.
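For a rough sense of what that kind of search looks like in practice, here is a minimal sketch against MarkLogic’s REST search endpoint. The host, port, credentials and topic list are placeholders rather than details of the AP’s actual application.

```python
# A rough sketch of the kind of topic query described above, against MarkLogic's
# REST search endpoint (/v1/search). The host, port, credentials and topics are
# placeholders, not details of the AP's actual application.
import requests
from requests.auth import HTTPDigestAuth

SEARCH_URL = "http://localhost:8000/v1/search"    # hypothetical REST app server
AUTH = HTTPDigestAuth("rest-reader", "password")  # MarkLogic defaults to digest auth

def hits_for_topic(topic):
    """Return how many archived documents match a topic search."""
    resp = requests.get(
        SEARCH_URL,
        params={"q": topic, "format": "json", "pageLength": 1},
        auth=AUTH,
    )
    resp.raise_for_status()
    # the JSON search response reports an overall match count in "total"
    return resp.json()["total"]

# tally coverage of a few (made-up) topics across the archive
for topic in ["olympics", "elections", "euro crisis"]:
    print(topic, hits_for_topic(topic))
```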

According to Sweigert, the AP had to go with a non-relational database for a variety of reasons, chief among them scale and freedom from schemas. Her team had actually built a relational database, but content volumes kept growing (the new system holds about 120 million pieces of content), and the team wanted the flexibility to run new kinds of searches without writing complicated queries and, more importantly, without reconfiguring the database to support each new method of searching. So the old database had to go.
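The schema point is easy to picture: in a document database, a story and a photo item with completely different shapes can be loaded side by side and searched immediately, with nothing like an ALTER TABLE migration in between. A hedged sketch, again against MarkLogic’s REST API with made-up URIs and fields:

```python
# A hedged sketch of the schema-freedom point: XML items with completely different
# shapes are loaded side by side through MarkLogic's /v1/documents endpoint and are
# searchable right away, with no schema migration in between. URIs, element names
# and credentials here are made up for illustration.
import requests
from requests.auth import HTTPDigestAuth

DOCS_URL = "http://localhost:8000/v1/documents"
AUTH = HTTPDigestAuth("rest-writer", "password")

story = "<story><headline>Budget vote delayed</headline><body>Lawmakers put off the vote.</body></story>"
photo = "<photo><caption>Flood waters rise downtown</caption><credit>Stringer</credit></photo>"

for uri, xml in [("/archive/story-001.xml", story), ("/archive/photo-042.xml", photo)]:
    resp = requests.put(
        DOCS_URL,
        params={"uri": uri},
        data=xml,
        headers={"Content-Type": "application/xml"},
        auth=AUTH,
    )
    resp.raise_for_status()  # both shapes land in the same database, no schema change
```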

Sweigert said many large publishers are moving toward an XML-centric data model, if they’re not already there, because the format makes it so much easier to work with old content that doesn’t necessarily have metadata associated with it. What’s more, she said, the AP is actually using MarkLogic to help add metadata to some of that old content.
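The article doesn’t describe how that enrichment works under the hood, but the general idea (derive metadata from the text itself and write it back into the archived XML) can be sketched in a few lines. The element names and the naive keyword rule below are assumptions for illustration only.

```python
# One way to picture "adding metadata to old content": derive simple keyword
# metadata from the item's own text and attach it to the archived XML. The element
# names and the naive frequency rule are assumptions for illustration, not the
# AP's actual enrichment process.
import xml.etree.ElementTree as ET
from collections import Counter

def enrich(xml_text, top_n=3):
    """Append a <metadata> block of crude keywords to an untagged XML item."""
    root = ET.fromstring(xml_text)
    body_text = root.findtext("body") or ""

    # crude keyword pick: the most frequent longer words stand in for real tagging
    words = [w.strip(".,").lower() for w in body_text.split() if len(w) > 5]
    keywords = [word for word, _ in Counter(words).most_common(top_n)]

    meta = ET.SubElement(root, "metadata")
    for kw in keywords:
        ET.SubElement(meta, "keyword").text = kw
    return ET.tostring(root, encoding="unicode")

old_item = "<story><body>Olympic organizers said the Olympic venues remain on schedule.</body></story>"
print(enrich(old_item))
```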

In that regard, the AP’s new database sounds similar to the value proposition for publishing analytics tools like Parse.ly, which launched earlier this year and already has some big-name clients under its belt. Parse.ly analyzes clients’ web content based on the text rather than the metadata, which means publishers without strict metatagging procedures or crack data analysts can still get deep insights into what topics are driving traffic.

However they do it, the rationale is the same: find a way to keep making money off years’ worth of archived content, either directly or indirectly. The direct route is probably akin to what the AP is doing with its business partners, while the indirect route is the same story as any analytics effort: using older content to help identify trends that can influence future decisions on both content and products.

Image courtesy of Flickr user DBduo Photography.

  1. > In that regard, the AP’s new database sounds similar to the value proposition for publishing analytics tools like Parse.ly, which launched earlier this year and already has some big-name clients under its belt. Parse.ly analyzes clients’ web content based on the text rather than the metadata, which means publishers without strict metatagging procedures or crack data analysts can still get deep insights into what topics are driving traffic.

    Derrick really… I suggest you do better research for your articles.

    There are superior methods to auto-provision infrastructure and software for Linked Data services at the click of a button, to dynamically discover and virtually integrate private and public data sources through federation (from SQL and NoSQL to RDF/Linked Data), and then, with a self-service UI concept, to let users build interfaces and visualizations on top of the data without doing any programming.

  2. Aaron Rosenbaum Tuesday, March 20, 2012

    Steve,

    I don’t think fluid operations (your company) or Parse.ly enrichment/tagging platforms negate use of a robust scale-out database like MarkLogic. I don’t think the customers are confused (AP joins a staggering number of the world’s top publishers who are running MarkLogic.)

    1. Hi Aaron – I didn’t say anything about negating use of MarkLogic, which is robust and widely used. In fact, the BBC is using it along with an Ontotext RDF triplestore – see Dynamic Semantic Publishing Empowering the BBC Sports Site and the 2012 Olympics http://bit.ly/zIbosW. Ergo my point to Derrick about methods to virtually integrate, through federation, private and public data sources across SQL, NoSQL and RDF/Linked Data.
