Summary:

Part of a data scientist’s role is to automate his or her own work, leading to tools like prediction APIs. But these APIs are starting to replace data scientists in some areas, and that should be considered a good thing.

We are now in the era of big data 2.0, as defined by Foster Provost and Tom Fawcett in Data Science for Business. There’s growing interest in predictive analytics solutions powered by machine learning. As InsightsOne’s CEO Waqar Hasan puts it: “Predictive is the ‘killer app’ for big data.” Quite interestingly, McKinsey & Company predicted a shortage of machine learning talent in the coming years, and at the same time, we started to see services that made machine learning and predictive analytics accessible to the masses. We’re seeing more and more of these services: Apigee launched one last April, just a couple of months after buying InsightsOne.

One of the first things I learned when I took computer science classes at university was that it was our job to “put ourselves out of business.” There are things that we do by hand, and our job as computer scientists is to make programs that do the same things, then other programs that replace them and are quicker, more reliable, require less maintenance, and so on. The same applies to data science.

Technology that replaces data scientists

Most of a data scientist’s time is spent creating predictive models: finding the variables that matter for making predictions, the right type of model, the best set of parameters, etc. Work is being done to automate all of this, and so far it has resulted in solutions such as Emerald Logic’s FACET and in the creation of prediction APIs such as Google’s and Ersatz Labs’. These APIs abstract away the complexities of learning models from data. You can focus on preparing the data (collecting, enriching and cleaning it); you then send that data to the API, which automatically creates a model and uses it when you ask for predictions.
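
To make the workflow concrete, here is a minimal sketch of what calling such a service can look like from the client side. The endpoints, field names and credentials below are hypothetical placeholders, not the actual interface of Google’s, Ersatz Labs’ or any other provider’s API.

    import requests

    API_ROOT = "https://api.example-predictions.com/v1"  # hypothetical provider
    HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}    # placeholder credentials

    # 1. Upload the prepared (collected/enriched/cleaned) training data.
    with open("customers.csv", "rb") as f:
        dataset = requests.post(f"{API_ROOT}/datasets",
                                headers=HEADERS, files={"file": f}).json()

    # 2. Ask the service to build a model; the algorithm and its parameters
    #    are chosen on the provider's side, not by us.
    model = requests.post(f"{API_ROOT}/models",
                          headers=HEADERS,
                          json={"dataset_id": dataset["id"],
                                "objective": "churned"}).json()

    # 3. Request a prediction for a new, unlabeled example.
    prediction = requests.post(f"{API_ROOT}/models/{model['id']}/predict",
                               headers=HEADERS,
                               json={"input": {"age": 42, "plan": "premium",
                                               "monthly_usage_hours": 3.5}}).json()
    print(prediction)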

These new tools imply a new paradigm in which no data scientist is involved, but everyone else in the company is: business execs set a vision, managers define specs for integrating predictions, and software engineers work on the implementation. This requires that everyone know a bit about machine learning, but even non-technical people can pick that up quickly if they skip the algorithms and theory and focus on the core concepts, intuitions and possibilities of machine learning, along with a few key examples.

Actually, if domain experts are in charge, they are more likely to incorporate domain knowledge into the system they’re building, by picking the right representations of the domain (“features”) with which to make better predictions.
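
For example (a hypothetical churn scenario, purely illustrative), a domain expert who believes that weekend activity and recency of use are what really drive churn can expose that knowledge directly as features:

    import pandas as pd

    # Hypothetical raw usage log for a churn-prediction task.
    usage = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "timestamp": pd.to_datetime(["2014-05-03 10:00", "2014-05-05 09:30",
                                     "2014-04-01 18:00"]),
    })

    # Domain knowledge encoded as features: weekend usage and how recently
    # a customer was active are believed to be predictive of churn.
    usage["is_weekend"] = usage["timestamp"].dt.dayofweek >= 5
    features = usage.groupby("customer_id").agg(
        weekend_sessions=("is_weekend", "sum"),
        last_seen=("timestamp", "max"),
    )
    features["days_since_last_seen"] = (
        pd.Timestamp("2014-05-07") - features["last_seen"]
    ).dt.days
    print(features[["weekend_sessions", "days_since_last_seen"]])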

It can only go further

Machine learning is a set of artificial intelligence techniques where “intelligence” is built by referring to example data.

We’re building artificial intelligence but still need manual model/algorithm selection and tuning? Surely we can come up with an intelligent and automatic way to do this! Hence a trend in artificial intelligence of building “meta AI algorithms” whose job is to find the right AI algorithm with the right parameters for a given problem.

The way to do this in machine learning can be principled, such as with probabilistic inference for setting parameters and finding weights to assign to features. It can also be brute force: the computational power we have today allows us to rapidly test a multitude of possibilities and see what works well. This brute-force search can be plain cross-validation, or it can be based on evolutionary techniques, as is the case with Emerald Logic’s FACET.
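
As a minimal sketch of this brute-force approach (an illustration of the general idea with scikit-learn, not how FACET or any particular prediction API does it), a grid of candidate algorithms and parameters can be scored by cross-validation and the best combination kept:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # One pipeline whose final step ("clf") can be swapped out: the grid below
    # tries two different algorithms, each with its own parameter candidates.
    pipe = Pipeline([("clf", SVC())])
    param_grid = [
        {"clf": [SVC()], "clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.1]},
        {"clf": [RandomForestClassifier()], "clf__n_estimators": [10, 100]},
    ]

    # 5-fold cross-validation scores every combination, then the best one is
    # refit on the full training set.
    search = GridSearchCV(pipe, param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)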

Detecting (and thus exploiting) domain specificities by looking at data starts with simple things. For instance, if we see data for a binary classification task in which classes are strongly unbalanced, then an anomaly detection algorithm should be used.
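
A sketch of that heuristic (the 5% threshold and the choice of IsolationForest are my own illustrative assumptions, not a prescription from the article): inspect the label distribution and, when the positive class is rare, fall back to an anomaly-detection approach instead of a standard classifier.

    import numpy as np
    from sklearn.ensemble import IsolationForest, RandomForestClassifier

    def fit_for_binary_task(X, y, imbalance_threshold=0.05):
        """Pick a modelling strategy based on how unbalanced the labels are."""
        minority_fraction = np.bincount(y, minlength=2).min() / len(y)
        if minority_fraction < imbalance_threshold:
            # Rare positives: treat them as anomalies and learn the shape of
            # the majority class instead of a conventional decision boundary.
            # (Assumes at least one positive example, so contamination > 0.)
            model = IsolationForest(contamination=minority_fraction)
            model.fit(X[y == 0])  # fit on the "normal" class only
        else:
            model = RandomForestClassifier().fit(X, y)
        return model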

What will data scientists do in the future?

Many will say that there’s only so much you can automate. Sure, there are areas of ML where automating everything is still very hard, but there are also cases where prediction APIs work very well in comparison with “traditional” techniques. The value to be created right now in those cases is huge.

Because of these new tools, the role of the data scientist is evolving. It might be getting easier to become one, but what is certain is that doing the same job as a data scientist, without being one, is getting even easier thanks to prediction APIs. Their jobs end up being performed by database and software engineers, which has led some to go as far as to say that data science is not a real thing. I would just say that data science is evolving.

In the world of prediction APIs, data scientists still have a role: helping teams use these APIs and become autonomous. When their expertise is needed, it should be in a supervisory role, with much less involvement than before for a similar result.

Most importantly, data scientists should keep working on automating more machine learning techniques. It’s encouraging that, after supervised learning, we’re now seeing reinforcement learning APIs. Work is also needed to create simple formalisms that allow domain experts to describe more of their domain’s specificities and encode their knowledge into algorithms.

“If we can get usable, flexible, dependable machine learning software into the hands of domain experts, benefits to society are bound to follow.” — Dr Kiri L. Wagstaff, researcher at NASA JPL

Louis Dorard is the author of Bootstrapping Machine Learning, the first guide to prediction APIs. He helps companies exploit the value of their data. You can follow him on Twitter @louisdorard.


8 Comments

  1. Steve Ardire Wednesday, May 7, 2014

    “If we can get usable, flexible, dependable machine learning software into the hands of domain experts, benefits to society are bound to follow.” — Dr Kiri L. Wagstaff, researcher at NASA JPL

    you nailed it @louisdorard

    1. Louis Dorard Thursday, May 8, 2014

      Thanks Steve!
  2. “Part of a data scientist’s role is to automate his or her own work.” The problem is that most of these people are analysts and thus not developers (what you call computer scientists???), and thus are not very good at programming, let alone at automation and building APIs. Partially because “most of a data scientist’s time is spent creating predictive models” (your words) and partially because most developers are not very good at automation/APIs. My experience tells me that the data scientists are working with people capable of doing this. (I am sure that you can come up with an example of the exception, not the rule.)

    1. Louis Dorard Thursday, May 8, 2014

      True. To create prediction APIs you need to bring together different talents, not just Data Scientists or ML experts. This is what BigML have managed to do. Software engineering is key.

  3. Andrew Stewart Thursday, May 8, 2014

    It depends on how you define data scientist. If you’re just talking about BI and analytics folks then sure there’s a ton that can be automated there, but if we’re talking about data scientists as applied AI research scientists, then those jobs aren’t being automated any time soon. They’ll probably be the last to be automated.

  4. Color me unimpressed. In the BigML guy’s blog it shows him choosing a random forest classifier, and then choosing to increase the number of trees learned (over the default of 10). How is this revolutionary “prediction API” different from any other machine learning workbench out there? (and don’t just tell me: “because it’s in the cloud”). I thought we didn’t have to worry about manually tuning parameters anymore now :-) Or have I missed something fundamental here?

    1. Louis Dorard Thursday, May 22, 2014

      BigML can also work in a fully automated way. But if you want, you do have the option to tweak a few parameters when creating your predictive model. Same thing with Ersatz Labs.

      To me, the fact that it runs in the cloud is a big deal. It means that 1) you don’t have to worry about infrastructure or scalability, 2) you can handle huge amounts of data, 3) adoption is facilitated since there’s nothing to install.

