Why becoming a data scientist might be easier than you think

Maybe the business world has jumped the gun with all the talk about a looming skills shortage in big data and advanced analytics. There’s mounting evidence that it doesn’t take much to turn a novice programmer or statistician into a perfectly capable data scientist. Maybe all it takes is just some cheap cloud computing servers, or a few weeks studying machine learning with Stanford professor Andrew Ng on Coursera.

Much of this evidence comes via Kaggle, a platform where companies and organizations award prizes for the best solutions to their predictive-modeling needs. In September, for example, I covered a first-time Kaggle user and admitted data science neophyte named Carter S. who won a competition using a simple but effective method he dubbed “overkill analytics.”

Impressive, sure, but Carter builds insurance-industry risk models for a living. While he’s able to learn new techniques such as natural-language processing and social network analysis as he goes, he’s no stranger to a linear regression. But what if someone’s only formal experience with computer science was a single undergraduate programming course?

Ask Luis Tandalla. That was his case before he took a handful of free online classes last year on Coursera. Yet the University of New Orleans senior recently scored his first victory in a Kaggle competition hosted by the Hewlett Foundation where he had to devise a model for accurately grading short-answer questions on exams. Not bad for a college senior who didn’t really know what artificial intelligence and machine learning were before he signed up to learn them.

Once Tandalla got started, he told me, he got passionate about learning more. So he also took Coursera classes on natural-language processing and probabilistic models, began studying on his own outside the online lectures and even got active on Kaggle (this was his first victory in five competitions). He’ll receive his bachelor’s degree in mechanical engineering in May 2013, but now Tandalla says he wants to pursue a master’s degree in machine learning and start his own predictive-software company.

The Coursera connection

Maybe Tandalla isn’t so unique after all. The second- and third-place finishers in the Hewlett Foundation competition, it turns out, also learned machine learning on Coursera. The latter, Xavier Conort, is a 39-year-old actuary from Singapore who decided only last year to become a data scientist and is now Kaggle’s top-ranked competitor.

Stanford professor and Coursera co-founder Andrew Ng, who teaches the machine-learning class that all three top finishers took, doesn’t think their success is just coincidence. If you’re not trying to make the types of contacts students at top universities are after, and your goal isn’t to perform advanced research, he explained, online education platforms such as Coursera (and, I’ll add, Udacity and EdX) can be incredibly valuable.

In particular, Ng said, “Machine learning has matured to the point where if you take one class you can actually become pretty good at applying it.” Familiarity with algebra and probability is certainly helpful, he added, but the only real prerequisite to his course is a basic understanding of programming.

And with machine learning becoming “one of the more highly sought-after skills in Silicon Valley,” Ng said, corporate recruiters say just completing a single course can significantly boost someone’s salary and job prospects at companies where such knowledge is still in short supply.

“I bet many students are going on to [do] great things because of these courses [even if we never hear about it],” Ng said.

Why it works, and why it could change the world

Ng thinks the current incarnation of online education platforms works so well because they’re essentially nurturing the already-talented students who seek them out. Some professionals, he explained, take courses to learn skills such as machine learning or iOS programming that weren’t in vogue or didn’t even exist when they earned their computer science degrees just a decade ago.

Furthermore, with students able to learn at their own pace, there’s a lot of valuable information disseminated in the discussion forums.

Free access to the best teachers around doesn’t hurt either. Ng said he couldn’t teach his course so well if he hadn’t spent so much time living in Silicon Valley learning best practices from some of the smartest computer scientists on the planet. That experience lets him spend less time teaching algorithms for the sake of algorithms and more time talking about how one might actually apply machine learning in the field.

Ng says that’s more important than just understanding the algorithms in a vacuum. He compares it to learning how to write a computer program rather than learning only the syntax of a programming language without being able to string commands together into something useful. This approach isn’t entirely unique among the new order of online educators: On Udacity, for example, Google VP and Stanford professor Sebastian Thrun centers the Computer Science 101 curriculum around learning Python in the context of building a working search engine.

The value of this opportunity wasn’t lost on Tandalla. He said he can feel the passion that professors have even through the pre-recorded video lectures, and it feels good knowing you’re learning from the people who literally wrote the book on the subject you’re studying.

Who knows who’s the next Einstein

But ultimately, minting new data scientists, even Kaggle winners, is low-hanging fruit. Ng said we don’t yet know how much impact online education platforms like Coursera can have. In all fields, there are talented people all over the world who just need an avenue to hone their skills and a chance to distinguish themselves.

“It makes me wonder,” Ng said, “if the next Albert Einstein is a little girl in Afghanistan who just needs [the opportunity to access quality education].”

24 Responses to “Why becoming a data scientist might be easier than you think”

  1. Nagwa Abou El-Naga

    Thanks to Professor Ng, who touches on an essential issue in talking about the many talented people all over the world who just need an opportunity to distinguish themselves. In my teaching career I found that many students would die for an affordable chance to learn, but their dreams ended because their financial situations left them unable to cover the tuition fees and the long distances to the places in the world that offer advanced learning experiences.
    Nagwa Abou El-Naga

  2. That’s exactly why these online classes are so good — smart people are everywhere and most of them, for different reasons, are not lucky enough to attend top universities. It makes the knowledge accessible to all.
    Does that mean it’s easy to be a great data scientist? Of course not. ML is hard and applying it successfully in the real world is even harder. Not to mention that a data scientist needs to know other stuff besides ML. It’s not just about collecting data and running your favorite algorithm in Weka.

  3. Tamim Shahriar

    I, along with 4 colleagues from my company, also took the free ML and AI classes offered last year by Andrew and Sebastian / Norvig. We completed both courses, and the ML course was really helpful. It introduced me to a lot of things I wasn’t familiar with (I am a CSE graduate but didn’t have an ML course in my undergrad program). After completing the courses, my colleagues also participated in a contest on Kaggle, though they didn’t win. The outcome of these courses is that I am working remotely (from Dhaka, Bangladesh) for a US startup and helping them build their product recommendation engine. Though it’s not rocket science, I got the idea and the confidence to work in this field after I took the ML course. I am also doing another course on Big Data and Web Intelligence.

    Kudos to coursera and other similar online education ventures.

    • Derrick Harris

      All fair points. However, I’m not talking about someone with no knowledge of anything taking an ML course and becoming a skilled data scientist. I’m talking about people with some data background and a little programming skill. They might never become data thought leaders, but they can become very valuable employees.

    • Jin-Gap Kim

      First and foremost… this Coursera ML class is not about unleashing full-fledged data scientists to this world. It’s but one baby step toward creating capable machine learning programmers, if they are willing to invest their time and effort into it.

      As such, while I find that all your points are valid, I don’t particularly think your points are relevant to those who take coursera machine learning class. I do feel, however, that anybody serious about becoming a data scientist should be aware of the things you have pointed out.

    • Joseph, I found your blog post very informative. I agree with you that a course like this isn’t going to make you a fully fledged data scientist, but it would make it much easier to converse with an experienced data scientist, as part of a team or on the client-side.

      I would be interested to know what you would advise to someone commissioning data analysis, who is expert in their own field but not in data science or programming.

    • Derrick Harris

      I don’t know if that’s sarcastic or not. Either way, though, I don’t think anyone thinks that’s necessarily true. But if you just need to learn the basics, even some relatively advanced stuff, this can be a great way.

      Lecture-wise, an online course is just a bit shorter than in-person each week — trimming out repetition, questions of the class, etc. — so it requires some time investment and probably extra-curricular studying to do it right. But the real benefit in earning an advanced degree is probably in the networking, research, course diversity, etc.

      • I have a PhD in mathematics, and I haven’t yet decided whether it’s a waste of time, or not. I loved learning mathematics, but now that I have graduated, I haven’t figured out what to do with myself. Perhaps a machine learning course is in order.

      • Brilliant article, thank you Derrick. These free online courses offer an amazing opportunity. Particularly for parents who are stuck at home and don’t have much money, and for whatever reason need to create a change in their career path.

        My husband and I (architect & engineer) are taking two EdX courses and a Coursera course right now (AI at Berkeley, Computer Science at Harvard and Social Network Analysis at Michigan). We’re working on them together instead of watching TV in the evenings when our kids are in bed. We don’t have much money as we’re starting up a startup together and we chose to spend more time with our kids while they are small. It’s a lot of hours, but it’s fascinating and you just go at your own pace.

        We were surprised at how engaging and funny the professors are, and how relevant the applications are. We’re amazed at what can be done with these tools now. We will both be able to apply the knowledge into our fields of architecture and engineering – or conversely, to apply into software development our combined knowledge of ‘IRL’ design, construction, project management, multidisciplinary team management and client interaction.

        A word of warning: these courses are addictive!

        [Shout to our brilliant professors: Coursera MichiganU Social Network Analysis = Lada Adamic, EdX Berkeley AI = Dan Klein, EdX Harvard CS50 = David Malan]

    • Derrick Harris

      The top three Hewlett finishers took machine learning w/ Andrew Ng. I don’t know about the rest, but Luis also took AI, probabilistic models (I believe) and NLP.

  4. The article misses the contribution of open-source machine-learning software in reducing the barrier to entry to machine learning. Very often you can treat machine learning algorithms as black boxes and simply need to understand the inputs and outputs.

  5. I think that Kaggle is an immature way of defining a data scientist because many Kaggle problems are clearly defined, i.e. the data is in place and there is a metric that needs to be minimized. Data scientists need to have some sort of intuition about how to quantify causation vs. correlation, and that intuition is tougher to build through Coursera’s ML course. The real value data scientists provide is figuring out the correct questions to ask that can be answered by the data, and having the vision to see implementable results from raw data. That being said, it’s pretty damn easy to get that start with the toolbox.

    • Derrick Harris

      Good comment, and I agree to a degree. Kaggle competitions don’t necessarily align with the accepted definition of what a data scientist does, but they certainly require exercising part of that skill set.

      The bigger picture, though, probably is online education platforms such as Coursera et al. If you’re already working with a company’s data and know the business inside and out, you can become dangerous with the right new skills.

      • dylanhogg

        Is there an accepted definition of a data scientist? The best definition that I’ve found is summarised by the Venn diagram originally created by Drew Conway (I may be wrong there) that encompasses 1. math/stats knowledge, 2. hacking/programming skills and 3. domain expertise.

        In my experience Kaggle predictive analytics comps allow people to exercise the first 2 of these skill sets very well, while the third can be helpful but is nonessential.

    • I agree with you. The value of a so-called “Data Scientist” lies in how much value, namely the $$$, he/she can bring in based on data analysis. This is a combination of subject matter knowledge, an analytic skill set, and the capability to execute an analytic plan in a computing environment and transform analytic output into $$$ implications.

      Kaggle does have its value, in two ways. First, it is a power node in the social marketing network of the term Data Scientist. Second, it does specifically address part of the analytic skill set in the required portfolio.