This week Cloudera made waves by announcing a strategic new feature for its Hadoop platform. The press release touts the launch of a tool for “Self-Service Data Science for the Enterprise” that provides a native interface for machine learning on Hadoop. I think it’s worth giving this some voice on the blog because it falls right in line with a lot of current trends in the enterprise big data landscape.
Data service and data technology companies are all working to find a niche in the new AI/ML/data science world as attention and hype grow around applying these tools in the enterprise. Most industries haven’t fully adopted machine learning yet because data science talent is scarce across the typical company. Very few organizations can claim to have data scientists in every department, and those few are probably all consulting firms.
What’s interesting here is that Cloudera is delivering this capability inside the browser, like an IPython/Jupyter notebook. This kind of tooling is very popular in the open source community and with data-oriented developers, but it is definitely not the kind of thing we’re used to seeing in the enterprise. I personally love to use notebooks to plan talks and demonstrate all kinds of snippets. Kaggle also hosts lots of notebooks, which let data scientists show their work easily (probably the inspiration here).
Why so important? Because Hadoop vendors NEED to promote data science
Tons of large enterprises use Hadoop, but most of them haven’t really unlocked the promise of those installations (and the millions of dollars allocated to them) yet. They are all investments in the future. Now those investments need to pay dividends and generate business value, or else these installations will be considered underwhelming at best and failures at worst. Check out this figure from an O’Reilly report on the big data market:
Most enterprises aren’t that mature in their Hadoop practice or usage. Hadoop is not as sticky as vendors would like to see, with most companies classified as Lab Project users or Tire kickers. That is not exactly producing results.
Cloudera launching products like this workbench makes total sense: they lower the barrier to entry considerably and give Cloudera a chance to bring clients to the “we use Hadoop every day for critical business processes” stage.
Step in the right direction
Figure above from the original Cloudera blog post
This tool definitely looks like a step in the right direction, offering easy loading of files stored in Hadoop from a slick IDE. The barrier to entry will no longer be access to the data, or the significant technical (and security) hurdles of copying data from a corporate Hadoop cluster to play with it locally. Each barrier that falls will let companies spend much less time in the Lab Project and Application Development stages and move quickly to the Mature stage, where we’ll see more and more automation delivered as part of products and services.