Why Your Predictions Aren’t Worth As Much As You Think


Lately it’s really come into focus how difficult it can be to approve a model and get it to production in your typical dinosaur organization. Not only is it hard to get the right data, engineer and enrich features, select the highest-yield preprocessing and modeling techniques, and deploy, but after all of that, you probably still have a black box.

Unless you are using a stone-age model like logistic regression or some other GLM, you won’t be able to get more than a foggy idea of what’s going on. At best you’ve got the tree-based feature importances from a random forest. Cue the groaning of 10,000 analysts asked WHY a model produces a certain prediction, when the drivers of a given predicted value don’t seem to compute in the realm of human intuition. Too bad human intuition can’t look at a fresh neural net from TensorFlow and figure out what is going on under the hood. For that we need reason codes.
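To make the idea concrete: for an interpretable model like logistic regression, a per-prediction "reason" can be sketched by hand, since each feature’s contribution to the log-odds is just coefficient × value, and ranking those contributions gives a crude reason code for a single row. A minimal sketch, with made-up coefficients for a hypothetical pitching-prospect model (every name and number here is illustrative, not from any real model):

```python
# Crude "reason codes" for a logistic regression: rank each feature's
# contribution (coefficient * value) to the log-odds of one prediction.
import math

def reason_codes(coefs, feature_names, row, top_n=3):
    """Return the top_n features pushing this one prediction up or down."""
    contributions = [(name, c * x)
                     for name, c, x in zip(feature_names, coefs, row)]
    contributions.sort(key=lambda t: abs(t[1]), reverse=True)
    return contributions[:top_n]

def predict_proba(coefs, intercept, row):
    z = intercept + sum(c * x for c, x in zip(coefs, row))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical prospect model: high ERA hurts, elite strikeout rate helps.
names = ["era", "k_per_9", "age"]
coefs = [-0.8, 0.6, -0.1]
row = [4.5, 12.0, 21.0]  # high ERA, but a dominant strikeout rate, and young

print(predict_proba(coefs, 0.5, row))   # strong draft pick despite the ERA
print(reason_codes(coefs, names, row))  # k_per_9 dominates the prediction
```

This is exactly the transparency the "stone age" models give you for free; the point of reason-code tooling is delivering the same per-row story for models where the contributions aren’t a simple linear sum.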

DataRobot (and some other tools out there no doubt) provides reason codes and other features to make the predictions of a model completely transparent, letting us know why a pitching prospect is a fantastic draft pick despite a high ERA, why a flawlessly running truck might be called in for maintenance despite no obvious signs of a problem, and why a patient might need to stay in the hospital despite looking healthy enough to discharge.

Call me crazy, but I think the prediction itself is actually relatively low value in comparison to the combined prediction AND reason codes, especially in the phases before and shortly after deploying a new model.

This is the age when a director of analytics with dragons running under the hood should be hunting for ways to eliminate risk in all its forms.

Before I deploy, I want to take the reason codes for my model, build a histogram of the values from a few thousand prediction rows, and visualize the results in Excel or Tableau. This will begin to show us the combinations of factors driving predictions. If I can somehow deploy a slower model, I’d love to save the reason codes for EVERY prediction in a database so that I have an auditable trail if the predictions from the model were ever questioned. Why isn’t that a mega sign of quality? A model with a track record for its decision making is the one I’d pick.
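A minimal sketch of that audit trail, assuming each scored row comes back with a list of (feature, strength) reason codes; the table layout, column names, and example rows below are my own invention, not any vendor’s schema:

```python
# Persist reason codes per prediction to SQLite for an auditable trail,
# then histogram which features drive predictions across all scored rows.
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")  # swap in a real database path in production
conn.execute("""CREATE TABLE reason_log (
    prediction_id INTEGER, feature TEXT, strength REAL)""")

def log_prediction(pred_id, reasons):
    """Store every reason code for one prediction row."""
    conn.executemany(
        "INSERT INTO reason_log VALUES (?, ?, ?)",
        [(pred_id, feat, strength) for feat, strength in reasons])

# Hypothetical scored rows with their top reason codes.
log_prediction(1, [("era", -3.6), ("k_per_9", 7.2)])
log_prediction(2, [("k_per_9", 5.1), ("age", -1.0)])
conn.commit()

# Histogram of which features appear as reasons -- export this to Excel/Tableau.
top_reasons = Counter(
    feat for (feat,) in conn.execute("SELECT feature FROM reason_log"))
print(top_reasons.most_common())
```

With every prediction logged this way, "why did the model say that?" becomes a SQL query instead of an archaeology project.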

Who’s Buying? Cloudera Is The Must-Follow IPO Nightmare That We Can’t Take Our Eyes Off [Analysis]

The tech world has recently come to a stir over Cloudera’s huge S-1 filing, indicating their intention to IPO in the coming months.

A lot has been written about this in different circles, but I’d like to take a step back and think about what it means — selling Hadoop plus security and automation hasn’t yet proven to be even a break-even model for the firms in the business, yet the two major players will both be public. Whatever happened to businesses turning a profit, exactly? You say, what about SNAP and other social media players? Those are B2C: they don’t have people paying for services yet, they rely heavily on advertising, and the value is really in the data rather than the services, among other arguments. But enterprise software was supposed to be the place where companies actually made money, right? Not in Hadoop.

Why is it so costly? Here is why: these vendors specialize in a space that requires hiring expensive talent (look at the market for big data engineers and you’ll see why), lots of field sales work (on-premise software solutions costing many millions of dollars don’t sell over the phone), plus proof-of-concept work to make a sale. In addition, organizations require buy-in from the C-suite down to IT and Ops to make Hadoop happen inside a company, slowing sales cycles to an ant’s pace. Other vendors in the space like Alteryx, DataRobot, or Tableau don’t have concerns of the same magnitude, since the analytics slice of the market sits closer to the value drivers, showing clear ROI with lower upfront marginal costs.

Cloudera is also likely hugely overvalued relative to its last funding round. Check out the link and you’ll see some analysis covering why; it doesn’t look pretty.

Where is the golden path to growth?

We know the famous Forrester Research quote that “100 percent of large companies will adopt Hadoop in the next few years”. Does that mean this is a rapidly maturing market? Where is the mega growth potential for Hadoop if the largest potential customers are already on board? Going down-market with cloud/managed Hadoop solutions, as we see with Amazon EMR or Microsoft Azure HDInsight, could be the key move forward, making Hadoop licenses accessible to more mid-market players trying to catch up to the Fortune 500. This is the main play I see opening up a path to future growth for Hortonworks and Cloudera, especially if they can price these in a way that preserves a healthy margin. Doubling down on inside sales for high-availability private VPC deployments would cut customer acquisition costs (CAC) and upfront hardware costs (focusing instead on SaaS engagement and recurring revenue) and shorten deal cycles to the point where they could rapidly become very profitable. Note also that once an org has done the HUGE legwork of directing all its data streams into its Hadoop VPC, the solution itself is rather sticky. So why not disassociate the software from the cost of the box?

Let me wrap up by saying that I very much hope that Cloudera’s IPO is a smash hit, being in this industry myself (Cloudera is also a close partner of DataRobot). A profitable and wildly successful Cloudera bodes well for the rest of the new enterprise data stack (Hardware, Storage, Platform, ETL, Data Science, Viz) that is currently taking over the market.

As the market leader, Cloudera is aiming much higher than rival Hortonworks:


Hortonworks S-1 (2016: $185m revenue)

How to lose a lot of valuable data – Uber pulls out of Denmark


This past week the news broke that Uber (embattled in many ways at the moment) will be pulling out of Denmark in response to new taxi regulations requiring certain kinds of meters and seat sensors. I definitely have mixed feelings about this one.

Uber has been forced to release trip data in several US cities (including New York City, obtained by the data blog FiveThirtyEight: https://github.com/fivethirtyeight/uber-tlc-foil-response). This pullout means we may not see similar data from Denmark anytime soon, which is a big loss. The data would be very interesting to see because so many Danes bike to fulfill their normal transportation needs. What kinds of trips compel people to take Ubers in Denmark? What are the social needs this is filling that aren’t well served by bikes?

* Businesspeople in a rush to get to meetings across downtown Copenhagen? Then we should see significant usage between 8 AM and 4 PM on weekdays.

* Young people hopping between nightlife spots on weekend nights? Look for late-night trips from Thursday night through Sunday morning.

* Shoppers who have purchased a little more than they can bike home with in one go? Look in the afternoon/evening near shopping districts and malls.
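Those hypotheses are straightforward to test once pickup timestamps exist. A sketch, assuming nothing but a list of pickup datetimes (the bucketing rules below are my own rough cut at the bullets above, not anything from the released data):

```python
# Bucket trip pickups into the hypothesized use cases by time of day and week.
from collections import Counter
from datetime import datetime

def trip_category(ts):
    """Classify one pickup timestamp into a hypothesized trip type."""
    wd, hr = ts.weekday(), ts.hour  # Monday=0 ... Sunday=6
    if wd < 5 and 8 <= hr < 16:
        return "business"           # weekday daytime trips
    if (wd in (3, 4, 5) and hr >= 22) or (wd in (4, 5, 6) and hr < 4):
        return "nightlife"          # Thu-Sat late night into early morning
    if 15 <= hr < 20:
        return "shopping"           # afternoon/evening trips
    return "other"

# A few illustrative pickups; real analysis would parse thousands of rows.
pickups = [
    datetime(2017, 4, 4, 9, 15),    # Tuesday morning -> business
    datetime(2017, 4, 8, 1, 30),    # Saturday 1:30 AM -> nightlife
    datetime(2017, 4, 5, 17, 0),    # Wednesday 5 PM -> shopping
]
print(Counter(trip_category(t) for t in pickups))
```

Add pickup coordinates near shopping districts and nightlife zones and the buckets get much sharper, which is exactly what makes the loss of this data sting.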

This data would be a fascinating look into the role of auto transport in a place with fantastic bike support and public transport, identifying key gaps in the transportation landscape. Gaps that we might be able to better address and fill in the future with new niche startups, or self-driving cars.

On the one hand, this seems to be a step in the wrong direction. Uber’s model is infectious and feels like progress toward the sharing economy — it allows for 10x the flexibility and convenience of a normal cab for both drivers and riders. I can’t tell you how many drivers I’ve ridden with who are in between jobs, or saving up on the side for a wedding, a house, or something else important, and love that they can log in and make good money at a time that works for their schedule.

Uber downfall splits Danish politicians



On the other, this is going to stop a concerning trend whereby Uber is the latest step on a long road toward replacing established and well-protected labor with cheaper contracted labor. If we assume that there is a bump in demand for cabs in response to Uber’s entrance to the market (creating new customers from the pool of people for whom existing cabs aren’t the right fit), then Denmark is consciously deciding that replacing, say, 100,000 regulated and protected jobs with 120,000 contracted and unregulated jobs is too high a price to pay for mere extra convenience.

And who reaps the benefits of the squeeze on new contract workers? Parent company Uber, the new 20% of casual riders, and the upper-class users of the service who gain extra convenience and access to transportation. Uber can be viewed as a transfer of wealth from poor to rich, or as a way to make everyone wealthier, depending on how you see it. And now we won’t get to see any data to help us get to the bottom of it.

How to Achieve Better Customer Experiences with Machine Learning


I was interviewed last week on In The Know, a podcast covering different ways to improve customer and service experiences. We covered a scary amount of ground in just one episode, tackling questions like:

1. Machine learning vs data science…what’s the difference?
2. What types of business challenges are best addressed with data science and machine learning?
3. What are the most effective machine learning techniques?
4. How can machine learning and predictive analytics affect customer experiences?
5. What is the biggest misconception people have about data science and predictive modeling?
6. What are some real-world examples of data science and machine learning increasing revenue, reducing costs, and improving key metrics for enterprises?
7. How can you deliver better customer experiences with machine learning and predictive analytics?

Soundcloud – check it out!


Link to their post

Strategic: Cloudera Launches New Data Science Platform


This week Cloudera made waves by announcing a very strategic new feature plugged into their Hadoop platform. The press release touts the launch of a tool for “Self-Service Data Science for the Enterprise” providing a native interface for Machine Learning on Hadoop. I think it’s important to give this some voice on the blog because this falls right in line with a lot of trends right now in the enterprise big data landscape.

All data-service/data-tech companies are working to find a niche in the new AI/ML/data science world as attention and hype grow around the application of these tools in the enterprise. Most industries haven’t fully integrated machine learning because data science talent is scarce across most companies. Very few organizations can claim they have data scientists in every department, and those few are probably all consulting firms.

What’s interesting is that they are delivering this capability inside the browser, like an IPython/Jupyter notebook. This kind of tooling is very popular in the open source community and with data-oriented developers, but definitely not the kind of thing we’re used to seeing in the enterprise. I personally love using notebooks to plan talks and demonstrate all kinds of snippets — Kaggle also hosts lots of notebooks, which let data scientists show their work easily (probably the inspiration here).


Why so important? Because Hadoop vendors NEED to promote data science

Tons of large enterprises use Hadoop, but most of them haven’t really unlocked the promise of those installations (and the millions of dollars invested) yet. They are all investments in the future. Now those investments need to pay dividends and generate business value, or else these installations will be considered underwhelming at best, failures at worst. Check out this figure from an O’Reilly report on the big data market:

U.S. Companies using Hadoop

Most enterprises aren’t that mature in their Hadoop practice or usage. It’s not as sticky as they’d like to see, with most companies classified as lab-project users or tire kickers. Not exactly producing results.

Cloudera launching products like this workbench makes total sense — it reduces the barrier to entry considerably and gives them a chance to bring clients to the “we use Hadoop every day for critical business processes” stage.

Step in the right direction

Figure above from the original Cloudera blog post

This tool definitely looks like a step in the right direction, making it easy to load files stored in Hadoop from a slick IDE. Now the barrier to entry won’t be access to the data, or the significant technical (and security) hurdles of copying data off a corporate Hadoop cluster to play with it locally. Each barrier to entry that falls will let companies spend much less time in the lab-project and application-development stages and move quickly to the mature state, where more and more automation is delivered as part of products and services.



European FinTech industry descends on Copenhagen in June


With spring making its voice heard across the region, it’s time to highlight one of the most interesting conferences coming to Copenhagen this year — Money 20/20 Europe.

As the largest FinTech conference in Europe, Money 20/20 will bring product launches, lots of innovative business models (really, much of FinTech is about carving out a profitable niche with an innovative model in payments or lending), and an opportunity to really see what these companies are up to when it comes to utilizing the large amounts of data being generated on their platforms.