“In God we trust; all others must bring data.”
“You don't know what you don't know.”
― W. Edwards Deming
This should be obvious: decision making requires data. And yet, for some reason, not every manager grasps the obvious. Not only does decision making require data, it requires the right data, and it is not always clear what the right data is. Moreover, data doesn’t have to be perfect to give you insight. I commonly heard the refrain that all data must be cleaned into some common format before data scientists can derive insights from it. Nothing could be further from the truth. You can find something in any data. And maybe that something is actually nothing, but even that tells you something. Sounds absurd, I know, and reading that line back it is a bit of a mouthful.
You can’t dismiss a dataset merely because you think nothing of value is there. Quite possibly, combining that dataset with another ‘useless’ dataset creates a lot of insight. Or maybe while interrogating that dataset you realize the data is so corrupted that it is useless for what you are trying to accomplish. Datasets track information for a reason, and while that reason might not always be clear, understanding what is in a dataset will help you in your analysis at some point. I have yet to find a dataset that did not provide some value. Maybe it was a small addition to a complex process or a tiny piece of critical information that wasn’t obvious at first, but it provided information.
The first question any data scientist should ask themselves is what problem they are trying to solve. On second thought, everyone should ask themselves that question. What are you really trying to solve for? Are you trying to understand why there is a higher-than-normal defect rate in your manufacturing process? Are you trying to better understand how your clients are using your products together? Are you trying to do client behavioral analysis based on petabytes of application usage data? Maybe you are a small company merely trying to predict sales for next year to do better planning.
I am not a fan of handing a data scientist a pile of data and telling them to go ‘see what they find’. In the early days of my career in data science, this was the common approach to building models, and I too often thought it was the best approach. A scientist might find something of interest, but most likely they will find some piece of information that has no value to the business.
Business context and domain information are critical for a data scientist. Domain knowledge provides nuance when solving complex problems. Data is the foundation upon which a data scientist starts building a solution to a problem, but more data doesn’t necessarily mean better answers. It could, but don’t jump to that conclusion. Having the right data to help solve the problem is far more critical. Additional data can augment the solution and give you greater depth and nuance, but it’s a fallacy that having tons of data is a requirement for solving problems. More data might just make the problem far more complex than it needs to be. I think that layering in data as it is needed is a better approach: take a stab at a model, see what’s missing, and then bring in more data to enrich it. Starting small and continually augmenting is easier in the beginning. Now, I am not saying there isn’t a certain class of problems that requires massive amounts of data upfront, because there is, but those are the exceptions and not the rule (e.g., think client behavioral analysis).
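The “take a stab, then layer in data” approach can be sketched in a few lines. Everything here is illustrative, not from any real project: the dataset is synthetic, and `price` and `promo` stand in for an initial dataset and a candidate dataset you might layer in later. Fit a baseline with the data you have, then keep the extra data only if it measurably improves the model.

```python
# A minimal sketch of layering in data as needed, on synthetic data.
import numpy as np

rng = np.random.default_rng(42)
n = 500
price = rng.uniform(10, 100, n)                 # the data we start with
promo = rng.integers(0, 2, n).astype(float)     # a candidate dataset to layer in
sales = 200 - 1.5 * price + 30 * promo + rng.normal(0, 5, n)  # synthetic target

def r_squared(features, y):
    # Ordinary least squares fit with an intercept; returns in-sample R^2.
    X = np.column_stack([np.ones(len(y))] + features)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

baseline = r_squared([price], sales)           # take a stab with the obvious data
enriched = r_squared([price, promo], sales)    # layer in one more dataset
print(f"baseline R^2: {baseline:.3f}, enriched R^2: {enriched:.3f}")
```

If the enriched score does not beat the baseline, the extra dataset is only adding complexity, and the discipline of starting small has saved you the cost of carrying it.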
The right data, regardless of the size of the dataset, is far more critical. Now, I know that’s heretical to many in the tech world who build and sell systems that can handle trillions of rows of data, but for many problems a spreadsheet of data is all that you need to make progress and gain critical insight. The first analysis I ever did was based on spreadsheet data; it saved the firm over $10,000,000, and it took me all of five hours to create a model in an analytics tool. The key point here is that I had the right data, not a bunch of superfluous data that added no value but plenty of unneeded complexity. This wasn’t a massive data problem or even a big data problem. It was a small data problem. I was working with tens of thousands of records, but I had a deep understanding of the domain, and I knew what I needed to build a model that got me the results I wanted.
So, it all goes back to understanding what problem you are trying to solve. What is your ‘use case’, to use the common tech vernacular? This is true for any size organization, so I can’t say it enough: understand the problem you are trying to solve before you embark on grabbing data. Not only do I want to understand the problem, but I also take the long view: if I know that one question is going to lead to deeper questions that will require additional data, I build for that. I try to understand how I can apply data and data science to solving a broader class of questions before I paint myself into a corner. That’s experience.
Data Culture
I will make a statement, and feel free to disagree, but most companies do not have an active data culture. They will say that they do, but do they really? They will hire a Chief Data Officer (CDO) and generate beautiful presentations for investors, but behind the scenes what are they really doing? Most CDOs I have encountered are not engineers. They don’t know what it takes to truly build a data culture where everyone, from the most junior of programmers to the CIO, is fully vested in monetizing the data that is generated. Data is viewed as a side effect of operations. It is not viewed as an asset to be monetized. Crazy, I know, but this is how many organizations still operate. The heads of the business often don’t appreciate that their biggest asset is never monetized. And on the other end of the spectrum are data engineers who are happy to transform data all day long into a usable format for the data scientists but will never make recommendations to developers on how to streamline processing for modeling.
Firms generate billions of pieces of data on a given day, but is there any real thought going into what is captured and how it is modeled in the database? No. While the data might be accurate and clean, it is modeled in a way that makes it very difficult for a data scientist to perform analysis without first transforming it into a ‘model’ view that allows easier analysis. The operational representation and the analytical representation are two different things, but there is a middle view that simplifies the process and accelerates modeling. Programmers are not thinking about how to represent data for analysis when they model it for their applications, and it’s highly unlikely that they have ever spoken to a data scientist and asked whether there are changes that could be made to facilitate easier modeling.
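As a hedged sketch of what that transformation looks like in practice, consider raw application events (the operational representation) reshaped into one row per client (a representation a data scientist can model directly). The table, column names, and event types here are illustrative assumptions, not taken from any real system.

```python
# Reshaping an operational event log into an analysis-friendly view.
import pandas as pd

# Operational representation: one row per application event.
events = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2, 3],
    "event":     ["login", "purchase", "login", "login", "purchase", "login"],
    "amount":    [0.0, 25.0, 0.0, 0.0, 40.0, 0.0],
})

# Analytical representation: one row per client, with event counts
# per type plus total spend, ready to feed into a model.
features = (
    events.pivot_table(index="client_id", columns="event",
                       values="amount", aggfunc="count", fill_value=0)
    .join(events.groupby("client_id")["amount"].sum().rename("total_spend"))
    .reset_index()
)
print(features)
```

Every data scientist ends up writing some version of this reshaping; a middle view maintained alongside the application schema would let them skip it.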
Building a comprehensive data culture starts at the top of the proverbial house. This is not something that one person can do. Sure, a CDO can talk about it at length and educate everyone in a firm, but this has to be a culture shift. Employees need to understand that this is about the future of their firm and that everyone needs to look at the data generated through a very different lens. And that lens is one of monetization. Culture shifts start at the top. They don’t trickle up. The minions at the bottom of the ladder are not shifting the culture. A CEO needs to fully understand and take ownership of monetizing the firm’s data and making it available to data scientists. They need to be fully vested. Monetization of data is a long, cumbersome process that takes time to pay off.
Even small companies need to shift their views away from merely generating data toward utilizing the limited data they create for the benefit of their business. This is not a big-, medium-, or small-company problem but a mindset shift that all must embrace.
I’ve worked at firms that threw away good data because they simply thought it was of no value. Data that was an absolute goldmine for further building and optimizing their business. Yes, some of this data was dirty, unstructured, and complicated, but once you understood what was in it, you rapidly realized it could generate hundreds of millions of dollars in value. It could impact every single aspect of the business, from pricing to workforce optimization. And until someone with a very different perspective realized its value, it was all being thrown away.
A data culture starts at the top.