The Abacus and The Canvas

Luca Cazzanti's blog

Addressing the Three Types of Uncertainty to Build Successful Information Products

[Figure: longitudinal plots with missing data]
A while ago I was listening to a talk about uncertainty. The speaker broke uncertainty into three different types: randomness, incompleteness, and inconsistency. It made me think about how these three types of uncertainty come up when building data-driven information products. After all, building data-driven information products is about transforming raw data into information, and part of gaining information consists of reducing uncertainty. Machine learning is at the core of building modern information products, and indeed one may think of a machine learning model as a recipe for taming randomness, one type of uncertainty. I argue that successful information products address all three types of uncertainty, and that a model specification describes not only the particular machine learning method chosen for a specific problem, but also the data processing pipelines that prepare the data for that model: these pipelines typically address incompleteness and inconsistency. A data scientist must address all three.

Randomness

Randomness is the intrinsic variability of the data. We are more familiar with randomness than with the other types of uncertainty, because it is ubiquitous and at the core of data analytics tasks. We typically address randomness with probabilistic modeling, statistics, and machine learning. Consider tabular data: a standard assumption is that each row represents a sample from an underlying multivariate probability distribution. In fact, the notion of randomness is so deeply rooted in data analysis that even when we adopt non-probabilistic models we still think about robustness to randomness when evaluating a model's performance. This is why we keep separate training, cross-validation, and test subsets! So, from linear discriminant analysis to probabilistic graphical models, to decision trees and SVMs, randomness lives on, and many techniques exist to characterize its effect on the predictions and to cope with it.
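Here is a minimal sketch of that last point: even with a non-probabilistic model such as a decision tree, repeated train/test splits characterize how much the predictions vary with the randomness in the data. The synthetic dataset below is purely a hypothetical stand-in.

```python
# A minimal sketch: cross-validation as a way to measure how randomness in the
# data affects a model's predictions. The data here is a hypothetical placeholder.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                                     # 500 samples from an underlying distribution
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)   # noisy labels

model = DecisionTreeClassifier(max_depth=3)
scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

The spread of the fold scores, not just their mean, is what tells you how well the model copes with the intrinsic variability of the data.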

Incompleteness

This type of uncertainty refers to missing data. Perhaps a sensor became disabled during data acquisition, or respondents to your survey left some questions unanswered. Either way you have NULLs in your data table, and you must deal with them. If you can afford it (you have lots of data and relatively few NULLs), you could simply throw away the rows that contain NULLs. Another approach to dealing with incomplete data is to use one of the many flavors of imputation to fill the empty cells with a guess: the mean or median value, or a value drawn randomly from an (often estimated) prior distribution. Yet another approach is to consult subject matter experts to understand which imputed values make sense in their specific problem domain.
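As a minimal sketch of those two options, on a small hypothetical table with NULLs, you can either drop the incomplete rows or impute the empty cells, here with the column median:

```python
# Hypothetical table with NULLs: drop incomplete rows, or impute missing cells.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "temperature": [21.3, np.nan, 19.8, 22.1, np.nan],
    "humidity":    [0.42, 0.55, np.nan, 0.48, 0.51],
})

# Option 1: discard rows with any NULL (affordable only with few NULLs)
complete_rows = df.dropna()

# Option 2: impute each missing cell with the column median
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

Whichever choice you make, it is part of the model: a subject matter expert may tell you that the median is meaningless for a given column and that a domain-specific default is the right fill value.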

Inconsistency

In classification problems, inconsistency arises when the given class labels are contradictory, that is, when identical samples have different class labels. Another, trivial case of inconsistency is two files containing the same kind of data but different column names. More subtle inconsistencies arise when humans unwittingly inject them into the data. Here's what I mean (adapted from a real story). One of the columns in my data table contained angle measurements, which must be between $-90^{\circ}$ and $+90^{\circ}$. One sensor, for some reason, sometimes reported measurements outside this range. A well-meaning data engineer working on the data acquisition pipeline decided to apply a hard threshold to angle measurements beyond the standard range, so any value above $90^{\circ}$ was stored as $90^{\circ}$, and similarly for values below $-90^{\circ}$. This screwed up my angle statistics and rendered useless the inferences I had drawn from the data until I found out about the thresholding.
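Both kinds of inconsistency can be screened for cheaply. Below is a minimal sketch, on hypothetical data, of two such checks: identical samples with contradictory labels, and a suspicious pile-up of values exactly at the $\pm 90^{\circ}$ boundary that hints at hard clipping.

```python
# Minimal inconsistency checks on a hypothetical table.
import pandas as pd

df = pd.DataFrame({
    "sensor_id": [1, 1, 2, 2, 3],
    "angle":     [45.0, 45.0, 90.0, 90.0, -12.5],
    "label":     ["ship", "buoy", "ship", "ship", "buoy"],
})

# Identical feature rows with more than one distinct label are contradictory
features = ["sensor_id", "angle"]
label_counts = df.groupby(features)["label"].nunique()
print("contradictory samples:\n", label_counts[label_counts > 1])

# Values piling up exactly at the boundary suggest a silent hard threshold upstream
clipped_fraction = (df["angle"].abs() == 90.0).mean()
print(f"fraction of angles exactly at +/-90 degrees: {clipped_fraction:.2f}")
```

Checks like these do not fix the inconsistency, but they surface it early, before it silently distorts your statistics.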

Why this matters for building data-driven, information products

Building data-driven information products is about transforming raw data into information, and part of gaining information consists of reducing uncertainty. Randomness is the most salient type of uncertainty, the one that data scientists are most eager to tame, and the one for which many approaches exist. Inconsistency and, to a lesser degree, incompleteness are often treated as data cleaning annoyances, necessary evils that must be addressed on the way to the fun part of the project, which is building models that explain and cope with randomness.

In real-world applications, however, inconsistency and incompleteness are pervasive, and I argue that they should be treated as first-class citizens by machine learning scientists. The "machine learning model" is a recipe for addressing all three types of uncertainty, not just randomness: the model starts with the data in its rawest form, and it should explicitly spell out all the steps taken to arrive at the final product. This means that the steps taken to address inconsistency and incompleteness should also be included in the model specification.
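One way to make that concrete is to bundle the preparation steps and the estimator into a single object, so the full specification travels and is documented together. The sketch below uses a scikit-learn Pipeline as one possible way to express this; the particular steps and column semantics are hypothetical.

```python
# A minimal sketch: the "model" spells out the data preparation steps,
# not just the estimator.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),   # addresses incompleteness
    ("scale", StandardScaler()),                    # enforces consistent ranges/units
    ("classify", LogisticRegression()),             # addresses randomness
])
# Fitting this pipeline documents every step from raw features to predictions
# in a single, inspectable specification.
```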

For this reason, I argue that the machine learning scientist in charge of an analysis should own the entire data processing pipeline, not just model building, and fully document it as part of the final deliverable. Some projects are large and require splitting the data engineering tasks from the analysis. In that case the machine learning scientist should provide thought leadership to the rest of the team and ensure that the extract, transform, and load (ETL) pipelines are consistent with the semantics of the problem. If you are a manager, it is your job to make sure the interface between analytics and data engineering is as smooth as possible.

You may have heard the saying that a data scientist spends 80% of the time preparing the data and 20% actually building the model. This factoid is usually brought up as a problem to be fixed, the idea being that processes should be in place that provide nice data for the scientist to use. I do not agree with this negative connotation. In fact, I am not one to complain about spending 80% of my time cleaning data and 20% building models. Yes, it can get repetitive and a bit rote at times, but that 80% is invaluable for understanding a data set, determining the extent of missing and inconsistent values, and articulating and prototyping the ETL steps. I totally embrace it because it sets me up on the right path for a proper machine learning-based deliverable. Thinking of the three types of uncertainty as interlocked components helps me throughout my data analysis, and reminds me that, at the end of the day, as a machine learning/data scientist I am in the business of reducing all types of uncertainty, not just randomness, in order to create value for my organization.
