Data Science Deep Dive: Using Real-World Heuristics to Assign Demographic Labels

Machine learning models are only as good as their training data.

Alex Block

High-quality demographic labels are vital for many products, yet gathering these is a challenge for everyone. On the surface, a solution leveraging machine learning seems straightforward: a well-trained model with enough information about a device’s behavior (e.g., the stores its user frequents or the neighborhood they live in) should predict the labels of that device’s user. However, machine learning models are only as good as their training data—if the labels are inaccurate, the results will be too.

Historically, the gold standard solution has been user-declared labels (typically obtained from external providers), but even companies specializing in these face significant challenges in acquiring high-quality demographics data. Users don’t always share accurate information and, even when they do, associating that information with devices is an uncertain process. This situation has led data scientists to express skepticism about these label sets. A recent analysis within the company of the self-reported genders (based on one such set of labels) of visitors to over 100 nationwide chains found exactly zero chains with skews larger than 55/45, even though many of these chains exclusively target either men or women.

So how do we solve this issue? Aiming to improve the quality of our demographic data, we developed a heuristic-based approach for assigning probabilities for each label to devices.

We used observations that correlated with a given gender and/or age based on marketing research, census data, and employment statistics. We devised a plan to observe the gathered evidence, validate each heuristic’s skew, and combine all the evidence to assign demographic labels. Lastly, to validate our method, we applied our technique to the subset of Foursquare devices with self-reported ages (via birthdays) and genders associated with them. Here is our process, along with our key learnings:

Writing Good Heuristics

The first step is writing good heuristics. To do so successfully, you must:

  • Identify a quantifiable bias toward a given demographic label or set of labels.
  • Show a feasible connection between the heuristic and some real-world behavior.
  • Define a clear distinction between a heuristic’s definition and its implementation. For our research, the heuristic is the fact that displays a demographic bias, and the implementation is how we attempt to derive that fact from our data.
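To make the definition/implementation split concrete, here is a minimal sketch in Python. The heuristic itself, its label, its skew value, and the visit threshold are all hypothetical examples for illustration, not figures from our actual heuristic set:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Heuristic:
    name: str
    # Definition: the real-world fact that displays a demographic bias.
    definition: str
    # The label the heuristic skews toward, and how strongly
    # (e.g., 0.85 means ~85% of devices meeting it are expected female).
    label: str
    skew: float
    # Implementation: how we attempt to derive that fact from our data.
    matches: Callable[[dict], bool]

# Hypothetical example: market research suggests nail-salon regulars skew female.
nail_salon = Heuristic(
    name="frequent_nail_salon_visitor",
    definition="Regular nail-salon visitors skew female in market research",
    label="female",
    skew=0.85,  # illustrative number only
    matches=lambda device: device.get("nail_salon_visits", 0) >= 3,
)

print(nail_salon.matches({"device_id": "d1", "nail_salon_visits": 4}))  # True
```

Keeping the definition (the biased fact) separate from the implementation (the visit-count test) lets us swap in a better detector later without revisiting the underlying market research.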

Our full set of initial heuristics can be roughly broken down into a few categories:

Image: heuristics table

Gathering Evidence

We seek to accumulate “evidence” that a given device matches a given label (gender or age bucket) based on that device meeting predefined tests. We adopted a “prior” assumption about the distribution of our device population, collected “evidence” based on the heuristics that each device meets, and updated our prior based on the various pieces of evidence collected.

From this, we derived posterior probabilities by weighing the prior against the collected evidence. The outcomes break down into three blocks:

  • Weak or no evidence: Most devices will only meet a few of our heuristics and therefore fall into this category. In this case, we should expect to gather only a few “points” of evidence overall and recover our prior.
  • Strong evidence towards multiple labels: In some cases, we may make several contradictory observations and gather a high degree of evidence for multiple labels. These few devices should yield a roughly flat posterior regardless of our prior.
  • Strong evidence towards one label: Our ideal case involves multiple independent observations for a device that all suggest a single demographic label. In this case, the evidence dominates the prior.

In sum: for a high degree of conflicting evidence, we approach a flat posterior regardless of our prior; for weak (or no) evidence, we generally retain the prior; and for strong evidence towards a single label, we yield a strong posterior for the corresponding label(s). Ultimately, only devices falling into the third bucket would be assigned labels with a high degree of confidence.

Validating our Method

Finally, we investigated the viability of our method by applying it to the set of owned-and-operated (O&O) devices for which we had self-reported ages and genders. We aimed to address the following questions:

  1. Do enough devices meet enough heuristics that we could feasibly use this technique to generate training data for a machine learning model?
  2. Do our heuristics correlate with the self-reported labels in the O&O population? If so, then we’d naively expect them to accurately predict labels for devices without self-reported demographics.
  3. Does this correlation disappear when we shuffle the labels on the O&O population? If so, we’ve demonstrated that the correlation is real rather than an artifact of a third variable.
  4. Do heuristics that we believe provide evidence towards the same label correlate with one another?
  5. Does this correlation disappear when we compare heuristics that we do not believe point towards the same label?
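Question 3 amounts to a label-shuffling (permutation) check: if a heuristic's skew toward a label survives after the self-reported labels are randomly reassigned, the skew was an artifact. A minimal sketch on synthetic data (the population, firing rates, and seed are all hypothetical):

```python
import random

def skew_toward(label, flags, truths):
    """Fraction of devices flagged by a heuristic whose true label matches."""
    flagged = [t for f, t in zip(flags, truths) if f]
    return sum(1 for t in flagged if t == label) / len(flagged)

random.seed(0)
# Hypothetical O&O population: a female-skewed heuristic fires far more
# often on devices whose self-reported label is "female".
truths = ["female"] * 500 + ["male"] * 500
flags = [random.random() < (0.6 if t == "female" else 0.1) for t in truths]

real_skew = skew_toward("female", flags, truths)

shuffled = truths[:]
random.shuffle(shuffled)
shuffled_skew = skew_toward("female", flags, shuffled)

print(f"real: {real_skew:.2f}, shuffled: {shuffled_skew:.2f}")
```

The real skew lands well above the population base rate of 50%, while the shuffled skew collapses back toward it, which is exactly the signature we looked for in our validation.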

We developed and implemented a technique to assign “ground-truth” labels to devices based on our data and real-world knowledge. Roughly 2% of our input devices meet at least three heuristics (even before eliminating devices without recent activity), suggesting that the answer to our first question is a resounding yes.

Our method shows clear promise: even a limited “first draft” of heuristics skewed towards the demographic labels in the way we expected. Moreover, these skews vanished when we randomized the labels in our test data, implying that they were due to a connection between the real-world behavior we inferred for a device and the demographics of the individual carrying it. We also identified significant overlap between “matching” heuristics: those we assumed would select devices carried by women were far more likely to overlap than purely random sampling would predict, with some pairings yielding 200-300 times more devices in common relative to random. But despite these promising results, there is still work ahead of us to demonstrate that our method works on a larger scale.
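The "times more devices in common relative to random" comparison can be computed as an overlap lift: the observed intersection of two heuristics' device sets divided by the intersection size that independent random sampling of the same sizes would predict. A sketch with made-up device sets:

```python
def overlap_lift(set_a: set, set_b: set, population_size: int) -> float:
    """Observed overlap divided by the overlap expected under
    independent random sampling of the same set sizes."""
    expected = len(set_a) * len(set_b) / population_size
    return len(set_a & set_b) / expected

# Hypothetical populations; device IDs are just integers here.
population = 1_000_000
salon = set(range(0, 1_000))       # 1,000 devices meet heuristic A
boutique = set(range(500, 1_600))  # 1,100 devices meet heuristic B

print(overlap_lift(salon, boutique, population))
```

Here random sampling would predict about 1.1 shared devices, but the sets share 500, for a lift of roughly 450x. Lifts far above 1 between heuristics believed to target the same label (and near 1 between unrelated ones) are the pattern questions 4 and 5 probe.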

What does this mean?

Our heuristics-based approach is viable—the idea of using market research numbers together with the visit behavior of a device to assign high-quality demographic labels to devices has promise. When fully fleshed out (our goal during this process was to demonstrate feasibility, rather than to produce a complete product) and accompanied by high-quality heuristics, it can yield demographic labels that are more accurate than the ones we purchase from third-party providers.

Given these findings, there are two routes one could take with this research:

  1. Push forward with using this technique to label devices for training ML models,
  2. Or use it to build a tool to evaluate the accuracy of labels, whether given to us by a third party or produced by an ML model.

Provided that this technique outperforms data we can buy from third parties, the former would yield higher-quality predictions (and save some resources). The latter, meanwhile, would give us a new way to evaluate the labels that we purchase and the predictions that we make from those seed labels.

Whichever route we decide to take, the future of data looks promising and exciting.