October 19, 2020 | Under leadership, books

The Tyranny of Metrics

There seems to be a growing trend in technology--and in all aspects of our lives, really--to measure everything: business outcomes, organizational success, and individual performance. I've written before about engineering productivity and why I disagree with measuring individual performance through metrics, so I was excited to start reading The Tyranny of Metrics by Jerry Z. Muller.

In the 1970s, the social scientist Donald T. Campbell and the economist Charles Goodhart independently identified the same pattern in measured performance. Campbell claimed that "the more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures, and the more apt it will be to distort and corrupt the social processes it is intended to monitor." In other words, once a given metric becomes the target, some participants start gaming it for their own gain, and the metric stops being a good indicator of whatever it was originally meant to measure. Goodhart, as paraphrased by Marilyn Strathern, said nearly the same thing: "When a measure becomes a target, it ceases to be a good measure."

In the book, Muller draws on both Campbell's and Goodhart's work, then explores metric fixation and the detrimental effects it has on our everyday lives.

Key Takeaways

As we introduce more and more metrics, we fall prey to the distortion of information.

We tend to measure what's easiest to measure, and because of that, we tend to measure inputs rather than outcomes. Simplifying problems is a natural human tendency; in most cases, measuring what's actually important is hard.

One example of this fallacy is charitable organizations. Their outcomes are difficult to measure, so foundations and governments use the percentage of each charity's budget spent on fund-raising and administrative costs as a proxy for efficiency. Collecting and publishing this metric led to underspending on staffing and degraded capabilities, interfering with the charities' operations. Like any organization, charities need competent staff, suitable office space, and efficient information systems. To compensate, their leaders often game the system by reporting that most of their staff work on programs, which creates a feedback loop: organizations are valued according to ever-improving numbers.
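
To make the reclassification trick concrete, here's a toy sketch (all numbers invented): moving the same staff costs from the "administration" column to "program work" slashes the reported overhead ratio while nothing about the organization actually changes.

    def overhead_ratio(fundraising: float, admin: float, programs: float) -> float:
        """Share of total spending reported as overhead (the common proxy)."""
        total = fundraising + admin + programs
        return (fundraising + admin) / total

    # Staff time honestly reported as administration:
    print(f"{overhead_ratio(100, 300, 600):.0%}")  # 40%

    # The same staff time reclassified as "program work":
    print(f"{overhead_ratio(100, 50, 850):.0%}")   # 15%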

We tend to degrade information quality through standardization. Numerical data lets us compare things easily, whether individuals or entire organizations. When we compare this way, however, we often miss important context. Because numbers appear to wash away uncertainty, we stop looking for those nuances.

We tend to replace judgment with metrics. "The demand for measured accountability and transparency waxes as trust wanes," as Muller puts it. He reasons that a lack of trust drives the creation of new metrics, which in turn erodes our reliance on judgment. This usually happens because we believe numbers convey objectivity, making them feel like a safer bet than judgment.

In Great Britain, the Research Assessment Exercise measured faculty members by their number of publications, incentivizing quantity over quality. This led to a bias toward short-term publications instead of long-term research. Worse, once administrators realized researchers were catching on, they began adding metrics on top of metrics, like counting how often an article is cited to measure its "impact." The problem is that this approach doesn't distinguish between positive and negative references. One person might tweet that the article was useful; another might say it was the worst article ever written on the topic. Both count as references to the article, and academic citations suffer from the same problem.

Wouldn't it be better if we trusted each other's professional judgment? Using academia as an example, the assessment could come from a small committee or a department chair.

We don't just suffer from distorted data; there will also be inevitable attempts to game the metrics.

We tend to improve metrics by distorting the data. This happens when participants omit or re-classify data points that reflect unfavorably on them. This kind of gaming is especially frequent in policing, since politicians demand decreasing crime rates to show their constituents.

Police departments collect crime numbers through CompStat. Originally a software solution, CompStat has grown into a broad program combining management philosophy with organizational tools for police departments. Its purpose was to reveal crime hot spots so police could deploy resources accordingly, but that purpose was lost once politicians began demanding better numbers. Some police department heads came to believe their careers depended on improving these statistics. As a result, break-ins became "trespassing" and theft became "loss of property": major crimes were reclassified as minor offenses so department heads could present favorable data to their superiors.

We tend to improve the metrics by lowering the standards. In software engineering, this usually shows up when a team repeatedly violates its SLAs and, instead of fixing the root causes, simply keeps loosening the targets so the error budget grows.
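
As a concrete sketch of that arithmetic (the SLO targets and downtime figure below are invented for illustration, not from the book): the same hour of downtime that violates a 99.9% availability SLO fits comfortably inside the error budget once the target is quietly loosened to 99.5%, even though reliability hasn't improved at all.

    def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
        """Minutes of allowed downtime for an availability SLO over a window."""
        total_minutes = window_days * 24 * 60
        return total_minutes * (1 - slo_target)

    downtime = 60.0  # minutes of downtime actually observed this window

    for slo_target in (0.999, 0.995, 0.99):
        budget = error_budget_minutes(slo_target)
        status = "OK" if downtime <= budget else "VIOLATED"
        print(f"SLO {slo_target:.1%}: budget {budget:6.1f} min -> {status}")

    # SLO 99.9%: budget   43.2 min -> VIOLATED
    # SLO 99.5%: budget  216.0 min -> OK
    # SLO 99.0%: budget  432.0 min -> OK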

We tend to game the metrics by creaming. This happens when we pick simpler targets in less challenging circumstances, making goals easier to reach, which invites complacency and stagnation.

Consider a medical practice where compensation is tied to the success rate of surgeries. Some surgeons will take on less complicated cases so their performance indicators look better, leaving patients with more serious conditions sidelined as qualified doctors avoid their cases.

There are also inspiring examples of using metrics in hospitals, however. Peter J. Pronovost and his colleagues created a checklist of five simple steps to reduce a type of infection that was plaguing hospitals. They collected metrics on infection rates and made them public, which pressured other hospitals to improve. The project succeeded because the metrics weren't used to penalize or to compete, but to show peers that the infection rates were manageable.

We also tend to underestimate the cost of collecting metrics. Data collection often means adding more people to an organization, some of them dedicated solely to making sense of all the numbers.

Conclusion

Creating new metrics is full of pitfalls. If you are about to introduce one, ask yourself these questions:

  • What information are you trying to measure? Metrics on human activity tend to become unreliable, since it's human nature to react to being measured.

  • Is the metric you picked a good proxy for the information you are trying to measure? If not, you probably don't want to collect it.

  • Do you really need one more metric? Think about the additional cost of analyzing it. Information is never free.

  • Will you publish the data for internal use or for external consumption? Think about the policing example above.

  • Who should be involved in creating a new metric? Getting the buy-in from participants is easier when they drive the metric's creation.

Sometimes, the best metric is no metric.

To learn more, check out The Tyranny of Metrics by Jerry Z. Muller.
