Jackson Argo


Minimizing Harm in Machine Learning Classification Systems for Sensitive Categories

10 August 2021

Abstract


It is notoriously hard to pin down the definition of things we might want to measure and collect data about. Seemingly mundane words like sandwich become increasingly slippery as we try to enumerate the things that are and are not a sandwich. One merely needs to search the controversial question “is a hotdog a sandwich?” to find plenty of polarized hot takes on the topic. For more abstract concepts like race, culture, and gender, deciding on a universally accepted definition might be an impossible task. Yet a concise, often one-word description or label is a core requirement of many machine learning algorithms. In practice, this often means that someone who is not an expert on the topic, usually the engineers or policy-makers tasked with building the data collection systems, has to decide what definition to use. Unfortunately, it is these same abstract concepts that have the potential to cause the most harm to marginalized people when left unchecked. This paper aims to create a set of guidelines intended to minimize the potential harm from these kinds of systems.


Motivation


I work at Slack, and my team’s role is to provide companies that use Slack with access to their own Slack usage metrics. We can’t simply give companies a download of all the HTTP requests their users sent to Slack. Aside from the privacy implications, data in this format is not useful for most people outside of Slack. Instead, we process the data into a set of precomputed metrics, and provide those to customers. The big question that my team constantly grapples with is: what does this metric actually measure? You might think that metrics which are core to the way people use Slack, such as the number of messages sent on a given day by a given user, would be trivial to define. However, computing metrics like number of messages sent requires a surprisingly precise and nuanced definition of that metric. In order to cover customer use cases, we actually provide several metrics that measure the number of messages sent in subtly different ways. For instance, we have one metric that counts the number of messages a user types directly into Slack, and a different metric that counts the number of messages sent by an app on behalf of that user. To add complexity to the situation, these metric definitions change from time to time. Introducing a new feature to Slack can change paradigms and require new definitions for certain metrics. Sometimes we catch a bug in the implementation and the metric that we’ve been calculating doesn’t actually match the definition in the documentation. This can happen upstream of our datasets too, so that when we do recompute metrics, more changes propagate into the metrics than we expect. We may also get feedback from customers that a metric we provide is misleading or simply not useful in its current state. When we make the choice to update a metric definition, it can be a costly procedure. We typically recompute those metrics over several years’ worth of data, which can take significant developer time and cloud resources. Then, we have to explain to customers why the numbers changed, which hurts customer confidence in Slack’s metrics. For these reasons, it is in Slack’s best interest to spend a substantial amount of time developing, testing, and maintaining its usage metrics, and thankfully, my team does this to the best of our ability.
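To make the distinction concrete, here is a minimal sketch of how two subtly different “messages sent” metrics could be computed from the same event log. The table layout, column names, and pandas pipeline are assumptions for illustration, not Slack’s actual schema or implementation.

```python
import pandas as pd

# Hypothetical message log; column names are illustrative, not Slack's real schema.
messages = pd.DataFrame({
    "user_id":   ["U1", "U1", "U2", "U2", "U2"],
    "sent_via":  ["client", "app", "client", "client", "app"],  # how the message was posted
    "sent_date": pd.to_datetime(["2021-08-01"] * 5),
})

# Metric 1: messages the user typed directly into the client.
typed_directly = (
    messages[messages["sent_via"] == "client"]
    .groupby(["user_id", "sent_date"]).size()
    .rename("messages_typed_directly")
)

# Metric 2: messages an app sent on the user's behalf.
sent_by_app = (
    messages[messages["sent_via"] == "app"]
    .groupby(["user_id", "sent_date"]).size()
    .rename("messages_sent_by_app")
)

# The two definitions disagree for every user, even though both are "messages sent."
print(pd.concat([typed_directly, sent_by_app], axis=1).fillna(0).astype(int))
```

Even in this toy version, neither number is the “real” message count on its own; which one is right depends entirely on the question the customer is asking.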

This process of collecting a set of raw metrics and reducing it down to the relevant information for a target audience is certainly not unique to messaging apps. In the morning, I always check my cell phone for the weather. My phone analyzes the entire morning forecast, including temperature, barometric pressure, wind speed, etc., and categorizes it into one of several weather phenomena. It displays a cute illustration to represent the chosen category, which can be a sun, a cloud, a rain cloud, and so on. This is very convenient for me, since I don’t care to read the entire daily forecast. Unfortunately, my phone makes mistakes from time to time. It may show a rain cloud, but I never see rain. Or, somewhat more subtly, it might show a sun with a blue sky in the background but fail to incorporate the haze of air pollution turning the sky orange. These mistakes have happened enough times that I also know to use my window as an extra source of information. I use what I see outside to better inform my own subtly different weather categories. Of course, it would be great if I could update the categories used by the app, so that this extra step is not necessary. Unlike at Slack, however, I am not an employee of Weather App Inc, so I can’t change the application to use my own definitions. I also do not have the know-how or time to build Android apps, so making my own weather app is out of the question. Further, Weather App Inc has never reached out to me for input regarding these categories. If they ever do, I will gladly give them my suggestions.

These are just a few examples of ways that we interact with categorization systems on a daily basis. Subtle differences in categorizing the weather forecast or calculating usage metrics on a messaging platform will likely cause a trivial amount of harm to individuals. However, categorization systems used to inform government and policy decisions, such as the US census, can have a much greater impact on people’s lives. These systems have the same insurmountable task of trying to categorize things, only this time the thing being categorized is a person, and the risks are much greater.

Injustices of Categorization


In the same way that an illustration of a sun or cloud conveniently simplifies the weather forecast, categorizing people into races, genders, sexualities, classes, and so on conveniently simplifies the vast dynamics of human experience into just a handful of attributes. Systems that use these categorizations are susceptible to the same limitations that affect Slack’s usage metrics and my cell phone’s weather app. Definitions are bound to change over time. We as a society may not agree on the specific category definitions, or we may find that certain categories should be added while others should be removed. Like all categorization systems, these will make mistakes, and the people affected often find themselves with little recourse to make corrections. As Kate Crawford explains in Excavating AI, “categorizing humans is becoming more and more common,” and these systems have “no way for outsiders to see” how the categorization decision was made.

These categorizations are deeply embedded in our laws and government systems. Take passports as an example. Passport applications ask for your gender, and your answer is printed on your passport. Perhaps this acts as a way to assist with identification and lawful entry at the border. If John Cena tried to enter Canada and his passport said he was a woman, the border patrol might have some questions. John Cena fits nicely into the man/woman gender binary, and the border guard is reasonably suspicious of this inaccurate passport. Now compare this to the experience of people outside the gender binary. The US passport application only allows a person to choose between man or woman for gender. How should a gender non-conforming person answer? If this field truly is used for identification, but neither man nor woman accurately identifies them, then either response is equally useless. Perhaps there is some hidden use case for gender on a passport where even an inaccurate response is useful, but without making that use case known to the people filling out their passports, there is no way it could be taken into consideration. The woes of gender categorization do not stop at the passport gate. Costanza-Chock explains the painfully uncomfortable scenario of trying to get through airport security as a trans person. The airport’s millimeter wave security scanners are designed to make threat assessments in part informed by a person’s gender. The security scanner has a list of gender options, either man or woman, and both options have their own particular criteria. When someone does not fall into one of these two options, the machine fails to make a proper threat assessment. A TSA agent now has to perform a pat-down. TSA protocol is that male officers perform pat-downs on men and female officers perform pat-downs on women. Again, the protocol breaks down for the same reason the millimeter wave scanner does: this person is neither a man nor a woman, so who performs the pat-down? Excluding options from a category is a common issue in categorization systems, so perhaps a solution is to include more options. The OMB’s race and ethnicity framework notably leaves out an important option: Middle Eastern and North African. But even if the OMB created 1,000 categories, someone could still propose yet another. By that point, the categorization has become so broad that it no longer serves the purpose of condensing information into something more digestible. Similarly, we could choose to describe people by their genome sequence alone. It would be more accurate, but in the same way that Slack does not turn over its access logs to customers, this would come with deep privacy implications and would not be very useful in most cases.

Categorizing people has a very troubling past. Throughout history, it has been implemented many times for various reasons, with varying results and public opinion. Unfortunately, these categorizations tend to be entirely in service of some existing power structure. As an example, categorizing people as black or white has been a large part of US history, with evolving needs and definitions, but it has almost always come with some form of discrimination. Until the institution was ended, slave status in the US was directly tied to a person’s racial categorization, but methods to determine a person’s race saw several iterations. As Wood explains, “the imposition of hereditary race slavery was gradual, taking hold by degrees over many decades.” The abolition of slavery unfortunately did not abolish the use of racial categorization for discrimination. During the Jim Crow era, the distinction between black and white became so nuanced that the “one drop rule” was invented. Simply looking at someone’s skin color was no longer enough to make the distinction; a person’s birth certificate and ancestry became part of the formula for racial categorization. Thanks to updated civil rights laws, today’s categorical systems for people are less obviously racist, but they are still built upon the same paradigms and power dynamics. Race would hardly be worth mentioning were it not for the people in power who decided race is an important classifier of people. You may think that we should avoid using categories of people at all, and in many cases this is true: there are many potential harms that can come from categorization, and when we can avoid it, we should. However, we should not ignore social categories altogether. These categories were created from a power structure, and we can study them and their usage to expose the underlying power disparities and discrimination. For example, if scientists chose to ignore gender categories when studying income distribution in the US, we would be blind to the gender pay gap. Or, should statisticians take a colorblind approach to race, we would not be able to expose injustices in the over-policing and over-incarceration of minorities. Unfortunately, we are far from King’s ideal where people are judged “not by the color of their skin.” In reality, we are judged by much more than just our skin, and this judgment comes baked into our machine learning data. It may come as no surprise that the people designing and building categorization systems, data scientists, software engineers, and lawyers, are mostly white men, a category of people at the top of the power dynamic. This is a fairly limited scope of perspective for implementing some very critical and far-reaching systems. AI systems today are used to inform “decisions about bail, sentencing, and parole” (WIRED). Whether fairly defined or not, these social categories do impact people’s lives, and there are appropriate times when they should be used in data science research. We leverage AI more and more in our decision-making processes, and this can have broad consequences when applied to marginalized groups.

Current Protections


Categorizing people has such a troubled history that many laws have been enacted to regulate the data you can collect about an individual and limit the categorical decisions you can make using that data. These laws come in many forms, two of which I will discuss further: online privacy protections and anti-discrimination laws. Online privacy protections emphasize the importance of individual privacy and establish the rights of online data subjects. Rights of online data subjects have seen huge advances in the past decade. GDPR, passed in 2016 by the European Union, aims to establish the “responsible use” of personal data by enforcing a set of “common rules” that apply to anyone using and collecting such data. California has its own online privacy law, the California Online Privacy Protection Act, which serves much the same purpose as GDPR. These laws require companies that collect and process personal data to have a privacy policy describing how personal data is used and shared and to allow users to opt out of certain data collection. While users have more control over the collection of their personal data, this does not necessarily protect them from the categorical harms they might face. US anti-discrimination laws, on the other hand, are explicitly designed to protect people from certain outside decisions informed by that person’s categorization. The Equal Educational Opportunities Act, or EEOA for short, explicitly prohibits actions that would deny an “equal educational opportunity” to any child due to that child’s “race, color, sex, or national origin.” The Fair Housing Act uses a similar format, this time prohibiting housing providers from discriminatory practices based on “race or color, religion, sex, national origin, familial status, or disability.” These laws were not founded to protect people from dangers on the internet, but they have still been applied in that context. They are particularly relevant in the field of automated advertising. Automated advertising based on a person’s demographic categories is a cornerstone of nearly any “free”-to-use website or app. The decision whether to display an ad for a pink dress or a black tuxedo is most likely informed by a person’s gender. Suppose a particular person’s gender was mislabeled, or even that a person’s gender actually has no bearing on how they dress. In this case, the automated advertising system would be flawed, but the harm is fairly trivial. This is no longer the case when the decision is between an advertisement for a dress or an apartment listing. Suddenly, the potential for harm has greatly increased, and the decision to show a particular advertisement is now a civil rights issue.


The Framework


I hope that by now the motivation for and importance of such a framework are clear. Finally, I introduce the framework for minimizing harm when categorizing people and making decisions based on those categories. This framework is intended for engineers, scientists, and anyone else building systems that need to classify people into sensitive categories or make decisions about such categories. I will develop the framework in three components: Consult the Stakeholders, Transparency by Design, and Keep the Algorithm Alive. The components of this framework function together as a toolset to reduce harm when categorizing people and making decisions with those categories.

Consult the Stakeholders


The first component is Consult the Stakeholders. The stakeholders are anyone who might be harmed by this categorization, either through data collection or its use to inform decisions. Your first inclination might be to consult scientists, engineers, or other peers who have developed a similar system of categories. However, category definitions are subject to change according to the context in which they are used. Suppose we had a model that looks at the age quantile of a population, that is to say, the oldest x% of that population. Roughly the oldest 20% of the US population are 65 and older, so we categorize anyone who is in the oldest 20% as a senior citizen. It would be absurd to use that same model to categorize the oldest 20% of preschoolers in Germany. A set of category definitions that worked for one service is not guaranteed to be the right solution for another. It is critical to include marginalized groups when designing the algorithms and defining these categories. Diversity in the engineering and design teams is an easy first step; it immediately broadens perspective and brings new understanding of these categories and their risks to the whiteboard. Team diversification is not always within the control of the design teams, though. Hiring, recruiting, and budgeting, all critical to ensuring team diversity, are often controlled by separate management within the company. Further, diversity alone is not enough, as it does not address the structural power dynamics that exist outside the team’s domain. “Employment diversity is a necessary first move,” as Costanza-Chock warns, “but it is not the far horizon of collective liberation.” In order to reach beyond the limitations of team diversity, engage the community and users in the design of these systems, particularly users from marginalized groups. One example of this is allowing users to view and correct their information and categorizations, an important component of data subject rights required by GDPR.
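To make the age-quantile example concrete, here is a minimal sketch showing how the same “oldest 20%” rule produces entirely different cutoffs depending on the population it is applied to. The populations below are synthetic stand-ins and the numbers are illustrative assumptions.

```python
import numpy as np

def oldest_quantile_cutoff(ages, quantile=0.80):
    """Age at or above which someone falls in the oldest (1 - quantile) share of this population."""
    return np.quantile(ages, quantile)

def categorize(age, cutoff):
    return "senior citizen" if age >= cutoff else "not senior"

rng = np.random.default_rng(42)
us_adults = rng.normal(48, 18, 10_000).clip(18, 100)  # rough stand-in for US adult ages
preschoolers = rng.uniform(3.0, 5.0, 10_000)          # stand-in for German preschoolers

us_cutoff = oldest_quantile_cutoff(us_adults)            # lands in the low-to-mid 60s
preschool_cutoff = oldest_quantile_cutoff(preschoolers)  # roughly 4.6 years old

print(categorize(66, us_cutoff))          # "senior citizen", as intended
print(categorize(4.7, preschool_cutoff))  # also "senior citizen" -- the rule has lost its meaning
```

The rule itself never changed; only the context did. The same mismatch happens, less visibly, when category definitions built for one population or product are reused for another.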

Transparency by Design


The next component is Transparency by Design. Transparency in machine learning systems is a well-researched and much-discussed topic, but it ultimately aims to give users more confidence in these systems as well as some form of contestability when the system makes a mistake. Burrell explains, “finding ways to reveal something of the internal logic of an algorithm can address concerns about lack of ‘fairness’ and discriminatory effects.” Transparency should exist both in the designs of these systems and categories and in the way these systems are used. It should be obvious to the end user what categories are used and how they are defined, and ideally the system itself should provide visibility into the decision-making process. This can be accomplished by publishing clear and concise documentation that defines the categories and selection criteria. Transparency by design is analogous and complementary to the privacy-by-design paradigm of individual privacy. Both give the user more control of their data, and both act to minimize harm to marginalized groups. Practically speaking, transparency is not so easy to implement. Even outside the field of artificial intelligence, application visibility is a massive domain. The budding app engineer has an abundance of options for visibility tools, because so many people and companies have attempted to solve this problem. Machine learning differs from other app development in that its algorithms often trade away explainability entirely for the sake of accuracy; when making determinations based on social categories, this becomes a huge obstacle to user contestability and fairness auditing. It may not always be possible to implement sufficient visibility into a machine learning algorithm, but Burrell describes one final catchall solution: “avoid using machine learning algorithms in certain critical domains of application.”
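One way to build this kind of transparency in from the start is to attach a human-readable decision record to every categorization: which published definition was applied, which version of it, and which criteria were actually met. The sketch below is a hypothetical illustration of that pattern; the “active account” category, its criteria, and the version string are invented for the example, not drawn from any real system.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DecisionRecord:
    """A user-visible record of how a single categorization decision was made."""
    label: str
    definition_version: str          # which published category definition was applied
    criteria_met: list = field(default_factory=list)
    decided_on: date = field(default_factory=date.today)

def categorize_account(messages_sent: int, apps_installed: int) -> DecisionRecord:
    # Hypothetical "active account" definition, version v2 of the published docs.
    criteria = []
    if messages_sent >= 10:
        criteria.append("sent at least 10 messages this week")
    if apps_installed >= 1:
        criteria.append("has at least one app installed")
    label = "active" if criteria else "inactive"
    return DecisionRecord(label, "active-account/v2", criteria)

record = categorize_account(messages_sent=12, apps_installed=0)
print(record)  # the user can see exactly which criteria produced the label, and contest them
```

Exposing the definition version alongside the decision also makes audits possible: anyone can check whether the criteria in the code still match the criteria in the published documentation.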

Keep the Algorithm Alive


The third and final component is Keep the Algorithm Alive. Our understanding of the world around us is constantly changing, and our machine learning models should adapt with it. If we let our machine learning models use outdated categories, we degrade the value and accuracy of our models and increase the potential for harm. Definitions of words change, and commonly understood categories may shift in subtle ways. The author writing as The Wheelchair Historian describes the origins of several words used to categorize persons with disabilities. Words such as cripple and lame were considered less offensive in the past and are frequently found in historical texts. However, the negative connotations of those words have since been rejected, and we would be shocked and reasonably offended if we saw these categorizations pop up in a machine learning algorithm. When designing models and defining categories and label names, consider how long the labels and models will remain usable before they should be re-evaluated. For more sensitive and volatile data, or any data with a higher risk of causing harm, this should happen more frequently. Check that the code still agrees with the published documentation. Assign a team or person as the Data Matter Expert with ownership of and responsibility for the data definitions. Finally, in pursuit of optimizing for accuracy and minimizing harm, the process of consulting the stakeholders and maintaining transparency in design should be kept alive and in constant review.
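One lightweight way to keep definitions alive is to store every category definition alongside an owner and a review interval, then run a scheduled check that flags anything overdue. The sketch below is a hypothetical example; the category names, owners, and review intervals are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class CategoryDefinition:
    name: str
    definition: str
    owner: str                  # the "Data Matter Expert" responsible for reviews
    last_reviewed: date
    review_interval: timedelta  # shorter for more sensitive or volatile categories

    def is_stale(self, today: Optional[date] = None) -> bool:
        return ((today or date.today()) - self.last_reviewed) > self.review_interval

CATEGORIES = [
    CategoryDefinition(
        name="gender",
        definition="Self-reported; free-text entry plus 'prefer not to say'.",
        owner="identity-data-team",
        last_reviewed=date(2021, 1, 15),
        review_interval=timedelta(days=180),  # sensitive: review twice a year
    ),
    CategoryDefinition(
        name="weather_icon",
        definition="Derived from forecast precipitation and cloud cover.",
        owner="forecast-team",
        last_reviewed=date(2020, 6, 1),
        review_interval=timedelta(days=730),  # lower risk: review every two years
    ),
]

# Run this on a schedule (a cron job or CI check) to flag definitions overdue for review.
for category in CATEGORIES:
    if category.is_stale():
        print(f"'{category.name}' is overdue for review; ping {category.owner}")
```

The same check can also compare the definitions in code against the published documentation, so that drift between the two is caught before customers notice it.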


Conclusion


Categorization is an important part of decision-making because it allows us to distill large amounts of data into more manageable and communicable chunks. While this is convenient and practical, it is not always appropriate and can cause undue harm to marginalized individuals who exist outside the bounds of the categories. There are existing protections in place to combat discrimination based on categories, but these legal protections do not provide a framework for the engineers and designers who build machine learning systems. When particularly volatile or sensitive categories are poorly implemented or poorly understood in machine learning algorithms, they can have far-reaching consequences. For these reasons, it is important to include marginalized individuals and any other stakeholders in the designs of these systems. The design choices, categories used, and selection criteria should be made transparent to users. And finally, regular maintenance of the dataset includes validating existing definitions, retiring outdated categories, and continually incorporating feedback. With these three framework components working together, we can build more fair and just machine learning systems.


References


tags: data - machine learning