January 11th, 2021 · 7 minute read

ML Model Validation - Factors that Compromise your ML Outcomes

Updated: January 11th, 2021


ML Model Validation - Factors that compromise the predictions of your ML models

Bias in machine learning models is something that many people are very worried about. ML model validation and the factors that impact ML outcomes all start with an understanding of how automated systems learn about our world and about us.

Automated systems learn about the world through machine learning models. The programmers who create these models don’t give the machines step-by-step rules to follow. They allow the models to learn on their own using vast sets of data. 

This can be useful for businesses, governments, and institutions, helping them to make predictions that can boost their operations. For instance, a college can train ML models to diagnose depression in students so counselors can intervene at the right time before students drop out of college.

So what is bias in ML models? Biases are present in ML models because they are developed by humans, and humans are biased. Think about it. Are you objective about everything and everybody? No? Neither are the people who work with machine learning. They have opinions, attitudes, and prejudices that might be reflected in their ML models.

Why is this a problem?

Biases in machine learning models result in predictions that are untrustworthy and even dangerous. Biases can originate from several sources: the data itself, the engineers who gather the data, the model designers and others on their team, and the data scientists who implement those models. 

Bias is an inclination or prejudice for or against one person or group, in a way considered to be unfair. A recruiter who is biased toward women might employ more women, whereas a recruiter who is biased against women might hire more men. When bias enters predictive machine learning algorithms, it can have dire - albeit unintended - consequences, because far-reaching decisions are made on the basis of the model's predictions.

It is essential to understand how bias is introduced into machine learning models and how to detect it in your ML model validation practices.


1. Sample Bias

Sample bias occurs when the distribution of the training data is not representative of the actual environment that the machine learning model will be running in. Consider, for example, a model that is trained exclusively on daylight video while the self-driving car it's being developed for will be driving in all conditions, including at night. The data is biased because no night-time examples were included.

Machine learning models are trained to make predictions from historical data; they can only predict from what they have been exposed to. If the data is incomplete, the predictions will be too. The predictions are only as reliable as the humans who chose and analyzed the data.

The data scientist must choose training data that reflects the real environment the model will be deployed in. A simple distribution check, like the sketch below, can catch obvious mismatches before the model ships.
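
As a rough illustration, here is a minimal sketch of such a check. The DataFrames train_df and production_df and the "lighting_condition" column are hypothetical placeholders, not the output of any specific tool; a chi-square test simply flags a training set whose distribution looks very different from what the model sees in production.

```python
# Minimal sketch: compare the category distribution of the training data
# against a sample of the environment the model will actually run in.
# `train_df`, `production_df`, and "lighting_condition" are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

train_df = pd.DataFrame({"lighting_condition": ["day"] * 950 + ["night"] * 50})
production_df = pd.DataFrame({"lighting_condition": ["day"] * 600 + ["night"] * 400})

counts = pd.DataFrame({
    "train": train_df["lighting_condition"].value_counts(),
    "production": production_df["lighting_condition"].value_counts(),
}).fillna(0)

# Chi-square test of the hypothesis that both samples share one distribution.
_, p_value, _, _ = chi2_contingency(counts.T)
if p_value < 0.05:
    print(f"Possible sample bias: training data does not match production (p={p_value:.4f})")
```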


2. Prejudice Bias

Prejudice bias is also a result of human action - in this case, the cultural stereotypes that people harbor. These can be present in anyone involved in the machine learning process. Prejudices related to race, social class, nationality, gender, and age can infiltrate a machine learning model and alter the model's output.

For example, if a machine learning model that is learning about men and women is shown more images of women in kitchens than men in kitchens, and more men in front of computers than women in front of computers, then the model learns that women are cooks and men are computer programmers. The data that the algorithm is trained on teaches it to incorrectly conclude that women don't program and men don't cook.

Data scientists must be aware of these types of biases and find ways to control them.
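
One simple check along these lines is to cross-tabulate the training labels against a sensitive attribute and look for skew. The sketch below assumes a hypothetical labels_df table with "gender" and "occupation" columns describing the people in the training images.

```python
# Minimal sketch: how often each occupation label co-occurs with each gender
# in the training data. `labels_df` and its columns are hypothetical.
import pandas as pd

labels_df = pd.DataFrame({
    "gender":     ["female", "female", "female", "male", "male", "male"],
    "occupation": ["cook", "cook", "cook", "programmer", "programmer", "cook"],
})

# Row-normalized cross-tabulation: the share of each occupation within each gender.
# A heavily skewed table warns that the model may learn the stereotype.
print(pd.crosstab(labels_df["gender"], labels_df["occupation"], normalize="index"))
```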


3. Confirmation Bias

Confirmation bias is the tendency to process information by looking for, or interpreting, information that is consistent with one’s existing beliefs. In machine learning, it is the tendency to choose source data or model results that align with the data scientist’s or other team members’ currently held beliefs and to not question outcomes that support their beliefs and assumptions.

Data scientists and others involved in the process should pay close attention to results that either contradict or confirm their expectations. The results should be examined closely to confirm their real meaning: do they just reflect the team's views, or are they authentic? One way to confirm the authenticity of the results is to test the model on another data set.
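
As a rough sketch of what that looks like in practice - the file names, the "label" column, and the choice of a scikit-learn logistic regression are all assumptions for illustration - you might train on the team's own data and then score the model on an independently collected set:

```python
# Minimal sketch: guard against confirmation bias by scoring the model on an
# independently collected data set. File names and the "label" column are
# hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

primary = pd.read_csv("primary_dataset.csv")    # data the team selected
external = pd.read_csv("external_dataset.csv")  # independently collected data

features = [c for c in primary.columns if c != "label"]

model = LogisticRegression(max_iter=1000)
model.fit(primary[features], primary["label"])

for name, df in [("primary", primary), ("external", external)]:
    accuracy = accuracy_score(df["label"], model.predict(df[features]))
    print(f"{name} accuracy: {accuracy:.3f}")

# A large drop on the external set suggests the original results reflected
# the team's data choices rather than a model that genuinely generalizes.
```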

These are just a few examples of how biases are introduced into models. There are many more, including group attribution bias, exclusion bias, measurement bias, and algorithmic bias. All of these factors need to be continually evaluated in a proper ML model validation practice. Tools like Datahunter automate bias detection.


The need for clean data

The thing is, machine learning models can also be compromised by how data is selected and cleansed. Bias can be introduced when data is selected from large data sets and during data cleansing operations.

Data science teams work with large data sets they obtain from the Business Intelligence team. These data sets are mostly disparate and in need of organization and cleaning. Data engineers and scientists have to spend a lot of time, sometimes running into months, cleaning and organizing the data they receive.

According to a CrowdFlower survey reported by Forbes, data scientists spend around 80% of their time on data preparation, with about 60% going to cleaning and organizing data.


What is dirty data, and where does it come from?

Dirty data is data that is inaccurate, incomplete, or inconsistent. According to Experian, human error contributes to over 60% of dirty data, poor interdepartmental communication is responsible for about 35% of inaccurate data records, and an inadequate data strategy accounts for another 28%.

Dirty data can result from different departments archiving data about the same item or customer in different formats. Add to that mistakes in personal information - inconsistent spelling of the same surname, wrong addresses, and account number discrepancies - and you have data chaos. In large data sets, these mistakes are unavoidable and can't be corrected automatically.
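
Even so, a quick automated profile can surface these problems before a model is trained on them. The sketch below assumes a hypothetical customer table with "surname", "address", and "account_number" columns:

```python
# Minimal sketch: profile a data set for common "dirty data" symptoms.
# The `customers` table and its columns are hypothetical.
import pandas as pd

customers = pd.DataFrame({
    "surname":        ["Smith", "smith ", "Smyth", None],
    "address":        ["1 Main St", "1 Main Street", "22 Oak Ave", "22 Oak Ave"],
    "account_number": ["A-100", "A-100", "A-200", "A-300"],
})

# Missing values per column.
print(customers.isna().sum())

# Records that share an account number and may be the same customer stored twice.
print(customers[customers["account_number"].duplicated(keep=False)])

# Inconsistent spellings and casing in the surname field after basic normalization.
print(customers["surname"].str.strip().str.lower().value_counts())
```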


The cost of dirty data

According to IBM, the yearly cost to the US economy of poor-quality data is $3 trillion. That's not small change. Experian reports that company leaders across the globe suspect that 26% of their data is dirty, and the cost to individual companies runs from 15% to 25% of revenue.

Dirty data impacts productivity, communication, resources, and ultimately the bottom line. 

One problem is that data is often managed at the departmental level. This results in so-called “data silos”, where data is available only to the specific department that processes, manages, and stores it.

Many global companies have invested in data quality solutions, but the problem persists. According to the Experian survey, this is the case because organizations don’t have a centralized and complete data management strategy. 

As Thomas Schutz, senior vice president and general manager for Experian Data Quality, put it: “Very few organizations have appointed a centralized manager for data quality and most lack sophistication in their data management methods. Organizations need to do more than buy a new piece of software; they need to make data quality an organizational priority and put the right team in place to manage that complex effort.”


Bottom line

Both biases in machine learning models and dirty data compromise the trustworthiness of machine learning predictions. Continual evaluation in your ML model validation practices is essential to ensure consistent and repeatable outcomes.

Before applying any algorithm or machine learning model to a dataset, data science teams must validate the quality and consistency of the dataset using tools like Datahunter that quickly and accurately locate bias and provide solutions to the ML developers.

At the end of the day, ML models can't think for themselves. Their output directly reflects the input they receive.

