Enhancing Data Quality with Cleanlab

Introduction

It is well established that a machine-learning model is only as good as the data it is fed. An ML model trained on poor-quality data usually has a number of issues. Here are a few ways that bad data can affect machine-learning models:

1. Errors, missing values, or other irregularities in low-quality data can lead to wrong predictions. If the training data is unreliable, the model's predictions are likely to be inaccurate.

2. Bad data can also bias the model. If the data is not representative of real-world situations, the ML model can learn and reinforce these biases, which can result in discriminatory predictions.

3. Poor data also impairs the ability of an ML model to generalize to fresh data, because it may not accurately capture the underlying patterns and relationships.

4. Models trained on bad-quality data might need more retraining and maintenance. The overall cost and complexity of model deployment could rise as a result.

As a result, it is critical to devote time and effort to data preprocessing and cleaning in order to decrease the impact of bad data on ML models. Furthermore, to ensure the model's dependability and performance, it is often necessary to use domain knowledge to recognize and address data quality issues.

It might come as a surprise, but gold-standard datasets like ImageNet,

The above snippet is from the ... Cleanlab can come in handy here. It automatically identifies problems in your ML dataset and assists you in cleaning both data and labels. This data-centric AI software uses your existing models to estimate dataset problems, which can then be fixed to train even better models. The graphic below depicts the typical data-centric AI model development cycle:
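To make the idea of "using an existing model to estimate dataset problems" more concrete, here is a minimal sketch in plain Python. It is a simplified illustration of the general principle (compare each given label against what the model is confident about), not Cleanlab's actual implementation. The function name `flag_suspect_labels` and the toy probabilities are hypothetical, and the sketch assumes every class has at least one labeled example.

```python
from statistics import mean


def flag_suspect_labels(pred_probs, labels):
    """Return indices whose given label disagrees with a confident prediction.

    pred_probs: list of per-class probability lists (ideally out-of-sample).
    labels: list of integer class labels as given in the dataset.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the average probability the model assigns to
    # class k over the examples that are labeled k.
    thresholds = [
        mean(p[k] for p, y in zip(pred_probs, labels) if y == k)
        for k in range(n_classes)
    ]

    suspects = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes whose predicted probability clears that class's threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                suspects.append(i)
    return suspects


# Toy data: the third example is labeled 0, but the model strongly favors 1.
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
labels = [0, 0, 0, 1, 1]
suspects = flag_suspect_labels(pred_probs, labels)
print(suspects)
```

Using a per-class threshold rather than a fixed cutoff such as 0.5 lets the check adapt to classes the model is systematically more or less confident about.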

Data Preview

Code

Let’s now delve into the code. For demonstration purposes, we inject 5% label noise into the dataset and see whether we can detect the noisy labels and eventually train a better model.
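The noise-injection step can be sketched in isolation as follows. Note that this sketch flips a random sample of 'ham' labels, whereas the article's code flips the first 5% of 'ham' rows. The helper name `flip_labels`, its parameters, and the toy label list are hypothetical and only serve as an illustration.

```python
import random


def flip_labels(labels, source="ham", target="spam", fraction=0.05, seed=0):
    """Flip a fraction of `source` labels to `target`.

    Returns the noisy label list and the sorted indices that were flipped,
    so that detection quality can be checked later.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    candidates = [i for i, y in enumerate(labels) if y == source]
    n_flip = int(len(candidates) * fraction)
    flipped = sorted(rng.sample(candidates, n_flip))

    noisy = list(labels)  # copy, so the clean labels stay untouched
    for i in flipped:
        noisy[i] = target
    return noisy, flipped


# Toy dataset: 80 ham and 20 spam messages.
labels = ["ham"] * 80 + ["spam"] * 20
noisy_labels, flipped_idx = flip_labels(labels)
print(len(flipped_idx), noisy_labels.count("spam"))
```

Keeping the flipped indices around is what makes it possible to verify afterwards whether the noisy labels were actually found.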

Note: I have also annotated every segment of the code wherever necessary for better understanding.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
from cleanlab.classification import CleanLearning
from sklearn.metrics import f1_score

# Reading and renaming data. We set sep='\t' because the data is tab-separated,
# and header=None so that the columns get the integer names 0 and 1.
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
data.rename({0:'label', 1:'text'}, inplace=True, axis=1)

# Dropping duplicated texts (keep=False removes every copy of a duplicated text)
data.drop_duplicates(subset=['text'], keep=False, inplace=True)

# Original data distribution for spam and not spam (ham) categories
print(data['label'].value_counts(normalize=True))
# Output:
# ham     0.865937
# spam    0.134063
    
# Adding noise. Switching 5% of ham data to 'spam' label
tmp_df = data[data['label']=='ham']
examples_to_change = int(tmp_df.shape[0]*0.05)
print(f'Changing examples: {examples_to_change}')
# Output:
# Changing examples: 216

examples_text_to_change = tmp_df.head(examples_to_change)['text'].tolist()
changed_df = pd.DataFrame([[i, 'spam'] for i in examples_text_to_change])
changed_df.rename({0:'text', 1:'label'}, axis=1, inplace=True)
left_data = data[~data['text'].isin(examples_text_to_change)]
final_df = pd.concat([left_data, changed_df])
final_df.reset_index(drop=True, inplace=True)

# Modified data distribution for spam and not spam (ham) categories
print(final_df['label'].value_counts(normalize=True))
# Output:
# ham     0.840016
# spam    0.159984
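raw_texts, raw_labels = final_df["text"].values, final_df["label"].values

# Splitting into train and test sets (test_size=0.2 is an assumed value)
raw_train_texts, raw_test_texts, raw_train_labels, raw_test_labels = train_test_split(
    raw_texts, raw_labels, test_size=0.2
)

# Converting labels into integers
encoder = LabelEncoder()
encoder.fit(raw_train_labels)
train_labels = encoder.transform(raw_train_labels)
test_labels = encoder.transform(raw_test_labels)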
    
# Vectorizing text sequences using sentence-transformers
transformer = SentenceTransformer('all-mpnet-base-v2')
train_texts = transformer.encode(raw_train_texts)
test_texts = transformer.encode(raw_test_texts)

# Instantiating the model
model = LogisticRegression(max_iter=200)

# Wrapping the scikit-learn model with CleanLearning (CL)
cl = CleanLearning(model)
 
# Finding label issues in the train set
label_issues = cl.find_label_issues(X=train_texts, labels=train_labels)

# Picking the 50 samples with the lowest label quality scores
identified_issues = label_issues[label_issues["is_label_issue"] == True]
lowest_quality_labels = label_issues["label_quality"].argsort()[:50].to_numpy()
    
# Pretty-print the label issues detected by Cleanlab
def print_as_df(index):
    return pd.DataFrame(
        {
            "text": raw_train_texts,
            "given_label": raw_train_labels,
            "predicted_label": encoder.inverse_transform(label_issues["predicted_label"]),
        },
    ).iloc[index]

print_as_df(lowest_quality_labels[:5])
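Since the noise was injected deliberately, one way to check "whether we can detect it" is to compare the flagged indices with the indices that were actually flipped. The sketch below assumes you have kept track of those flipped indices in the same index space as the training set (the listing above does not do this). The function name `detection_scores` and the toy index lists are hypothetical.

```python
def detection_scores(flagged, truly_noisy):
    """Precision and recall of flagged label issues against known noisy labels."""
    flagged, truly_noisy = set(flagged), set(truly_noisy)
    hits = flagged & truly_noisy
    # Precision: share of flagged examples that really were flipped.
    precision = len(hits) / len(flagged) if flagged else 0.0
    # Recall: share of flipped examples that were flagged.
    recall = len(hits) / len(truly_noisy) if truly_noisy else 0.0
    return precision, recall


# Toy example: 4 flagged examples, 3 truly flipped labels, 2 in common.
flagged = [3, 7, 12, 20]
truly_noisy = [3, 12, 15]
precision, recall = detection_scores(flagged, truly_noisy)
print(precision, recall)
```

Low precision here is not necessarily a failure: some flagged examples may be genuine labeling mistakes that were already present in the original dataset.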
