
Enhancing Data Quality with Cleanlab

  • 600 min read
  • 2024-12-11 08:09:55

Introduction

It is a well-established fact that your machine-learning model is only as good as the data it is fed. An ML model trained on bad-quality data usually has a number of issues. Here are a few ways that bad data might affect machine-learning models:

1. Errors, missing values, or other irregularities in low-quality data can lead to wrong predictions. If the data used for training is unreliable, the model's predictions are likely to be inaccurate.

2. Bad data can also bias the model. If the data is not representative of real-world situations, the ML model can learn and reinforce those biases, which can result in discriminatory predictions.

3. Poor data also hampers the ML model's ability to generalize to fresh data, because it may not effectively capture the underlying patterns and relationships.

4. Models trained on bad-quality data might need more retraining and maintenance. The overall cost and complexity of model deployment could rise as a result.

As a result, it is critical to devote time and effort to data preprocessing and cleaning in order to decrease the impact of bad data on ML models. Furthermore, to ensure the model's dependability and performance, it is often necessary to use domain knowledge to recognize and address data quality issues.
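
Several of the problems listed above can often be surfaced with a few simple checks before any modelling starts. The sketch below is a minimal illustration using only the Python standard library; the `records` list and the `audit_records` helper are hypothetical and are not part of this article's code or of Cleanlab.

```python
from collections import Counter


def audit_records(records):
    """Summarize basic data-quality signals for (text, label) pairs.

    Hypothetical helper, for illustration only.
    """
    # Rows with an empty text or a missing label
    missing = sum(1 for text, label in records if not text or label is None)

    # Extra copies of texts that appear more than once
    text_counts = Counter(text for text, _ in records if text)
    duplicates = sum(count - 1 for count in text_counts.values() if count > 1)

    # Share of each label among the rows that have one
    labelled = [label for _, label in records if label is not None]
    total = len(labelled)
    distribution = (
        {label: count / total for label, count in Counter(labelled).items()}
        if total
        else {}
    )
    return {
        "missing": missing,
        "duplicates": duplicates,
        "label_distribution": distribution,
    }


# Toy records, made up for this example
records = [
    ("Win a free prize now", "spam"),
    ("Are we still meeting at 5?", "ham"),
    ("Are we still meeting at 5?", "ham"),  # duplicate row
    ("", "ham"),                             # missing text
    ("Call me when you get home", None),    # missing label
]
report = audit_records(records)
print(report)
```

Checks like these only catch structural problems; mislabeled examples look perfectly valid on the surface, which is why a model-based approach is needed to find them.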

It might come as a surprise, but gold-standard datasets like ImageNet,

The above snippet is from the [...] Cleanlab can come in handy as your best bet. By automatically identifying problems in your ML dataset, it assists you in cleaning both data and labels. This data-centric AI software uses your existing models to estimate dataset problems that can be fixed to train even better models. The graphic below depicts the typical data-centric AI model development cycle:

[...] ‘all-mpnet-base-v2’ as our choice of sentence-transformers model for vectorizing text sequences. It maps sentences and paragraphs to a 768-dimensional dense vector space. [...] this for the list of all models and their comparisons.

pip install 'cleanlab[all]'
pip install sentence-transformers

Dataset

We picked the

Data Preview

Code

Let’s now delve into the code. For demonstration purposes, we inject 5% label noise into the dataset and see whether we can detect the corrupted labels and eventually train a better model.
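
The noise-injection idea can be illustrated in isolation first. The sketch below flips a fixed fraction of "ham" labels to "spam" and records which positions were changed, so that detected issues can later be compared against the known corruptions. It uses only the standard library; the `inject_label_noise` function, the toy labels, and the random seed are assumptions made for illustration. Note that the walkthrough code takes the first 5% of ham rows rather than a random sample.

```python
import random


def inject_label_noise(labels, source="ham", target="spam", fraction=0.05, seed=0):
    """Return a noisy copy of `labels` and the sorted indices that were flipped.

    Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    # Only examples that currently carry the source label can be flipped
    candidates = [i for i, label in enumerate(labels) if label == source]
    n_flip = int(len(candidates) * fraction)
    flipped = sorted(rng.sample(candidates, n_flip))

    noisy = list(labels)  # copy, so the clean labels stay untouched
    for i in flipped:
        noisy[i] = target
    return noisy, flipped


# Toy label list: 100 ham and 20 spam messages
labels = ["ham"] * 100 + ["spam"] * 20
noisy_labels, flipped = inject_label_noise(labels, fraction=0.05)
print(len(flipped), noisy_labels.count("spam"))
```

Keeping the list of flipped indices is the useful part: it is the ground truth against which any label-issue detector can be scored.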

Note: I have also annotated every segment of the code wherever necessary for better understanding.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
from cleanlab.classification import CleanLearning
from sklearn.metrics import f1_score

# Reading and renaming data. Here we set sep='\t' because the data is tab-separated.
# header=None keeps the first row as data and names the columns 0 and 1,
# which is what the rename below expects.
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
data.rename({0:'label', 1:'text'}, inplace=True, axis=1)

# Dropping any instance of duplicates that could exist
data.drop_duplicates(subset=['text'], keep=False, inplace=True)

# Original data distribution for spam and not spam (ham) categories
print(data['label'].value_counts(normalize=True))
# ham     0.865937
# spam    0.134063
    
# Adding noise. Switching 5% of ham data to 'spam' label
tmp_df = data[data['label']=='ham']
examples_to_change = int(tmp_df.shape[0]*0.05)
print(f'Changing examples: {examples_to_change}')
# Changing examples: 216

examples_text_to_change = tmp_df.head(examples_to_change)['text'].tolist()
changed_df = pd.DataFrame([[i, 'spam'] for i in examples_text_to_change])
changed_df.rename({0:'text', 1:'label'}, axis=1, inplace=True)
left_data = data[~data['text'].isin(examples_text_to_change)]
final_df = pd.concat([left_data, changed_df])
final_df.reset_index(drop=True, inplace=True)

# Modified data distribution for spam and not spam (ham) categories
print(final_df['label'].value_counts(normalize=True))
# ham     0.840016
# spam    0.159984
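
raw_texts, raw_labels = final_df["text"].values, final_df["label"].values

# Splitting into train and test sets (test_size and random_state are assumed values)
raw_train_texts, raw_test_texts, raw_train_labels, raw_test_labels = train_test_split(
    raw_texts, raw_labels, test_size=0.2, random_state=42
)

# Converting label into integers
encoder = LabelEncoder()
encoder.fit(raw_train_labels)
train_labels = encoder.transform(raw_train_labels)
test_labels = encoder.transform(raw_test_labels)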
    
# Vectorizing text sequence using sentence-transformers
transformer = SentenceTransformer('all-mpnet-base-v2')
train_texts = transformer.encode(raw_train_texts)
test_texts = transformer.encode(raw_test_texts)
    
# Instantiating model instance
model = LogisticRegression(max_iter=200)

# Wrapping the scikit-learn model with CleanLearning (CL)
cl = CleanLearning(model)

# Finding label issues in the train set
label_issues = cl.find_label_issues(X=train_texts, labels=train_labels)

# Picking the 50 samples with the lowest label-quality scores
identified_issues = label_issues[label_issues["is_label_issue"] == True]
lowest_quality_labels = label_issues["label_quality"].argsort()[:50].to_numpy()
    
# Pretty-print the label issues detected by Cleanlab
def print_as_df(index):
    return pd.DataFrame(
        {
            "text": raw_train_texts,
            "given_label": raw_train_labels,
            "predicted_label": encoder.inverse_transform(label_issues["predicted_label"]),
        },
    ).iloc[index]

print_as_df(lowest_quality_labels[:5])
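
To make the `label_quality` column used above more concrete: as far as I understand Cleanlab's default behaviour, each example is scored by its "self-confidence", the out-of-sample probability the model assigns to the example's given label, so the lowest scores mark the most suspicious labels. The sketch below reproduces that idea with the standard library only and checks the flagged examples against known corruptions. The function names, the probabilities, and the `known_flips` list are all made up for illustration; this is not Cleanlab's implementation.

```python
def self_confidence(pred_probs, labels):
    """Probability the model assigns to each example's given label."""
    return [probs[label] for probs, label in zip(pred_probs, labels)]


def lowest_quality_indices(scores, k):
    """Indices of the k lowest scores, most suspicious first."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]


def precision_recall(flagged, truly_noisy):
    """Compare flagged indices against the indices known to be corrupted."""
    flagged, truly_noisy = set(flagged), set(truly_noisy)
    hits = len(flagged & truly_noisy)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(truly_noisy) if truly_noisy else 0.0
    return precision, recall


# Toy out-of-sample probabilities for the classes [ham, spam]; values are made up
pred_probs = [
    [0.95, 0.05],
    [0.10, 0.90],
    [0.92, 0.08],  # given label is spam, but the model strongly says ham
    [0.40, 0.60],
    [0.85, 0.15],  # given label is spam, but the model says ham
]
given_labels = [0, 1, 1, 1, 1]  # 0 = ham, 1 = spam
known_flips = [2, 4]            # positions where noise was injected

scores = self_confidence(pred_probs, given_labels)
suspects = lowest_quality_indices(scores, k=2)
print(suspects, precision_recall(suspects, known_flips))
```

Because the noise in this walkthrough was injected deliberately, the same comparison could in principle be run on the real `label_issues` output to see how many of the 216 flipped examples are recovered.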