Data Cleaning Script Generator
Added Apr 1, 2026
About This Prompt
Data cleaning often consumes the bulk of an analysis project (commonly estimated at 60-80% of the effort), yet it is frequently done ad hoc, with no record of which transformations were applied. This prompt generates a structured cleaning pipeline that handles the most common data quality issues and produces a detailed change report. The profiling step surfaces issues you might not know about, and the business-rule validation ensures the cleaned data actually makes sense for your domain. Batch mode lets the script be automated for recurring data loads. Essential for analysts working with messy real-world data, data engineers building ingestion pipelines, and anyone preparing data for machine learning training.
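The profile-clean-validate-report flow described above can be sketched in a few lines. This is a minimal illustration, not the full generated script; the `run_pipeline` function, the `price` column, and the rule checked are all illustrative assumptions.

```python
import pandas as pd

def run_pipeline(df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    """Sketch of the pipeline stages: profile -> clean -> validate -> report."""
    report = {"changes": []}

    # 1. Profile: record row counts, null rates, and duplicates before touching data
    report["profile"] = {
        "rows": len(df),
        "nulls": df.isnull().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }

    # 2. Clean: work on a copy so the raw data stays untouched
    cleaned = df.copy()
    before = len(cleaned)
    cleaned = cleaned.drop_duplicates()
    report["changes"].append(f"dropped {before - len(cleaned)} duplicate rows")

    # 3. Validate: an example business rule (prices must be positive)
    violations = int((cleaned["price"] <= 0).sum())
    report["changes"].append(f"{violations} rows violate price > 0")

    return cleaned, report
```

Each stage appends to the report dict, which is what makes the change log auditable later.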
Variables to Customize
[LANGUAGE]
Programming language for the script
Example: Python with pandas
[DATASET_DESCRIPTION]
What the dataset contains
Example: customer transactions from an e-commerce platform with 500K rows
[DATA_ISSUES]
Known data quality problems
Example: duplicate transaction IDs, negative prices, inconsistent date formats (MM/DD/YYYY and YYYY-MM-DD mixed), null email addresses, and product names with special characters
[OUTLIER_METHOD]
How to detect outliers
Example: IQR method with 1.5x multiplier
[BUSINESS_RULES]
Rules the data must satisfy
Example: prices must be between $0.01 and $10,000, quantities must be positive integers, transaction dates must be within the last 2 years
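A business-rule check built from the example above might look like the following sketch. The column names (`price`, `quantity`, `transaction_date`) are assumptions drawn from the example dataset description, and the 2-year window is approximated as 730 days.

```python
import pandas as pd
from datetime import datetime, timedelta

def check_business_rules(df: pd.DataFrame) -> pd.Series:
    """Return a boolean mask: True for rows satisfying all example rules."""
    two_years_ago = datetime.now() - timedelta(days=730)
    return (
        df["price"].between(0.01, 10_000)        # prices in the allowed range
        & (df["quantity"] > 0)                    # quantities must be positive
        & (df["quantity"] % 1 == 0)               # ... and whole numbers
        & (df["transaction_date"] >= two_years_ago)  # recent transactions only
    )
```

Returning a mask rather than dropping rows directly lets the calling script count, log, or quarantine violations before deciding what to do with them.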
Tips for Best Results
- Always run the profiling step before defining your cleaning strategy
- Keep the original data untouched and create a cleaned copy
- Save the cleaning report for data lineage documentation
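Saving the report for lineage, as the last tip suggests, can be as simple as dumping it to JSON with a timestamp. The `save_report` helper name is illustrative.

```python
import json
from datetime import datetime, timezone

def save_report(report: dict, path: str) -> None:
    """Persist the cleaning report alongside the cleaned data for lineage."""
    report["generated_at"] = datetime.now(timezone.utc).isoformat()
    with open(path, "w") as f:
        # default=str handles non-serializable values such as numpy ints
        json.dump(report, f, indent=2, default=str)
```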
Example Output
```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('data_cleaner')


class TransactionDataCleaner:
    def __init__(self, filepath: str):
        self.filepath = filepath
        self.report = {'changes': [], 'stats': {}}

    def profile(self, df: pd.DataFrame) -> dict:
        """Generate a data quality profile before any cleaning is applied."""
        profile = {
            'total_rows': len(df),
            'total_columns': len(df.columns),
            'null_percentages': (df.isnull().sum() / len(df) * 100).to_dict(),
            'duplicate_rows': int(df.duplicated().sum()),
        }
        logger.info(f"Profiled {profile['total_rows']} rows across {profile['total_columns']} columns")
        return profile
```
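An outlier step matching the IQR example in the variables above could be added as a standalone helper. This is a sketch under the stated method (1.5x IQR fences); the function name is illustrative.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value falls outside the IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return (series < lower) | (series > upper)
```

Flagging rather than dropping keeps the decision (remove, cap, or investigate) in the cleaning step, where it can be logged in the change report.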