Data Cleaning Script Generator
Added Apr 1, 2026
About This Prompt
Data cleaning often consumes the bulk of an analysis project (commonly estimated at 60-80% of the effort), yet it is frequently done ad hoc, with no record of which transformations were applied. This prompt generates a structured cleaning pipeline that handles the most common data quality issues and produces a detailed change report. The profiling step surfaces issues you might not know about, and the business-rule validation ensures the cleaned data actually makes sense for your domain. Batch mode lets the script be automated for recurring data loads. Essential for analysts working with messy real-world data, data engineers building ingestion pipelines, and anyone preparing data for machine learning training.
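The profile-clean-validate-report flow described above can be sketched in a few lines. This is a minimal illustration, not the full generated script; the `run_pipeline` function, the `price` column, and the rule checked are all illustrative assumptions.

```python
import pandas as pd

def run_pipeline(df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    """Sketch of the pipeline stages: profile -> clean -> validate -> report."""
    report = {"changes": []}

    # 1. Profile: record row counts, null rates, and duplicates before touching data
    report["profile"] = {
        "rows": len(df),
        "nulls": df.isnull().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }

    # 2. Clean: work on a copy so the raw data stays untouched
    cleaned = df.copy()
    before = len(cleaned)
    cleaned = cleaned.drop_duplicates()
    report["changes"].append(f"dropped {before - len(cleaned)} duplicate rows")

    # 3. Validate: an example business rule (prices must be positive)
    violations = int((cleaned["price"] <= 0).sum())
    report["changes"].append(f"{violations} rows violate price > 0")

    return cleaned, report
```

Each stage appends to the report dict, which is what makes the change log auditable later.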
Variables to Customize
[LANGUAGE]
Programming language for the script
Example: Python with pandas
[DATASET_DESCRIPTION]
What the dataset contains
Example: customer transactions from an e-commerce platform with 500K rows
[DATA_ISSUES]
Known data quality problems
Example: duplicate transaction IDs, negative prices, inconsistent date formats (MM/DD/YYYY and YYYY-MM-DD mixed), null email addresses, and product names with special characters
[OUTLIER_METHOD]
How to detect outliers
Example: IQR method with 1.5x multiplier
[BUSINESS_RULES]
Rules the data must satisfy
Example: prices must be between $0.01 and $10,000, quantities must be positive integers, transaction dates must be within the last 2 years
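A business-rule check built from the example above might look like the following sketch. The column names (`price`, `quantity`, `transaction_date`) are assumptions drawn from the example dataset description, and the 2-year window is approximated as 730 days.

```python
import pandas as pd
from datetime import datetime, timedelta

def check_business_rules(df: pd.DataFrame) -> pd.Series:
    """Return a boolean mask: True for rows satisfying all example rules."""
    two_years_ago = datetime.now() - timedelta(days=730)
    return (
        df["price"].between(0.01, 10_000)        # prices in the allowed range
        & (df["quantity"] > 0)                    # quantities must be positive
        & (df["quantity"] % 1 == 0)               # ... and whole numbers
        & (df["transaction_date"] >= two_years_ago)  # recent transactions only
    )
```

Returning a mask rather than dropping rows directly lets the calling script count, log, or quarantine violations before deciding what to do with them.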
Tips for Best Results
- Always run the profiling step before defining your cleaning strategy
- Keep the original data untouched and create a cleaned copy
- Save the cleaning report for data lineage documentation
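Saving the report for lineage, as the last tip suggests, can be as simple as dumping it to JSON with a timestamp. The `save_report` helper name is illustrative.

```python
import json
from datetime import datetime, timezone

def save_report(report: dict, path: str) -> None:
    """Persist the cleaning report alongside the cleaned data for lineage."""
    report["generated_at"] = datetime.now(timezone.utc).isoformat()
    with open(path, "w") as f:
        # default=str handles non-serializable values such as numpy ints
        json.dump(report, f, indent=2, default=str)
```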
Example Output
```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('data_cleaner')


class TransactionDataCleaner:
    def __init__(self, filepath: str):
        self.filepath = filepath
        self.report = {'changes': [], 'stats': {}}

    def profile(self, df: pd.DataFrame) -> dict:
        """Generate a data quality profile before any cleaning is applied."""
        profile = {
            'total_rows': len(df),
            'total_columns': len(df.columns),
            'null_percentages': (df.isnull().sum() / len(df) * 100).to_dict(),
            'duplicate_rows': int(df.duplicated().sum()),
        }
        logger.info(f"Profiled {profile['total_rows']} rows across {profile['total_columns']} columns")
        return profile
```
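An outlier step matching the IQR example in the variables above could be added as a standalone helper. This is a sketch under the stated method (1.5x IQR fences); the function name is illustrative.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value falls outside the IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return (series < lower) | (series > upper)
```

Flagging rather than dropping keeps the decision (remove, cap, or investigate) in the cleaning step, where it can be logged in the change report.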