Data Cleaning and Transformation

Feature engineering begins with addressing data quality issues that can bias or confuse models.

  • Handling Missing Data: Median imputation is a safe baseline for skewed data because the median is robust to extreme values. KNN Imputation can give more contextually accurate estimates by leveraging relationships between features, but it is computationally more expensive and, being distance-based, sensitive to feature scale.
  • Outlier Management: For skewed distributions (like car prices), prefer the IQR method, which flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR and does not assume normality. The Z-score method is reliable only for roughly normal distributions, because the mean and standard deviation it relies on are themselves distorted by outliers. For extreme values, capping (winsorizing) at specific percentiles (e.g., 1st and 99th) is often safer than dropping rows, since the rest of each record is kept.
  • Correcting Skewness: Highly skewed features (|skewness| > 1) can hinder linear models. Log transformation (np.log1p) is effective for right-skewed, non-negative data, while the Yeo-Johnson transformation is a more flexible alternative that also handles zero and negative values and estimates its parameter from the data.
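
The three cleaning steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not a reference implementation: the column names (PRICE, MILEAGE), the helper functions iqr_bounds and winsorize, and the chosen limits are all assumptions made for the example. For brevity everything is fitted on the full table; in a real project the imputer and transformer would be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed data; PRICE and MILEAGE are hypothetical column names.
df = pd.DataFrame({
    "PRICE": rng.lognormal(mean=9.5, sigma=0.8, size=200),
    "MILEAGE": rng.lognormal(mean=11.0, sigma=0.6, size=200),
})
df.loc[::25, "MILEAGE"] = np.nan  # inject a few missing values

# 1. Missing data: median baseline vs. KNN imputation.
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=5).fit_transform(df), columns=df.columns
)

# 2. Outliers: IQR fences flag candidates, winsorizing caps instead of dropping.
def iqr_bounds(s, k=1.5):
    """Return the lower and upper IQR fences for a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def winsorize(s, lower=0.01, upper=0.99):
    """Cap values at the given lower and upper percentiles."""
    return s.clip(s.quantile(lower), s.quantile(upper))

low, high = iqr_bounds(knn_filled["PRICE"])
n_flagged = int(((knn_filled["PRICE"] < low) | (knn_filled["PRICE"] > high)).sum())
capped = winsorize(knn_filled["PRICE"])

# 3. Skewness: log1p for non-negative data, Yeo-Johnson as a general alternative.
log_price = np.log1p(capped)
yj_price = PowerTransformer(method="yeo-johnson").fit_transform(capped.to_frame()).ravel()

print("rows flagged by IQR fences:", n_flagged)
print("skew before:", capped.skew(), "after log1p:", log_price.skew())
```

Comparing the skewness before and after the transform is a quick way to check whether the transformation was worth applying to a given column.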

Feature Derivation and Scaling

Raw data often hides the true drivers of a target variable. Creating domain-specific features and ensuring proper scaling are critical for model convergence.

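  • Feature Derivation: Convert raw dates or years into meaningful durations (e.g., CAR_AGE = 2024 - PROD_YEAR, where 2024 stands for the reference year of the data) and create ratios (e.g., PRICE_PER_KM) to capture value-for-usage metrics. A ratio built from the target variable itself must not be fed back to the model as an input, since it leaks the target.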
  • Scaling: Always scale numerical features when using distance-based models (KNN, SVM) or neural networks. RobustScaler is preferred over StandardScaler when the dataset contains significant outliers, as it uses median and IQR rather than mean and standard deviation.
  • Leakage Prevention: Always split data into training and testing sets before fitting scalers or encoders. Fitting on the entire dataset leaks information from the test set into the training process, leading to over-optimistic evaluation metrics.
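
A minimal sketch of the derivation, splitting, and scaling order described above, using synthetic data. The column names, the REFERENCE_YEAR constant, and the KM_PER_YEAR ratio are assumptions chosen for illustration; the ratio deliberately avoids the price so that it stays valid even when price is the prediction target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
n = 100
# Synthetic data; every column name here is hypothetical.
df = pd.DataFrame({
    "PROD_YEAR": rng.integers(2000, 2025, size=n),      # years 2000..2024
    "MILEAGE_KM": rng.integers(0, 300_000, size=n),
    "ENGINE_VOLUME": rng.uniform(1.0, 4.0, size=n),
})
REFERENCE_YEAR = 2024  # assumed year in which the data was collected

# Derived features: a duration and a usage-intensity ratio.
df["CAR_AGE"] = REFERENCE_YEAR - df["PROD_YEAR"]
df["KM_PER_YEAR"] = df["MILEAGE_KM"] / (df["CAR_AGE"] + 1)  # +1 avoids division by zero

features = ["CAR_AGE", "MILEAGE_KM", "ENGINE_VOLUME", "KM_PER_YEAR"]

# Split first, so the test rows never influence the scaler's statistics.
X_train, X_test = train_test_split(df[features], test_size=0.2, random_state=0)

scaler = RobustScaler()                          # centers on median, scales by IQR
X_train_scaled = scaler.fit_transform(X_train)   # statistics come from training rows only
X_test_scaled = scaler.transform(X_test)         # same statistics reused on test rows
```

Wrapping the scaler and the model in a sklearn.pipeline.Pipeline gives the same guarantee automatically during cross-validation, because the pipeline refits the scaler inside each training fold.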

Categorical Encoding Strategies

Choosing the right encoding method depends on a feature's cardinality (number of unique values) and the model type.

  • Low Cardinality (<10): One-Hot Encoding (OHE) is the standard. Use sklearn.preprocessing.OneHotEncoder rather than pd.get_dummies for production pipelines to ensure consistent handling of categories across train/test splits.
  • Medium to High Cardinality: Avoid OHE to prevent dimensionality explosion.
    • Target Encoding: Replaces categories with the mean of the target variable; highly effective but requires careful implementation to avoid data leakage.
    • Frequency/Count Encoding: Replaces categories with their occurrence frequency or raw count, capturing market concentration signals.
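    • Binary Encoding: A compact option for high-cardinality features. Each category is assigned an integer whose binary digits are spread across columns, so 1,590 unique models need only 11 bit columns instead of 1,590 one-hot columns. (Frequency and target encoding are more compact still, at one column each.) The individual bits carry no meaning of their own, so this is best suited for tree-based models, where interpretability of single columns is less critical.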
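
The encoders above can be sketched as follows on synthetic data. The column names (FUEL, MODEL, PRICE) and the helpers target_encode_oof and binary_encode are hypothetical and written for this example; libraries such as category_encoders offer maintained versions of the latter two. The target encoder shown is one possible leakage-aware variant (out-of-fold means with smoothing), not the only way to implement it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
n = 500
# Synthetic training data; FUEL, MODEL and PRICE are hypothetical names.
train = pd.DataFrame({
    "FUEL": rng.choice(["petrol", "diesel", "hybrid"], size=n),
    "MODEL": rng.choice([f"model_{i}" for i in range(40)], size=n),
})
y = pd.Series(rng.normal(10_000, 2_000, size=n), name="PRICE")
# The test rows include categories never seen during training.
test = pd.DataFrame({"FUEL": ["petrol", "electric"], "MODEL": ["model_3", "model_999"]})

# Low cardinality: one-hot, fitted on train; unseen categories become all-zero rows.
ohe = OneHotEncoder(handle_unknown="ignore").fit(train[["FUEL"]])
fuel_test = ohe.transform(test[["FUEL"]]).toarray()

# Frequency encoding: share of training rows per category, 0 for unseen ones.
freq = train["MODEL"].value_counts(normalize=True)
model_freq_test = test["MODEL"].map(freq).fillna(0.0)

def target_encode_oof(cat, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold smoothed target mean: no row is encoded using its own target."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(cat):
        prior = target.iloc[fit_idx].mean()
        stats = target.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["sum", "count"])
        smooth = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smooth).fillna(prior).to_numpy()
    return encoded

model_te_train = target_encode_oof(train["MODEL"], y)

def binary_encode(cat, categories):
    """Give each category an integer code (0 = unseen) and split it into bit columns."""
    mapping = {c: i + 1 for i, c in enumerate(categories)}
    codes = cat.map(mapping).fillna(0).astype(int).to_numpy()
    n_bits = max(1, len(categories).bit_length())  # bits needed for the largest code
    bits = (codes[:, None] >> np.arange(n_bits)[::-1]) & 1
    columns = [f"{cat.name}_bit{i}" for i in range(n_bits)]
    return pd.DataFrame(bits, index=cat.index, columns=columns)

categories = sorted(train["MODEL"].unique())
model_bits_test = binary_encode(test["MODEL"], categories)
print(model_bits_test.shape[1], "bit columns for", len(categories), "categories")
```

In each case the mapping is learned from the training rows and then applied unchanged to the test rows, which is what keeps the encoding consistent across splits.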