Work / Research / Ana-Verse 2.0

Competition: Ana-Verse 2.0 (Kaggle) · Metric: F1 Score · Task: Binary anomaly detection

No. 45 · Research · Machine Learning

1.6 million readings.
0.86% anomalies.
F1, not accuracy.

An ensemble built for the imbalance the metric actually rewards.

Abstract

The Ana-Verse 2.0 dataset is 1.6 million timestamped readings from five sensors at an energy plant. Only 0.86% of samples are anomalies · a 115:1 class ratio that punishes naive classifiers. Two of the sensors return values up to ~1e38, an exponential scale that destroys tree splits unless transformed. This solution combines log-transforms, 60+ engineered features, a 3-model gradient-boosted ensemble (LightGBM, XGBoost, CatBoost), per-fold SMOTE inside 5-fold stratified CV, and a joint grid search over ensemble weights and the decision threshold, optimised end-to-end for F1.

1.6M+ · Training rows
115 : 1 · Class imbalance
60+ · Engineered features
3-stack · LGBM · XGB · CatBoost
5-fold · Stratified CV with SMOTE

Act I · The Problem

The model that guesses normal scores 99.14% accuracy.

Accuracy is the wrong metric. The metric is F1. Under F1, every point gained or lost on the 0.86% of anomalies matters more than the entire 99.14% of normal readings put together.

A naive classifier that always predicts normal is right 99.14% of the time and useless.

  • Class 0 · normal · 99.14%
  • Class 1 · anomaly · 0.86%
  • Ratio · 115 : 1
  • X3, X4 max · ~1e38

Act II · The Techniques

Six moves the leaderboard actually rewards.

Each move targets a specific failure mode of naive baselines on this exact dataset.

01

log1p on X3 and X4.

Two sensors return exp()-scale values. log1p() recovers the linear scale and dramatically improves tree splits · this single move dominates the rest.
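A minimal numpy sketch of the effect (values illustrative, not actual sensor readings):

```python
import numpy as np

# Hypothetical exp()-scale readings, like X3 / X4 (up to ~1e38).
x = np.array([1.0, 1e6, 1e19, 1e38])

# log1p compresses 38 orders of magnitude into a range a tree can split on.
x_log = np.log1p(x)

print(x_log)  # roughly [0.69, 13.82, 43.75, 87.50]
```

Without the transform, nearly every candidate split threshold lands in the empty space between magnitudes; after it, the values are evenly spread.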

02

60+ engineered features.

Row-wise stats, log-scale stats, pairwise ratios, differences, products, polynomial squared terms, cyclical hour/day-of-week/month encodings.
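A hedged sketch of a few of those feature families, assuming the raw columns are X1–X5 plus Date as shown in the pipeline diagram; the helper name and exact feature set are illustrative:

```python
import numpy as np
import pandas as pd

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative subset of the feature families described above."""
    out = df.copy()
    sensors = ["X1", "X2", "X3", "X4", "X5"]

    # Log-scale versions of the exponential sensors.
    out["X3_log"] = np.log1p(out["X3"])
    out["X4_log"] = np.log1p(out["X4"])

    # Row-wise stats across all five sensors.
    out["row_mean"] = out[sensors].mean(axis=1)
    out["row_std"] = out[sensors].std(axis=1)

    # One pairwise ratio and difference (of many).
    out["X1_X2_ratio"] = out["X1"] / (out["X2"] + 1e-9)
    out["X1_minus_X2"] = out["X1"] - out["X2"]

    # Cyclical hour encoding, so 23:00 and 00:00 are neighbours.
    hour = pd.to_datetime(out["Date"]).dt.hour
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    return out
```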

03

SMOTE per fold (0.35).

Resampling applied inside each CV fold, never across folds · avoids leakage that makes offline scores lie.

04

5-fold stratified CV.

Early stopping per fold. Out-of-fold probabilities collected for every model · used for honest threshold calibration, not for picking the best fold.

05

Joint weight + threshold search.

Grid over ensemble weights (step 0.05) and decision threshold (step 0.01). Maximises F1 directly. Decoupling the two is suboptimal here.
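A sketch of the joint search over out-of-fold probabilities, with F1 computed directly from confusion counts and the 0.05 / 0.01 step sizes described above (function name illustrative):

```python
import numpy as np
from itertools import product

def joint_search(p_lgb, p_xgb, p_cb, y):
    """Jointly grid-search ensemble weights (step 0.05, summing to 1)
    and decision threshold (step 0.01), maximising F1."""
    y = np.asarray(y)
    weights = np.round(np.arange(0.0, 1.0001, 0.05), 2)
    thresholds = np.round(np.arange(0.05, 0.9501, 0.01), 2)
    best_f1, best_w, best_t = 0.0, None, None
    for w1, w2 in product(weights, weights):
        if w1 + w2 > 1.0 + 1e-9:
            continue
        w3 = round(1.0 - w1 - w2, 2)
        p = w1 * p_lgb + w2 * p_xgb + w3 * p_cb
        pred = p[None, :] >= thresholds[:, None]   # (T, N): all thresholds at once
        tp = (pred & (y == 1)).sum(axis=1)
        fp = (pred & (y == 0)).sum(axis=1)
        fn = (~pred & (y == 1)).sum(axis=1)
        f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)   # F1 = 2TP/(2TP+FP+FN)
        i = int(np.argmax(f1))
        if f1[i] > best_f1:
            best_f1, best_w, best_t = float(f1[i]), (w1, w2, w3), float(thresholds[i])
    return best_f1, best_w, best_t
```

Searching weights and threshold together matters because the best threshold shifts as the blend changes; fixing one before the other leaves F1 on the table.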

06

Full-data retrain.

Final models retrain on the full dataset for ~800 iterations · the average early-stopped count across folds. Submission is produced from the retrained ensemble.
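The retrain iteration count can be derived as in this sketch (fold counts hypothetical); with no held-out fold left, there is nothing to early-stop against, so the average of the per-fold stopping points stands in:

```python
import numpy as np

# Hypothetical best-iteration counts returned by early stopping in each CV fold.
fold_best_iters = [742, 810, 795, 860, 788]

# Retrain on the full dataset for the average early-stopped count.
final_iters = int(np.mean(fold_best_iters))
print(final_iters)  # 799
```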

Act III · The Architecture

Pipeline, end to end.

No mystery. Every step is in solution.py. The Colab notebook adds GPU acceleration for XGBoost and CatBoost.

solution.py · pipeline
  Raw Data (5 sensors X1-X5 + Date)
      |
  Feature Engineering (60+ features)
      |
  5-Fold Stratified CV
      |-- per fold: RobustScaler -> SMOTE 0.35 -> train LGB / XGB / CB with early stopping
      |-- collect out-of-fold probabilities
      |
  Grid Search: ensemble weights (step 0.05) + threshold (step 0.01)  ->  maximise F1
      |
  Retrain on full data (SMOTE + ~800 iterations)
      |
  Weighted ensemble prediction  ->  apply threshold  ->  submission.parquet + submission.csv

Act IV · The Ensemble

Three boosted models. Each with a job.

Three different libraries see the same data three different ways. The ensemble averages out their individual biases, then the threshold search picks the F1-maximising operating point.

Model      Trees / iters  Depth  LR    Class weight           Why it's here
LightGBM   2000           10     0.03  balanced               Fast, leaf-wise growth, 63 leaves · finds tight splits in the engineered ratio features.
XGBoost    2000           8      0.03  scale_pos_weight = 50  Histogram method, hardest on the minority class, complements LGBM's recall pattern.
CatBoost   2000           8      0.03  auto-balanced          Symmetric trees + ordered boosting reduce target leakage on the cyclical date features.
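The table translates into parameter sketches like these — names follow each library's Python estimator API; the exact values live in solution.py:

```python
# LightGBM: leaf-wise growth capped at 63 leaves, balanced class weights.
lgbm_params = {
    "n_estimators": 2000, "max_depth": 10, "num_leaves": 63,
    "learning_rate": 0.03, "class_weight": "balanced",
}
# XGBoost: minority class upweighted via scale_pos_weight.
xgb_params = {
    "n_estimators": 2000, "max_depth": 8,
    "learning_rate": 0.03, "scale_pos_weight": 50,
}
# CatBoost: symmetric trees with automatic class balancing.
cat_params = {
    "iterations": 2000, "depth": 8,
    "learning_rate": 0.03, "auto_class_weights": "Balanced",
}
```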

The Stack

Reproducible. One Python file. One Colab notebook.

  • Python 3.11+
  • LightGBM
  • XGBoost · gpu_hist
  • CatBoost · GPU
  • scikit-learn
  • imblearn · SMOTE
  • RobustScaler
  • pandas · pyarrow
  • Jupyter / Colab T4

Act V · Proof

Public submission. Reproducible run.

solution.py · local run

Run pip install -r requirements.txt, drop the parquet files in the project root, then run python solution.py. Outputs submission.parquet and submission.csv.

solution_colab.ipynb · T4 GPU

Upload to Colab, set runtime to T4 GPU, run all. XGBoost runs gpu_hist, CatBoost runs GPU-accelerated. Submission downloads automatically.

exploration.ipynb · full EDA

Class distribution, sensor histograms (incl. the X3 / X4 exponential tail), feature importances, model comparison, threshold sensitivity curves.

No leakage by design

Scaling and SMOTE happen inside the fold. The threshold is calibrated on out-of-fold probabilities, never on training-fold probabilities.

Anomaly detection that survives the imbalance.

I build production-aware ML · the kind that names its assumptions, calibrates its thresholds and ships a reproducible run, not a screenshot of a leaderboard.