No. 45 · Research · Machine Learning
An ensemble built for the imbalance the metric actually rewards.
The Ana-Verse 2.0 dataset is 1.6 million timestamped readings from five sensors at an energy plant. Only 0.86% of samples are anomalies · a 115:1 class ratio that punishes naive classifiers. Two of the sensors return values up to ~1e38, an exponential scale that destroys tree splits unless transformed. This solution combines log-transforms, 60+ engineered features, a 3-model gradient-boosted ensemble (LightGBM, XGBoost, CatBoost), per-fold SMOTE inside 5-fold stratified CV, and a joint grid search over ensemble weights and the decision threshold, optimised end-to-end for F1.
Act I · The Problem
Accuracy is the wrong metric. The metric is F1. So every anomaly in the 0.86% matters more than the entire 99.14% of normal readings put together.
A naive classifier that always predicts normal is right 99.14% of the time and useless.
Act II · The Techniques
Each move targets a specific failure mode of naive baselines on this exact dataset.
01 · Two sensors return exp()-scale values. log1p() recovers the linear scale and dramatically improves tree splits · this single move dominates the rest.
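A minimal sketch of the transform (numpy only; the column values below are made up, not real X3 readings):

```python
import numpy as np

# Hypothetical X3-style readings spanning an exponential range up to ~1e38.
x3 = np.array([2.0, 1e4, 1e12, 1e38])

# log1p compresses the exponential range back to a roughly linear one;
# log1p(0) == 0, so zero readings stay well-defined where plain log() would not.
x3_log = np.log1p(x3)
```

On the transformed scale the four readings sit a few units apart instead of 38 orders of magnitude apart, so a tree can place meaningful split points between them.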
02 · Row-wise stats, log-scale stats, pairwise ratios, differences, products, polynomial squared terms, cyclical hour/day-of-week/month encodings.
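One of those groups, the cyclical hour encoding, can be sketched like this (the hour values are illustrative; the same trick applies to day-of-week with period 7 and month with period 12):

```python
import numpy as np

hour = np.arange(24)  # hour-of-day, as would be extracted from the Date column

# A raw hour feature puts 23:00 and 00:00 at opposite ends of the range.
# Projecting the hour onto a circle makes them neighbours again.
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
```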
03 · Resampling applied inside each CV fold, never across folds · avoids the leakage that makes offline scores lie.
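The per-fold pattern, sketched on synthetic data. Plain random oversampling stands in for SMOTE here to keep the sketch dependency-light (the real run uses imblearn's SMOTE at ratio 0.35); the point is where the resampling happens, not which resampler:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: ~5% positives (the real set is 0.86%).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Resample ONLY the training fold. The validation fold is never touched,
    # so no synthetic point can leak information across the split.
    pos = np.flatnonzero(y_tr == 1)
    target = int(0.35 * (y_tr == 0).sum())          # minority:majority = 0.35
    extra = rng.choice(pos, size=max(target - pos.size, 0), replace=True)
    X_bal = np.vstack([X_tr, X_tr[extra]])
    y_bal = np.concatenate([y_tr, y_tr[extra]])
    # ... fit the three models on (X_bal, y_bal), score on X[val_idx] ...
```

Resampling before splitting would place near-copies of training points in the validation fold and inflate every offline score.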
04 · Early stopping per fold. Out-of-fold probabilities collected for every model · used for honest threshold calibration, not for picking the best fold.
05 · Grid over ensemble weights (step 0.05) and decision threshold (step 0.01). Maximises F1 directly. Decoupling the two is suboptimal here.
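A toy version of the joint search, with hand-made out-of-fold probabilities standing in for the real ones:

```python
import numpy as np

def f1(y_true, y_pred):
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical out-of-fold probabilities for 10 samples, 2 anomalies.
y     = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
p_lgb = np.array([.10, .20, .15, .30, .05, .25, .40, .20, .70, .90])
p_xgb = np.array([.05, .30, .10, .25, .10, .20, .45, .15, .65, .85])
p_cb  = np.array([.15, .10, .20, .35, .05, .30, .35, .25, .75, .95])

# Joint grid: weights in steps of 0.05 (summing to 1), threshold in steps of 0.01.
best_f1, best_w, best_t = -1.0, None, None
for w1 in np.arange(0, 1.0001, 0.05):
    for w2 in np.arange(0, 1.0001 - w1, 0.05):
        w3 = 1.0 - w1 - w2
        p = w1 * p_lgb + w2 * p_xgb + w3 * p_cb
        for t in np.arange(0.05, 0.95, 0.01):
            score = f1(y, (p >= t).astype(int))
            if score > best_f1:
                best_f1, best_w, best_t = score, (w1, w2, w3), t
```

Searching weights and threshold together matters because the optimal threshold shifts with the blend: a recall-heavy weighting wants a higher cut-off than a precision-heavy one.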
06 · Final models retrain on the full dataset for ~800 iterations · the average early-stopped count across folds. Submission is produced from the retrained ensemble.
Act III · The Architecture
No mystery. Every step is in solution.py. The Colab notebook adds GPU acceleration for XGBoost and CatBoost.
Raw Data (5 sensors X1-X5 + Date)
  |
Feature Engineering (60+ features)
  |
5-Fold Stratified CV
  |-- per fold: RobustScaler -> SMOTE 0.35 -> train LGB / XGB / CB with early stopping
  |-- collect out-of-fold probabilities
  |
Grid Search: ensemble weights (step 0.05) + threshold (step 0.01) -> maximise F1
  |
Retrain on full data (SMOTE + ~800 iterations)
  |
Weighted ensemble prediction -> apply threshold -> submission.parquet + submission.csv
Act IV · The Ensemble
Three different libraries see the same data in three different ways. The ensemble averages out their respective biases, then the threshold search picks the F1-maximising operating point.
| Model | Trees / iters | Depth | LR | Class weight | Why it's here |
|---|---|---|---|---|---|
| LightGBM | 2000 | 10 | 0.03 | balanced | Fast, leaf-wise growth, 63 leaves · finds tight splits in the engineered ratio features. |
| XGBoost | 2000 | 8 | 0.03 | scale_pos_weight = 50 | Histogram method, hardest on the minority class, complements LGBM's recall pattern. |
| CatBoost | 2000 | 8 | 0.03 | auto-balanced | Symmetric trees + ordered boosting reduce target leakage on the cyclical date features. |
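How the three probability streams become one prediction, with made-up weights and threshold (the real values come out of the grid search, not from here):

```python
import numpy as np

# Hypothetical per-model probabilities for three samples.
p_lgb = np.array([0.02, 0.91, 0.40])
p_xgb = np.array([0.05, 0.88, 0.55])
p_cb  = np.array([0.01, 0.95, 0.35])

w_lgb, w_xgb, w_cb = 0.45, 0.35, 0.20   # assumed weights, not the tuned ones
threshold = 0.48                         # assumed operating point, likewise

p_ens = w_lgb * p_lgb + w_xgb * p_xgb + w_cb * p_cb
pred = (p_ens >= threshold).astype(int)  # -> [0, 1, 0]
```

The middle sample clears the threshold; the borderline third one (blend ≈ 0.44) does not · exactly the kind of call the joint threshold search exists to make.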
Act V · Proof
pip install -r requirements.txt, drop the parquet files in the project root, run python solution.py. It outputs submission.parquet and submission.csv.
Upload to Colab, set the runtime to a T4 GPU, run all. XGBoost uses gpu_hist, CatBoost runs its GPU implementation. The submission downloads automatically.
The exploratory plots cover the class distribution, sensor histograms (incl. the X3 / X4 exponential tail), feature importances, model comparison and threshold sensitivity curves.
Scaling and SMOTE happen inside the fold. The threshold is calibrated on out-of-fold probabilities, never on training-fold probabilities.
I build production-aware ML · the kind that names its assumptions, calibrates its thresholds and ships a reproducible run, not a screenshot of a leaderboard.