Every SDN intrusion-detection paper claims 95 to 99 percent accuracy on a benchmark dataset. Production drops it to 70. Nobody publishes that gap. This testbed measures it: 3.5 million flow events, 23 attack classes, sequential clean-room versus concurrent production-realistic, on the same models, on the same VM, on the same day.
A controlled measurement of the gap for which methodology, not models, is responsible.
Act I · The Abstract
SDN intrusion-detection papers routinely report ninety-five to ninety-nine percent accuracy on benchmark datasets. We demonstrate that these results are artifacts of clean-room evaluation, not genuine detection capability. On 3.5 million real flow events across 23 attack classes, the best-performing model's accuracy drops from 79.04 percent to 71.10 percent, a 7.94-percentage-point decline caused entirely by evaluation methodology. This work extends the evaluation critique of Arp et al. (USENIX Security 2022) to SDN flow-based detection.
Act II · The Result
Five classifiers. Two evaluation regimes. One labelled dataset. The drop is not the model's failure. It is the field's.
| Regime | Best classifier | Accuracy | Conditions |
|---|---|---|---|
| V1 · Clean-room | Random Forest | 79.04 % | Sequential single-attack runs. Single switch. Unencrypted. Standard benchmark conditions. |
| V2 · Production-realistic | Random Forest | 71.10 % | Concurrent attacks. Multi-switch topology. Encrypted traffic. False-positive-inducing legitimate operations. |
| Gap | Same model, same data | −7.94 pp | Caused entirely by evaluation methodology, not model capability. |
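The comparison protocol can be sketched with scikit-learn on synthetic stand-in data, since the real flow dataset is available only on request. Lowering `class_sep` is a loose proxy for the overlapping flow distributions of concurrent traffic; the function name, parameters, and all numbers here are illustrative, not the testbed's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def regime_accuracy(class_sep, seed=0):
    # Same model, same features, same split logic; only the regime
    # (how separable the classes are) changes between runs.
    X, y = make_classification(n_samples=2000, n_features=20,
                               n_informative=10, n_classes=4,
                               class_sep=class_sep, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

v1 = regime_accuracy(class_sep=1.5)  # clean-room stand-in
v2 = regime_accuracy(class_sep=0.5)  # production-realistic stand-in
print(f"V1 {v1:.3f}  V2 {v2:.3f}  gap {(v1 - v2) * 100:.1f} pp")
```

The point of the sketch is the shape of the experiment, not the numbers: one model, one pipeline, two regimes, and the delta is attributed to the regime alone.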
Mode 01
Decision Trees overfit to clean-room temporal patterns
Sequential attacks produce predictable inter-arrival times. The tree latches onto the cadence. Concurrent attacks scramble the cadence. Accuracy collapses.
Mode 02
Five attack classes work. The other eighteen do not. Naive Bayes's conditional independence assumption breaks when flow distributions overlap.
Mode 03
Stealth SYN scans and slow-and-low blackhole injections produce flow statistics that are statistically indistinguishable from legitimate operations.
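The cadence signal behind Mode 01 can be reproduced in miniature with the standard library: a sequential run has near-constant inter-arrival times, while three interleaved streams scramble them. All timestamps below are synthetic and illustrative.

```python
import random
import statistics

def inter_arrival_cv(timestamps):
    # Coefficient of variation of inter-arrival times:
    # low = regular cadence, high = scrambled cadence.
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return statistics.stdev(gaps) / statistics.mean(gaps)

rng = random.Random(7)

# Sequential clean-room run: one attack at a time, near-constant cadence.
sequential = [i * 0.10 + rng.gauss(0, 0.002) for i in range(500)]

# Concurrent production run: three attack streams with different periods,
# merged into one timeline.
streams = [[i * p + rng.gauss(0, 0.002) for i in range(500)]
           for p in (0.10, 0.07, 0.13)]
concurrent = sorted(t for s in streams for t in s)

cv_seq = inter_arrival_cv(sequential)
cv_con = inter_arrival_cv(concurrent)
print(f"sequential CV {cv_seq:.3f}  concurrent CV {cv_con:.3f}")
```

A tree that splits on inter-arrival statistics separates attacks cleanly in the first case and loses the signal in the second, which is the overfit the mode describes.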
Act III · The Testbed
One e2-standard-4 GCP VM. Mininet network. Ryu controller speaking OpenFlow 1.3. Redis-buffered ingest into time-partitioned PostgreSQL at fifty thousand flows per second. Traffic generators for both legitimate and attack workloads. Flask dashboard. Terraform to bring it up.
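The ingest path can be sketched in stdlib Python: buffer flow events in memory, route each batch to an hourly partition name, and flush once a batch fills. This is a hypothetical stand-in for the Redis-to-PostgreSQL stage; the class name, table-name scheme, and flush threshold are illustrative, not the testbed's actual code.

```python
from collections import defaultdict
from datetime import datetime, timezone

class FlowBuffer:
    """Illustrative stand-in for the buffered ingest stage: batch flow
    rows per hourly partition, flush a partition's batch once full."""

    def __init__(self, flush_size=5000):
        self.flush_size = flush_size
        self.pending = defaultdict(list)  # partition name -> rows
        self.flushed = []                 # (partition, row_count) log

    @staticmethod
    def partition_for(ts: float) -> str:
        # Hourly time partitions, e.g. flow_events_2026010112.
        hour = datetime.fromtimestamp(ts, tz=timezone.utc)
        return hour.strftime("flow_events_%Y%m%d%H")

    def add(self, ts: float, row: dict):
        part = self.partition_for(ts)
        self.pending[part].append(row)
        if len(self.pending[part]) >= self.flush_size:
            # Real pipeline: a bulk INSERT into the PostgreSQL partition.
            self.flushed.append((part, len(self.pending.pop(part))))

buf = FlowBuffer(flush_size=3)
for i in range(7):
    buf.add(ts=1_700_000_000 + i, row={"pkts": i})
print(buf.flushed)  # two flushed batches of 3; one row still pending
```

Batching per partition is what lets a single VM sustain high ingest rates: writes land as bulk inserts into one hot partition instead of row-at-a-time inserts scattered across the table.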
- `terraform apply` brings up the VM with all dependencies pre-installed.
- `scripts/run_experiment.py` for V1 sequential attacks.
- `scripts/run_experiment_v2.py` for V2 concurrent attacks.
- `ml/train.py` and `ml/train_v2.py` retrain both regimes for a side-by-side accuracy delta.
- Flow data is backed up to `gs://lmsforshantithakur-sdn-flow-data/`.

Act IV · The Citation
If this testbed or the dataset informs your research, please cite the manuscript-in-preparation. The full dataset is available on request for academic collaboration.
@article{chauhan2026beyond,
title = {Beyond Clean-Room Evaluation: Measuring the {SDN} {IDS}
Accuracy Gap Under Production-Realistic Attack Conditions},
author = {Chauhan, Riya and Sonia and Mohan, Divya},
year = {2026},
note = {Manuscript in preparation for IEEE Transactions on
Information Forensics and Security (TIFS)}
}
Act V · Proof
Password-protected dashboard. Real-time flow stats. Per-attack accuracy. Academic paper rendering template included so the result is browsable like a publication.
SYN flood, UDP flood, port scan, stealth SYN scan, blackhole flow rule injection, flow-table overflow. V2 variants run concurrently.
Random Forest, Decision Tree, Naive Bayes, K-Nearest Neighbours, Gradient Boosting. Identical preprocessing. Identical features. Identical splits.
Hourly partitions. Fifty thousand flows per second sustained ingest. Backed up to GCS for reproducibility.
The general evaluation-pitfalls critique made specific to SDN flow-based detection. The first controlled SDN-specific measurement of the gap.
Code is open. Dataset is shared on request for academic collaboration. Commercial use requires written permission from the authors.
I build and measure security systems honestly. Bench-grade numbers are easy. Production numbers are the only ones that matter. The testbed exists so the field stops pretending the two are the same.