Every SDN intrusion-detection paper claims 95 to 99 percent accuracy on a benchmark dataset. Production drops it to 70. Nobody publishes that gap. This testbed measures it: 3.5 million flow events, 23 attack classes, sequential clean-room versus concurrent production-realistic, on the same models, on the same VM, on the same day.
A controlled measurement of the gap for which methodology, not models, is responsible.
Act I · The Abstract
SDN intrusion-detection papers routinely report ninety-five to ninety-nine percent accuracy on benchmark datasets. We demonstrate that these results are artifacts of clean-room evaluation, not genuine detection capability. On 3.5 million real flow events across 23 attack classes, the best-performing model's accuracy drops from 79.04 percent to 71.10 percent, a 7.94-percentage-point decline caused entirely by evaluation methodology. This work extends the evaluation critique of Arp et al. (USENIX Security 2022) to SDN flow-based detection.
Act II · The Result
Five classifiers. Two evaluation regimes. One labelled dataset. The drop is not the model's failure. It is the field's.
| Regime | Best classifier | Accuracy | Conditions |
|---|---|---|---|
| V1 · Clean-room | Random Forest | 79.04 % | Sequential single-attack runs. Single switch. Unencrypted. Standard benchmark conditions. |
| V2 · Production-realistic | Random Forest | 71.10 % | Concurrent attacks. Multi-switch topology. Encrypted traffic. False-positive-inducing legitimate operations. |
| Gap | Same model, same data | −7.94 pp | Caused entirely by evaluation methodology, not model capability. |
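The comparison protocol can be sketched with scikit-learn on synthetic stand-in data, since the real flow dataset is available only on request. Lowering `class_sep` is a loose proxy for the overlapping flow distributions of concurrent traffic; the function name, parameters, and all numbers here are illustrative, not the testbed's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def regime_accuracy(class_sep, seed=0):
    # Same model, same features, same split logic; only the regime
    # (how separable the classes are) changes between runs.
    X, y = make_classification(n_samples=2000, n_features=20,
                               n_informative=10, n_classes=4,
                               class_sep=class_sep, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

v1 = regime_accuracy(class_sep=1.5)  # clean-room stand-in
v2 = regime_accuracy(class_sep=0.5)  # production-realistic stand-in
print(f"V1 {v1:.3f}  V2 {v2:.3f}  gap {(v1 - v2) * 100:.1f} pp")
```

The point of the sketch is the shape of the experiment, not the numbers: one model, one pipeline, two regimes, and the delta is attributed to the regime alone.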
Mode 01
Decision Trees overfit to clean-room temporal patterns
Sequential attacks produce predictable inter-arrival times. The tree latches onto the cadence. Concurrent attacks scramble the cadence. Accuracy collapses.
Mode 02
Five attack classes work. The other eighteen do not. Naive Bayes's conditional independence assumption breaks when flow distributions overlap.
Mode 03
Stealth SYN scans and slow-and-low blackhole injections produce flow statistics that are statistically indistinguishable from legitimate operations.
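The cadence signal behind Mode 01 can be reproduced in miniature with the standard library: a sequential run has near-constant inter-arrival times, while three interleaved streams scramble them. All timestamps below are synthetic and illustrative.

```python
import random
import statistics

def inter_arrival_cv(timestamps):
    # Coefficient of variation of inter-arrival times:
    # low = regular cadence, high = scrambled cadence.
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return statistics.stdev(gaps) / statistics.mean(gaps)

rng = random.Random(7)

# Sequential clean-room run: one attack at a time, near-constant cadence.
sequential = [i * 0.10 + rng.gauss(0, 0.002) for i in range(500)]

# Concurrent production run: three attack streams with different periods,
# merged into one timeline.
streams = [[i * p + rng.gauss(0, 0.002) for i in range(500)]
           for p in (0.10, 0.07, 0.13)]
concurrent = sorted(t for s in streams for t in s)

cv_seq = inter_arrival_cv(sequential)
cv_con = inter_arrival_cv(concurrent)
print(f"sequential CV {cv_seq:.3f}  concurrent CV {cv_con:.3f}")
```

A tree that splits on inter-arrival statistics separates attacks cleanly in the first case and loses the signal in the second, which is the overfit the mode describes.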
Act III · The Testbed
One e2-standard-4 GCP VM. Mininet network. Ryu controller speaking OpenFlow 1.3. Redis-buffered ingest into time-partitioned PostgreSQL at fifty thousand flows per second. Traffic generators for both legitimate and attack workloads. Flask dashboard. Terraform to bring it up.
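The ingest path can be sketched in stdlib Python: buffer flow events in memory, route each batch to an hourly partition name, and flush once a batch fills. This is a hypothetical stand-in for the Redis-to-PostgreSQL stage; the class name, table-name scheme, and flush threshold are illustrative, not the testbed's actual code.

```python
from collections import defaultdict
from datetime import datetime, timezone

class FlowBuffer:
    """Illustrative stand-in for the buffered ingest stage: batch flow
    rows per hourly partition, flush a partition's batch once full."""

    def __init__(self, flush_size=5000):
        self.flush_size = flush_size
        self.pending = defaultdict(list)  # partition name -> rows
        self.flushed = []                 # (partition, row_count) log

    @staticmethod
    def partition_for(ts: float) -> str:
        # Hourly time partitions, e.g. flow_events_2026010112.
        hour = datetime.fromtimestamp(ts, tz=timezone.utc)
        return hour.strftime("flow_events_%Y%m%d%H")

    def add(self, ts: float, row: dict):
        part = self.partition_for(ts)
        self.pending[part].append(row)
        if len(self.pending[part]) >= self.flush_size:
            # Real pipeline: a bulk INSERT into the PostgreSQL partition.
            self.flushed.append((part, len(self.pending.pop(part))))

buf = FlowBuffer(flush_size=3)
for i in range(7):
    buf.add(ts=1_700_000_000 + i, row={"pkts": i})
print(buf.flushed)  # two flushed batches of 3; one row still pending
```

Batching per partition is what lets a single VM sustain high ingest rates: writes land as bulk inserts into one hot partition instead of row-at-a-time inserts scattered across the table.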
- `terraform apply` brings up the VM with all dependencies pre-installed.
- `scripts/run_experiment.py` for V1 sequential attacks.
- `scripts/run_experiment_v2.py` for V2 concurrent attacks.
- `ml/train.py` and `ml/train_v2.py` retrain both regimes for a side-by-side accuracy delta.
- Flow data is backed up to `gs://lmsforshantithakur-sdn-flow-data/`.

Act IV · The Citation
If this testbed or the dataset informs your research, please cite the manuscript-in-preparation. The full dataset is available on request for academic collaboration.
@article{chauhan2026beyond,
title = {Beyond Clean-Room Evaluation: Measuring the {SDN} {IDS}
Accuracy Gap Under Production-Realistic Attack Conditions},
author = {Chauhan, Riya and Sonia and Mohan, Divya},
year = {2026},
note = {Manuscript in preparation for IEEE Transactions on
Information Forensics and Security (TIFS)}
}
Act V · Proof
Password-protected dashboard. Real-time flow stats. Per-attack accuracy. Academic paper rendering template included so the result is browsable like a publication.
SYN flood, UDP flood, port scan, stealth SYN scan, blackhole flow rule injection, flow-table overflow. V2 variants run concurrently.
Random Forest, Decision Tree, Naive Bayes, K-Nearest Neighbours, Gradient Boosting. Identical preprocessing. Identical features. Identical splits.
Hourly partitions. Fifty thousand flows per second sustained ingest. Backed up to GCS for reproducibility.
The general evaluation-pitfalls critique made specific to SDN flow-based detection. The first controlled SDN-specific measurement of the gap.
Code is open. Dataset is shared on request for academic collaboration. Commercial use requires written permission from the authors.
I build and measure security systems honestly. Bench-grade numbers are easy. Production numbers are the only ones that matter. The testbed exists so the field stops pretending the two are the same.