11:30 - 11:45 Best Short Paper
CGSim: A Simulation Framework for Large Scale Distributed Computing Environment
Sairam Sri Vatsavai, Kuan-Chieh Hsu, Ozgur Kilic, Yihui (Ray) Ren, David Park, Paul Nilsson, Sankha Dutta, Tasnuva Chowdhury, Adolfy Hoisie, Tadashi Maeno, Shinjae Yoo, Alexei Klimentov
Brookhaven National Laboratory, USA
Raees Khan Ahmed, Tania Korchuganova, Joseph Boudreau
University of Pittsburgh, USA
Shengyu Feng, Yiming Yang
Carnegie Mellon University, USA
Fatih Furkan Akman, Verena Ingrid Martinez Outschoorn, John Rembrandt (Remy) Steele
University of Massachusetts, USA
Scott Klasky, Norbert Podhorszki, Fred Suter
Oak Ridge National Laboratory, USA
Wei Yang
SLAC National Accelerator Laboratory, USA
Large-scale distributed computing infrastructures like the Worldwide LHC Computing Grid (WLCG) require comprehensive simulation tools for performance evaluation and resource optimization. Existing simulators suffer from limited scalability, hardwired algorithms, a lack of real-time monitoring, and an inability to generate datasets suitable for machine learning. We present CGSim, a simulation framework addressing these limitations. Built on the validated SimGrid framework, CGSim provides high-level abstractions for modeling heterogeneous grid environments while maintaining accuracy and scalability. Key features include a modular plugin mechanism for testing custom workflow policies, interactive real-time visualization dashboards, and automatic generation of event-level datasets for AI-assisted performance modeling. Comprehensive evaluation using production ATLAS PanDA workloads demonstrates significant calibration accuracy improvements across WLCG sites. Scalability experiments show near-linear scaling for multi-site simulations, with distributed workloads achieving 6× better performance than single-site execution. CGSim enables researchers to simulate WLCG-scale infrastructures with hundreds of sites and thousands of concurrent jobs on commodity hardware within practical time budgets.
