Haoran Qiu, Weichao Mao, et al.
USENIX ATC 2023
Training-as-a-service platforms let users deploy pre-configured Generative AI training jobs as batch workloads. Because the configuration is immutable, there is little flexibility to adapt dynamically to training progress. Existing approaches invariably involve manually monitoring training progress on a dashboard, and the stop-reconfigure-restart cycle does not scale with the number of experiments. Relying on pre-configuration wastes computational resources and makes debugging of training jobs difficult. We address this gap with our training-control-as-code paradigm, which allows users to run user-defined code that analyzes the training state and intervenes, flagging anomalies and reducing resource wastage. Our framework, TrAC, offers a declarative interface for specifying the desired control logic and reusing it at scale. Using real-world open-source data and models, we provide estimates of the time and resource savings due to TrAC. We also provide a demo video (https://youtu.be/RmhBfFjd1oA) and code (https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/examples/trainercontroller_configs/Readme.md).
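The training-control-as-code idea maps naturally onto trainer callbacks. The sketch below is a hypothetical illustration, not TrAC's actual interface: it assumes a HuggingFace-style `TrainerCallback` (the Trainer stack that fms-hf-tuning builds on), and the `LossSpikeController` class name, window size, and spike threshold are invented for the example.

```python
# Minimal sketch of user-defined training control, NOT TrAC's API:
# a HuggingFace-style TrainerCallback that inspects logged metrics and
# stops the run when the training loss spikes above its recent average.
from collections import deque

from transformers import TrainerCallback


class LossSpikeController(TrainerCallback):
    """Stop training if the latest loss exceeds the recent average by a factor."""

    def __init__(self, window: int = 20, spike_factor: float = 3.0):
        self.recent_losses = deque(maxlen=window)  # sliding window of losses
        self.spike_factor = spike_factor           # anomaly threshold (assumed)

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs or "loss" not in logs:
            return control
        loss = logs["loss"]
        if len(self.recent_losses) == self.recent_losses.maxlen:
            avg = sum(self.recent_losses) / len(self.recent_losses)
            if loss > self.spike_factor * avg:
                # Intervene: flag the anomaly and end the job early to avoid
                # wasting further compute on a diverging run.
                print(f"Loss spike detected ({loss:.3f} vs avg {avg:.3f}); stopping.")
                control.should_training_stop = True
        self.recent_losses.append(loss)
        return control


# Usage (illustrative): Trainer(..., callbacks=[LossSpikeController()])
```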
Mehant Kammakomati, Sameer Pimparkhede, et al.
ACL 2025
Apoorve Mohan, Matthew Sheard
NVIDIA GTC 2022
Kaoru Shinkawa, Ai Ishida, et al.
ASE 2025