Tao Luo
I am a CS Ph.D. candidate at the University of Pennsylvania (defending 2026), advised by Profs. Boon Thau Loo and Vincent Liu. I build orchestration and scheduling systems for LLM workloads, from agentic RL post-training to LLM inference.
I am seeking full-time industry roles starting in 2026.
At Alibaba, I designed and shipped Partial Overlapping, a GPU scheduling optimization in ROLL that was critical to production-scale agentic RL training. The work is featured in the ROME technical report and used by multiple products operating at the scale of 100B+ parameters and thousands of GPUs. I also open-sourced RLix, drawing interest from engineers at NVIDIA, Google, xAI, Anthropic, ByteDance, Zhipu AI, and others. My work spans vLLM, Megatron-LM, and Ray.
My work has appeared at OSDI, SOSP, and SoCC. During my Ph.D. at Penn, I led ParaFlex, a multiplexed heterogeneous LLM serving system that eliminates head-of-line blocking via stage-aligned parallelism. Earlier, during my M.S. at Columbia University, I introduced Privacy Budget Scheduling and developed DPF, the first scheduling algorithm for ML training under differential privacy constraints.
Before academia, I developed quantitative investment algorithms in finance. I hold a B.S. in Financial Mathematics from Southern University of Science and Technology.
Selected Projects
GPU Scheduling and RL Infrastructure @Alibaba, DAMO Academy
- Proposed Partial Overlapping, a scheduling mechanism for asynchronous agentic RL that reassigns idle training GPUs to rollout workers, improving rollout throughput by 3.5x.
- Implemented Partial Overlapping entirely through AI-assisted coding (prompting in English and Chinese); it shipped as a high-priority feature in alibaba/ROLL and was the first feature there built with zero human-written code.
- Partial Overlapping is used in production for RL training of models with 100s of billions of parameters on 1000s of GPUs, including Qoder IDE (coding), iFlow CLI (coding), Amap (travel planning), and Alimama (ads).
- Extended Partial Overlapping to async multi-LoRA RL via per-adapter optimizers on a shared Megatron base model.
- Built RLix, an orchestration layer for concurrent LLM RL that enables elastic GPU sharing across pipelines with minimal changes to training recipes; attracted interest from engineers at NVIDIA, Google, xAI, Anthropic, ByteDance, Zhipu AI, and others.
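The core idea behind Partial Overlapping can be sketched as a lending protocol: when the trainer is blocked waiting for rollout data, its idle GPUs are temporarily reassigned to the rollout pool and reclaimed once a training batch is ready. The sketch below is a toy model of that protocol only; the class and method names are mine, not the ROLL API.

```python
from dataclasses import dataclass, field

@dataclass
class GpuPool:
    """Toy model of a cluster partitioned between training and rollout workers."""
    train_gpus: set = field(default_factory=lambda: set(range(0, 8)))
    rollout_gpus: set = field(default_factory=lambda: set(range(8, 16)))
    lent: set = field(default_factory=set)  # train GPUs currently lent out

    def lend_idle_train_gpus(self, trainer_busy: bool) -> None:
        # Trainer is blocked waiting on rollouts: reassign its GPUs to
        # the rollout workers so generation throughput rises.
        if not trainer_busy and self.train_gpus:
            self.lent |= self.train_gpus
            self.rollout_gpus |= self.train_gpus
            self.train_gpus = set()

    def reclaim_for_training(self) -> None:
        # A training batch is ready: take the lent GPUs back.
        self.rollout_gpus -= self.lent
        self.train_gpus |= self.lent
        self.lent = set()
```

In the real system the interesting engineering is in making the handoff cheap (weights, KV caches, NCCL communicators); this sketch only captures the bookkeeping.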
ParaFlex: Multiplexed Heterogeneous LLM Serving via Stage-Aligned Parallelism @University of Pennsylvania
- Proposed a novel LLM serving architecture that eliminates head-of-line blocking and improves token throughput by 1.6x.
- Built multi-model KV cache management and robust NCCL concurrency controls.
- Optimized sharding, replication, placement, and scheduling algorithms for heterogeneous serving workloads.
- SoCC’25 paper
Privacy Budget Scheduling in ML Training @Columbia University
- Introduced Privacy Budget Scheduling and showed how to schedule 2x more jobs than FCFS under the same privacy budget.
- Developed DPF (Dominant Private Block Fairness), the first scheduling algorithm for ML training under differential privacy constraints, derived from Dominant Resource Fairness (DRF), and proved its game-theoretic properties.
- OSDI’21 paper
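DPF carries DRF's dominant-share idea over to per-block privacy budgets: each job demands some epsilon from a subset of data blocks and is prioritized by the largest fraction of any single block's budget it requests. A minimal sketch of that ordering, with simplified all-or-nothing granting (block names and the granting loop are illustrative, not the paper's exact algorithm):

```python
# Each data block has a total privacy budget (epsilon); each job demands
# epsilon from some blocks. Jobs are considered in ascending order of their
# dominant share: the largest fraction of any block's budget they request.

BLOCK_CAPACITY = {"day1": 1.0, "day2": 1.0}  # total epsilon per data block
remaining = dict(BLOCK_CAPACITY)             # budget still unallocated

def dominant_share(demand: dict) -> float:
    """Largest fraction of any single block's budget this job asks for."""
    return max(eps / BLOCK_CAPACITY[b] for b, eps in demand.items())

def schedule(jobs: list) -> list:
    """jobs: list of (name, {block: epsilon}); returns names granted."""
    granted = []
    for name, demand in sorted(jobs, key=lambda j: dominant_share(j[1])):
        # Grant only if every demanded block still has enough budget.
        if all(remaining[b] >= eps for b, eps in demand.items()):
            for b, eps in demand.items():
                remaining[b] -= eps
            granted.append(name)
    return granted
```

For example, two jobs each demanding 0.3 epsilon of `day1` are granted ahead of one demanding 0.8, which is then denied because only 0.4 remains; under FCFS the large job could have starved both small ones, which is the intuition behind the 2x job-throughput result above.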
Honors & Service
- Program Committee: ACM Symposium on Cloud Computing 2025
- Manjushri Fellowship, University of Pennsylvania, 2021
- Financial Risk Manager (FRM) Certification, 2015
- China Merchants Bank Scholarship, 2012-2014
- Pioneering Undergraduate Fellowship, 2011-2014
- First Prize, China High School Biology Olympiad, 2010
