Tao Luo
I am a CS Ph.D. candidate at the University of Pennsylvania (defending 2026), advised by Profs. Boon Thau Loo and Vincent Liu.
My research is on GPU scheduling and post-training infrastructure for extreme-scale LLMs, with a focus on agentic RL and heterogeneous LLM serving. I advanced GPU resource allocation in Alibaba’s ROLL framework, with production systems deployed across 1000+ GPUs training 100B+ parameter models.
Previously, as an M.S. student at Columbia University, I coined the term privacy budget scheduling and led the first study on scheduling ML training under differential privacy constraints. I was advised by Prof. Asaf Cidon and collaborated with Profs. Ethan Katz-Bassett, Ryan Stutsman, Mathias Lécuyer, and Roxana Geambasu.
Before academia, I developed quantitative investment algorithms in the financial industry. I hold a B.S. in Financial Mathematics from Southern University of Science and Technology, as a member of its founding cohort.
Selected Projects
GPU Scheduling for Agentic RL @Alibaba
- Designed and implemented a Partial Overlapping GPU scheduling algorithm for asynchronous agentic RL, which reassigns idle training GPUs to rollout workers to maximize GPU utilization.
- Enabled concurrent multi-LoRA RL via per-adapter optimizer states on a shared Megatron base model with cross-engine weight synchronization.
- Architected multi-tenant RL scheduling: decoupled per-job training logic from global GPU allocation via heartbeat progress, versioned checkpoints, and selective weight syncing (open-source release in progress: rlops/rlix).
- Deployed in production (100B+ parameters, 1000+ GPUs): Qoder IDE (coding), iFlow CLI (coding), Amap (travel planning), and Alimama (ads).
- Pioneered vibe coding in production: Partial Overlapping was the first feature shipped from first commit to production with zero human-written code, via disciplined human-AI collaboration loops (English/Chinese).
ParaFlex: Multiplexed Heterogeneous LLM Serving via Stage-Aligned Parallelism @UPenn
- Eliminated head-of-line blocking via a novel LLM serving architecture, raising token throughput by 1.6×.
- Built efficient multi-model KV cache management and robust NCCL concurrency controls.
- Optimized sharding, replication, placement, and scheduling strategies.
- SoCC’25 paper
Privacy Budget Scheduling in ML Training @Columbia
- Scheduled 2× more jobs than FCFS under identical privacy budgets.
- Proposed DPF (Dominant Private Block Fairness), a dynamic algorithm based on DRF (Dominant Resource Fairness).
- Developed rigorous proofs of the new algorithm's game-theoretic properties.
- OSDI’21 paper
Honors & Service
- Program Committee: ACM Symposium on Cloud Computing 2025
- Manjushri Fellowship, University of Pennsylvania, 2021
- Financial Risk Manager (FRM) Certification, 2015
- China Merchants Bank Scholarship, 2012-2014
- Pioneering Undergraduate Fellowship, 2011-2014
- First Prize, China High School Biology Olympiad, 2010
