Tao Luo

I am a CS Ph.D. candidate at the University of Pennsylvania (defending 2026), advised by Boon Thau Loo and Vincent Liu. I build AI agent infrastructure across RL post-training, LLM inference, and retrieval systems.

I am seeking full-time industry roles starting in 2026.

At Alibaba, I shipped Partial Overlapping, a scheduler for asynchronous agentic RL that delivers 3.5x rollout throughput; it now powers production post-training of agent models with hundreds of billions of parameters on thousands of GPUs. I also pioneered AI-assisted systems engineering on the team: Partial Overlapping shipped with zero human-written code, built entirely with coding agents (Claude Code, Codex), a first for Alibaba’s flagship post-training framework. I founded and lead RLix, an open-source orchestration layer for concurrent agentic RL pipelines (2.6x rollout throughput in SWE-agent RL training).

During my Ph.D. at Penn, I designed and built ParaFlex, a multiplexed heterogeneous LLM serving system that eliminates head-of-line blocking via stage-aligned parallelism. Earlier, during my M.S. at Columbia University, I introduced Privacy Budget Scheduling and developed DPF, the first scheduling algorithm for ML training under differential-privacy constraints. I also work on retrieval systems, with research spanning vector search, indexing (ScaleGANN), and query optimization (DeSCO). My work has appeared at OSDI, SOSP, and SoCC.

Before academia, I spent about four years in quantitative investment, developing trading strategies and building research infrastructure. I hold a B.S. in Financial Mathematics from Southern University of Science and Technology.

Projects

Partial Overlapping: GPU Scheduler for Asynchronous Agentic RL @Alibaba, DAMO Academy

  • Identified training-GPU idle time as the dominant bottleneck in async agentic RL; proposed Partial Overlapping, a scheduler that places rollouts on idle training GPUs, and led it from design to production, delivering 3.5x rollout throughput. Later extended it to async multi-LoRA fine-tuning on a shared Megatron-LM backbone.
  • Shipped into alibaba/ROLL; now powers production agentic RL post-training of models with hundreds of billions of parameters on thousands of GPUs, including Qoder IDE (coding agent), iFlow CLI (terminal agent), Amap (travel-planning agent), and Alimama (ads).
  • Contributed to the ROME model launch; work featured in the ROME technical report, as part of Alibaba’s open-source Agentic Learning Ecosystem (ALE).
  • Pioneered AI-assisted systems engineering on the team: built Partial Overlapping primarily using coding agents (Claude Code, Codex) with zero human-written code (first high-priority feature in alibaba/ROLL shipped this way); headlined the team’s public (English/Chinese) and internal technical blog posts as a case study for AI-assisted systems engineering.
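The placement idea above can be sketched as a toy scheduler. This is an illustrative simplification with hypothetical names, not the production implementation; it ignores memory limits, weight loading, and real preemption costs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GPU:
    gpu_id: int
    training: bool = False          # currently inside a training phase?
    rollout: Optional[str] = None   # rollout task currently placed here

class PartialOverlapScheduler:
    """Toy sketch: place rollout tasks on training GPUs while they sit
    idle between training phases, and preempt them when training resumes."""

    def __init__(self, gpus):
        self.gpus = gpus
        self.pending = []  # rollout tasks waiting for a GPU

    def submit(self, task):
        self.pending.append(task)
        self._dispatch()

    def _dispatch(self):
        # Fill every GPU that is neither training nor already hosting a rollout.
        for gpu in self.gpus:
            if not gpu.training and gpu.rollout is None and self.pending:
                gpu.rollout = self.pending.pop(0)

    def begin_training(self):
        # Training has strict priority: evict rollouts back to the queue head.
        evicted = []
        for gpu in self.gpus:
            if gpu.rollout is not None:
                evicted.append(gpu.rollout)
                gpu.rollout = None
            gpu.training = True
        self.pending = evicted + self.pending

    def end_training(self):
        # Training phase over: the GPUs become fair game for rollouts again.
        for gpu in self.gpus:
            gpu.training = False
        self._dispatch()
```

The real system additionally has to decide how much rollout work fits in the leftover GPU memory next to the training state, which is where most of the engineering lives.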

RLix: Orchestration Layer for Concurrent Agentic RL Pipelines @Alibaba, DAMO Academy

  • Founded and lead RLix, an orchestration layer for concurrent agentic RL pipelines (stargazers include NVIDIA, Google, xAI, Anthropic, ByteDance, Zhipu AI).
  • Delivers 2.6x rollout throughput in SWE-agent RL training via elastic GPU sharing, with minimal changes to training recipes.
  • Designed the priority-based scheduling algorithm: rollout runs as the lowest-priority preemptible stage on idle GPUs, yielding to higher-priority stages (actor/critic training, log-probs) on demand.
  • Built the selective weight-sync mechanism: syncs latest weights only to scheduled rollout workers, keeping each RL pipeline’s memory footprint minimal.
  • Define and drive the roadmap, working with open-source community developers on in-progress feature contributions.
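The selective weight-sync mechanism above can be sketched with a version counter; the class and method names here are hypothetical stand-ins for the real transfer machinery.

```python
class SelectiveWeightSync:
    """Toy sketch: push new weights only to workers that are actually
    scheduled to roll out, instead of broadcasting to every worker."""

    def __init__(self):
        self.latest_version = 0
        self.worker_version = {}  # worker_id -> last synced weight version

    def publish(self):
        # Trainer publishes a new checkpoint version after an update step.
        self.latest_version += 1

    def schedule(self, worker_id):
        # Sync lazily: only when this worker is picked for a rollout
        # and its cached weights are stale. Returns whether a transfer
        # would have been triggered.
        if self.worker_version.get(worker_id, -1) < self.latest_version:
            self.worker_version[worker_id] = self.latest_version
            return True
        return False
```

The point of the lazy check is that workers which are never scheduled in a given window never hold (or receive) that window's weights, keeping each pipeline's memory footprint proportional to its active rollout set.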

Heterogeneous Multi-Model LLM Serving at Scale @University of Pennsylvania

  • Designed and built a multiplexed serving system with stage-aligned parallelism that eliminates head-of-line blocking, increasing token throughput by 1.6x while reducing median latency.
  • Extended vLLM with multi-model KV cache management, NCCL concurrency controls, and Ray-based distributed execution.
  • Developed algorithms for efficient model sharding, replication, placement, and scheduling across heterogeneous serving workloads.
  • ParaFlex, SoCC’25.
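The head-of-line-blocking effect that multiplexing avoids can be illustrated with a toy one-slot latency calculation; this is only an illustration of the phenomenon, not ParaFlex's actual scheduler.

```python
def completion_times(service_order):
    """Serve requests back-to-back on one slot; return (name, finish_time)."""
    t, done = 0, []
    for name, cost in service_order:
        t += cost
        done.append((name, t))
    return done

# One long-running model (A) and one latency-sensitive model (B)
# sharing a single serving slot.
long_reqs  = [("A", 10), ("A", 10)]
short_reqs = [("B", 1), ("B", 1)]

# Single FIFO: all of A's work lands ahead of B (head-of-line blocking).
fifo = completion_times(long_reqs + short_reqs)

# Interleaving at aligned stage boundaries lets B's requests slip in
# between A's units of work.
interleaved = completion_times([long_reqs[0], short_reqs[0],
                                long_reqs[1], short_reqs[1]])

b_fifo  = [t for n, t in fifo if n == "B"]         # [21, 22]
b_inter = [t for n, t in interleaved if n == "B"]  # [11, 22]
```

Total work is identical in both orders; only the interleaving changes, which is why median latency can drop without sacrificing throughput.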

Query Optimization for Declarative Smart Contracts @UPenn

  • Framed efficiency of Datalog-compiled smart contracts as a view selection problem under a non-standard, history-dependent cost model (Ethereum gas).
  • Designed and implemented a selective view materialization algorithm with simplification-based pruning; formally proved correctness and pruning completeness.
  • Reduced storage gas by ~78% and total gas by >50% over naive compilation, matching expert hand-tuned Solidity on a benchmark of widely deployed contracts.
  • DeSCO, FAB’24 (co-located with VLDB).
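The materialize-vs-recompute tradeoff at the heart of view selection can be sketched with flat gas constants; these numbers are illustrative, and DeSCO's actual cost model is history-dependent and considerably more detailed.

```python
# Illustrative flat gas costs for EVM storage operations.
SSTORE_GAS = 20_000  # write one storage slot
SLOAD_GAS  = 2_100   # read one storage slot

def materialize_is_cheaper(n_updates, n_queries, slots, recompute_gas):
    """Materialize a view iff maintaining it on every update plus reading
    it on every query costs less gas than recomputing it on demand."""
    materialized = (n_updates * slots * SSTORE_GAS
                    + n_queries * slots * SLOAD_GAS)
    on_demand = n_queries * recompute_gas
    return materialized < on_demand
```

A read-heavy view (few updates, many queries) favors materialization, while an update-heavy one favors on-demand recomputation; the selection problem is choosing this per view across the whole contract.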

Privacy-Preserving Scheduling for ML Training @Columbia University

  • Designed the first fair-allocation scheduling algorithm for ML training under differential-privacy constraints.
  • Improved job throughput by 2x over FCFS under the same privacy budget, verified in large-scale simulations; proved efficiency and fairness guarantees.
  • Privacy Budget Scheduling, OSDI’21
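A loose sketch of dominant-share-style granting under per-block privacy budgets; this simplifies the DPF algorithm from the paper (no arrival dynamics or gradual unlocking of budget), and all names are hypothetical.

```python
def dpf_schedule(capacity, remaining, jobs):
    """Grant jobs in order of smallest dominant share: a job's largest
    demanded fraction of any single block's total budget `capacity`.
    A job is granted only if every block it touches still has enough
    privacy budget remaining.

    remaining: {block_id: epsilon left}
    jobs:      {job_id: {block_id: epsilon demanded}}
    """
    order = sorted(jobs,
                   key=lambda j: max(e / capacity for e in jobs[j].values()))
    granted = []
    for job in order:
        demand = jobs[job]
        if all(remaining[b] >= e for b, e in demand.items()):
            for b, e in demand.items():
                remaining[b] -= e  # consume budget (irreversible under DP)
            granted.append(job)
    return granted
```

Favoring small dominant shares keeps one budget-hungry job from starving many small ones, which is the intuition behind the fairness guarantee.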

Honors & Service

  • Program Committee: ACM Symposium on Cloud Computing 2025
  • Manjushri Fellowship, University of Pennsylvania, 2021
  • China Merchants Bank Scholarship, 2012-2014
  • Pioneering Undergraduate Fellowship, 2011-2014