Kubernetes x JobSet: How Co-evolution Makes AI Job Restarts 10× Faster¶
In the fast-moving world of AI infrastructure, a powerful synergy is emerging: the Kubernetes community develops core capabilities, while downstream projects such as JobSet, Ray, and LeaderWorkerSet (LWS) adopt these features to deliver dramatic efficiency gains. We call this co-evolution—the entire ecosystem moving forward together.
Kubernetes has recently introduced a growing set of AI-related capabilities. However, to fully unlock their potential for AI workloads, other projects must adapt to them. Today, we explore a representative example:
JobSet cuts restart time by 92% by leveraging Kubernetes in-place container restarts.
The Problem: Slow JobSet Restarts¶
When a distributed training job running on JobSet needs to restart—due to transient failures, configuration updates, or checkpoint recovery—the traditional approach involves:
- Deleting all Pods in the JobSet
- Waiting for Pod termination to complete
- Re-scheduling all Pods via the Kubernetes scheduler
- Waiting for Pods to start (including image pulls, init containers, etc.)
In a large-scale cluster with 5,000 nodes, this process takes about 2 minutes and 10 seconds. For AI/ML workloads where fast recovery is critical, this overhead is significant.
The Solution: In-Place Container Restarts¶
Kubernetes has introduced capabilities that allow containers to restart without recreating the Pod:
KEP-5307: Container Restart Policy (Kubernetes 1.34)¶
KEP-5307 introduces fine-grained control over restart behavior for individual containers within a Pod. This enables:
- Specifying restart policies per container (not just per Pod)
- Triggering container restarts without affecting the entire Pod
- Preserving Pod identity, IP, and volumes during restarts
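As a rough illustration, the sketch below shows the general shape of a per-container restart policy under KEP-5307. The field names (container-level restartPolicy and restartPolicyRules) reflect the alpha API and may change as the feature graduates, so treat this as an assumption to verify against the KEP rather than a definitive example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-demo
spec:
  restartPolicy: Never              # Pod-level policy: exits are terminal by default
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    restartPolicy: Never            # container-level policy (KEP-5307, alpha)
    restartPolicyRules:             # exit-code-based rules (KEP-5307, alpha; names may change)
    - action: Restart               # restart this container in place...
      exitCodes:
        operator: In
        values: [42]                # ...only when it exits with a retriable code
```

The Pod is never recreated: only the matching container process is restarted, so the Pod keeps its name, IP, and mounted volumes.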
KEP-5532: Restart All Containers on Container Exit (Kubernetes 1.35)¶
KEP-5532 extends this capability to coordinated restarts:
- Restarting all containers in a Pod when a specific container exits
- Restarting init containers and sidecars as part of the Pod lifecycle
- Enabling Pod-level restart coordination without Pod recreation
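The exact API surface for KEP-5532 is still settling, so the following is only a hypothetical sketch of what coordinated whole-Pod restarts could look like, reusing the rule shape from KEP-5307 with an invented action name. The real field and action names are defined in the KEP and may differ:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: coordinated-restart-demo
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    restartPolicyRules:
    - action: RestartAllContainers   # hypothetical action name, for illustration only
      exitCodes:
        operator: In
        values: [1]                  # when the trainer fails, restart every container in place
  - name: metrics-sidecar
    image: busybox
    command: ["sh", "-c", "sleep infinity"]
```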
Real-World Results: JobSet In-Place Restarts¶
The JobSet team developed an in-place restart prototype that demonstrates dramatic performance improvements:
| Metric | Traditional Restart | In-Place Restart | Improvement |
|---|---|---|---|
| Restart time | 2 min 10 sec | 10 sec | 92% faster |
| Test scale | 5,000 nodes | 5,000 nodes | – |
| Scheduling overhead | High | None | Eliminated |
| Pod recreation | Required | Not required | Avoided |
For detailed design information, see the JobSet in-place restart design document.
Why This Matters for AI Workloads¶
1. Distributed Training Recovery¶
Large-scale distributed training jobs (PyTorch DDP, TensorFlow MultiWorkerMirroredStrategy) are especially sensitive to restart latency:
- Checkpoint recovery: After a failure, all workers must restart from the latest checkpoint. In-place restarts make worker recovery more than 10× faster (2 min 10 sec → ~10 sec).
- Gradient synchronization: Training can only proceed when all workers are running. Faster restarts mean less wasted GPU time.
- Cost savings: On expensive GPU clusters ($2–10 per GPU-hour), saving 2 minutes per restart quickly adds up. As a rough illustration, a 1,024-GPU job at $3 per GPU-hour burns roughly $100 of idle GPU time during a 2-minute restart.
2. Job Dependencies¶
Many AI pipelines have complex job dependencies. When a job restarts:
- Downstream jobs wait for upstream completion
- Gang scheduling constraints require all workers to be present
- Network connections must be preserved for collective operations
In-place restarts preserve Pod identity and network connections, minimizing disruption to the overall pipeline.
3. Resource Efficiency¶
Traditional restarts involve:
- Scheduler load: Finding nodes for potentially thousands of Pods
- API server load: Creating and deleting Pod objects
- Node preparation: Image pulls, volume mounts, init containers
In-place restarts eliminate this overhead, freeing those resources for the workloads themselves.
How It Works¶
Before: Traditional Restart Flow¶
```text
Trigger job restart
↓
Delete all Pods → wait for termination (30s+)
↓
Create new Pods → wait for scheduling (30s+)
↓
Pull images (if needed) → start containers (60s+)
↓
Total: ~2 min 10 sec
```
After: In-Place Restart Flow¶
```text
Trigger job restart
↓
Send container exit signal → containers restart in place (10s)
↓
Total: ~10 sec
```
Key differences:
- No Pod deletion: Pod objects are preserved, maintaining identity
- No re-scheduling: Pods remain on their current nodes
- No image pulls: Images are already cached on the node
- Immediate restart: Container processes restart directly
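One way to see the difference is in the Pod's status after an in-place restart: the Pod object, name, and IP are unchanged, while the container's restart counter ticks up. The abridged fragment below shows the relevant status fields with illustrative values:

```yaml
# Abridged Pod status after an in-place container restart (illustrative values)
status:
  podIP: 10.0.3.17              # unchanged: the Pod keeps its IP and identity
  containerStatuses:
  - name: trainer
    restartCount: 1             # incremented: the container was restarted in place
    state:
      running:
        startedAt: "2025-06-01T12:00:30Z"   # new start time for the restarted process
```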
Implementation Considerations¶
When to Use In-Place Restarts¶
- Transient failures: Container crashes, OOM kills, network timeouts
- Configuration updates: Restarting to pick up new environment variables
- Checkpoint recovery: Resuming training from saved state
- Rolling restarts: Gracefully restarting workers in sequence
When Traditional Restarts Are Required¶
- Node failures: Pods must move to healthy nodes
- Resource changes: Pods need more or fewer resources (consider VPA)
- Image updates: A new container image is required
- Topology changes: Pods need different placement
Integrating with JobSet¶
JobSet can leverage in-place restarts as follows:
```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: distributed-training
spec:
  replicatedJobs:
  - name: workers
    replicas: 8
    template:
      spec:
        template:
          spec:
            restartPolicy: Always  # Enable in-place restarts
            containers:
            - name: trainer
              image: pytorch/pytorch:latest
```
The Broader Co-evolution Pattern¶
This JobSet improvement is a classic example of co-evolution in cloud-native AI:
| Kubernetes Capability | Project Adoption | Benefit |
|---|---|---|
| In-place restart | JobSet | 92% faster recovery |
| Gang scheduling (1.35) | Kueue, LWS | All-or-nothing placement |
| DRA (1.34 GA) | NVIDIA GPU Operator | Flexible device allocation |
| Workload API (1.35) | Volcano, YuniKorn | Native workload support |
As Kubernetes continues to add AI-friendly features, we expect more projects to adopt them, creating a virtuous cycle of improvement.
Getting Started¶
Prerequisites¶
- Kubernetes 1.34+ (for KEP-5307)
- Kubernetes 1.35+ (for KEP-5532 Pod-level restarts)
- A JobSet version that supports in-place restarts (check the latest release)
Enable Feature Gates¶
```
# Enable KEP-5307 (Container Restart Policy, 1.34+) on kubelet
--feature-gates=ContainerRestartPolicy=true

# Enable KEP-5532 (Restart All Containers, 1.35+) on kubelet
--feature-gates=RestartAllContainersOnContainerExits=true
```
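If the kubelet is configured through a KubeletConfiguration file rather than command-line flags, the same gates can be set there. This is a minimal sketch using the gate names from the flags above; double-check the exact names against the release notes for your Kubernetes version:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ContainerRestartPolicy: true                 # KEP-5307 (1.34+); verify the exact gate name
  RestartAllContainersOnContainerExits: true   # KEP-5532 (1.35+); verify the exact gate name
```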
Test In-Place Restarts¶
- Deploy a JobSet with `restartPolicy: Always`
- Trigger a container restart (e.g., `kubectl exec ... -- kill -TERM 1`)
- Observe the restart time compared to Pod recreation
Future Roadmap¶
In-place restart capabilities continue to evolve:
- KEP-5307 graduation: Moving toward Beta/GA
- KEP-5532 enhancements: More robust Pod-level restart control
- JobSet integration: Native support for in-place restart policies
- Observability: Better visibility into restart events
- Kueue integration: Workload-aware restart handling
Conclusion¶
The JobSet in-place restart optimization showcases the power of co-evolution in the Kubernetes ecosystem. By adopting upstream Kubernetes capabilities, projects can achieve significant performance gains:
- 92% faster restarts (2 min 10 sec → 10 sec)
- Zero scheduling overhead
- Preserved Pod identity and networking
- Reduced API server load
This is just one example of how the Kubernetes community and downstream projects collaborate to improve AI workload efficiency. As more AI-related features land in Kubernetes, we can expect JobSet, Ray, LWS, and others to deliver even more optimizations.
The future of AI infrastructure is co-evolution—and it’s already happening.
References¶
KEPs and Documentation¶
- KEP-5307: Container Restart Policy
- KEP-5532: Restart All Containers on Container Exit
- KEP-1287: In-Place Pod Vertical Scaling
- JobSet In-Place Restart Design Doc
- JobSet In-Place Restart Prototype
Related Projects¶
- JobSet – Kubernetes SIG Apps
- LeaderWorkerSet – Kubernetes SIG Apps
- Kueue – Kubernetes SIG Scheduling
- Volcano – CNCF Incubating