
Kubernetes x JobSet: How Co-evolution Makes AI Job Restarts 10× Faster

In the fast-moving world of AI infrastructure, a powerful synergy is emerging: the Kubernetes community develops core capabilities, while downstream projects such as JobSet, Ray, and LeaderWorkerSet (LWS) adopt these features to deliver dramatic efficiency gains. We call this co-evolution—the entire ecosystem moving forward together.

Kubernetes has recently introduced a growing set of AI-related capabilities. However, to fully unlock their potential for AI workloads, other projects must adapt to them. Today, we explore a representative example:

JobSet cuts restart time by 92% by leveraging Kubernetes in-place container restarts.

The Problem: Slow JobSet Restarts

When a distributed training job running on JobSet needs to restart—due to transient failures, configuration updates, or checkpoint recovery—the traditional approach involves:

  1. Deleting all Pods in the JobSet
  2. Waiting for Pod termination to complete
  3. Re-scheduling all Pods via the Kubernetes scheduler
  4. Waiting for Pods to start (including image pulls, init containers, etc.)

In a large-scale cluster with 5,000 nodes, this process takes about 2 minutes and 10 seconds. For AI/ML workloads where fast recovery is critical, this overhead is significant.

The Solution: In-Place Container Restarts

Kubernetes has introduced capabilities that allow containers to restart without recreating the Pod:

KEP-5307: Container Restart Policy (Kubernetes 1.34)

KEP-5307 introduces fine-grained control over restart behavior for individual containers within a Pod. This enables:

  • Specifying restart policies per container (not just per Pod)
  • Triggering container restarts without affecting the entire Pod
  • Preserving Pod identity, IP, and volumes during restarts
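
As a concrete illustration, here is a minimal sketch of a per-container restart policy. It assumes the alpha API shape described in KEP-5307 for Kubernetes 1.34 (a restartPolicy field allowed on regular containers, behind a feature gate); exact field names and permitted values may differ in your cluster version, and the KEP's exit-code-based restart rules are not shown.

apiVersion: v1
kind: Pod
metadata:
  name: per-container-restart-demo
spec:
  restartPolicy: Never              # Pod-level policy: the Pod object is never recreated on failure
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    command: ["python", "train.py"]   # placeholder command
    # KEP-5307 (alpha in 1.34): per-container override of the Pod-level policy.
    # The kubelet restarts this container in place, preserving the Pod's
    # identity, IP, and volumes. On clusters without the feature, this field
    # is only accepted on init containers (native sidecars).
    restartPolicy: Always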

KEP-5532: Restart All Containers on Container Exit (Kubernetes 1.35)

KEP-5532 extends this capability to coordinated restarts:

  • Restarting all containers in a Pod when a specific container exits
  • Restarting init containers and sidecars as part of the Pod lifecycle
  • Enabling Pod-level restart coordination without Pod recreation
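
The KEP-5532 API surface is still settling, so rather than guessing at field names, the sketch below only shows the kind of Pod it is designed for: a trainer plus a metrics sidecar that should restart as a unit. It uses only fields that exist today; the sidecar image name and command are placeholders.

# Without KEP-5532: if "trainer" exits, the kubelet restarts only "trainer",
# and the sidecar keeps running with stale per-run state.
# With KEP-5532: the Pod can be configured so that a trainer exit restarts
# every container, including the sidecar, without recreating the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: coordinated-restart-demo
spec:
  restartPolicy: Always
  initContainers:
  - name: metrics-sidecar                       # native sidecar: init container with restartPolicy Always
    image: registry.example.com/metrics:latest  # placeholder image
    restartPolicy: Always
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    command: ["python", "train.py"]             # placeholder command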

Real-World Results: JobSet In-Place Restarts

The JobSet team developed an in-place restart prototype that demonstrates dramatic performance improvements:

Metric                 Traditional Restart   In-Place Restart   Improvement
Restart time           2 min 10 sec          10 sec             92% faster
Test scale             5,000 nodes           5,000 nodes        Same
Scheduling overhead    High                  None               Eliminated
Pod recreation         Required              Not required       Avoided

For detailed design information, see the JobSet in-place restart design document.

Why This Matters for AI Workloads

1. Distributed Training Recovery

Large-scale distributed training jobs (PyTorch DDP, TensorFlow MultiWorkerMirroredStrategy) are especially sensitive to restart latency:

  • Checkpoint recovery: After a failure, all workers must restart from the latest checkpoint. In-place restarts make worker recovery more than 10× faster (2 min 10 sec → 10 sec in the 5,000-node test).
  • Gradient synchronization: Training can only proceed when all workers are running. Faster restarts mean less wasted GPU time.
  • Cost savings: On expensive GPU clusters ($2–10 per GPU-hour), saving 2 minutes per restart quickly adds up.
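
To make the cost point concrete, here is a back-of-the-envelope calculation with assumed numbers: a 512-GPU training job at $4 per GPU-hour, using the restart times measured above.

Traditional restart: 512 GPUs × 130 s ≈ 18.5 idle GPU-hours ≈ $74 per restart
In-place restart:    512 GPUs × 10 s  ≈ 1.4 idle GPU-hours  ≈ $6 per restart
At 5 restarts per day, that is roughly $340 saved per day, or about $10,000 per month.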

2. Job Dependencies

Many AI pipelines have complex job dependencies. When a job restarts:

  • Downstream jobs wait for upstream completion
  • Gang scheduling constraints require all workers to be present
  • Network connections must be preserved for collective operations

In-place restarts preserve Pod identity and network connections, minimizing disruption to the overall pipeline.

3. Resource Efficiency

Traditional restarts involve:

  • Scheduler load: Finding nodes for potentially thousands of Pods
  • API server load: Creating and deleting Pod objects
  • Node preparation: Image pulls, volume mounts, init containers

In-place restarts eliminate all of this overhead, reserving resources for actual workloads.
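
As a rough illustration at the scale of the test above, assuming one Pod per node across 5,000 nodes:

Traditional restart: ~5,000 Pod deletions, ~5,000 Pod creations, and ~5,000 scheduling and binding decisions, plus per-node image pulls and volume mounts
In-place restart:    zero Pod object writes and zero scheduling decisions; each kubelet simply restarts its local container processes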

How It Works

Before: Traditional Restart Flow

Trigger job restart
Delete all Pods → wait for termination (30s+)
Create new Pods → wait for scheduling (30s+)
Pull images (if needed) → start containers (60s+)
Total: ~2 min 10 sec

After: In-Place Restart Flow

Trigger job restart
Send container exit signal → containers restart in place (10s)
Total: ~10 sec

Key differences:

  1. No Pod deletion: Pod objects are preserved, maintaining identity
  2. No re-scheduling: Pods remain on their current nodes
  3. No image pulls: Images are already cached on the node
  4. Immediate restart: Container processes restart directly

Implementation Considerations

When to Use In-Place Restarts

  • Transient failures: Container crashes, OOM kills, network timeouts
  • Configuration updates: Restarting to pick up new environment variables
  • Checkpoint recovery: Resuming training from saved state
  • Rolling restarts: Gracefully restarting workers in sequence

When Traditional Restarts Are Required

  • Node failures: Pods must move to healthy nodes
  • Resource changes: Pods need more or fewer resources (consider VPA)
  • Image updates: A new container image is required
  • Topology changes: Pods need different placement

Integrating with JobSet

JobSet can leverage in-place restarts along the lines of the snippet below; the exact fields are still being finalized in the JobSet in-place restart design:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: distributed-training
spec:
  replicatedJobs:
  - name: workers
    replicas: 8
    template:
      spec:
        template:
          spec:
            restartPolicy: Always  # Enable in-place restarts
            containers:
            - name: trainer
              image: pytorch/pytorch:latest

The Broader Co-evolution Pattern

This JobSet improvement is a classic example of co-evolution in cloud-native AI:

Kubernetes Capability    Project Adoption      Benefit
In-place restart         JobSet                92% faster recovery
Gang scheduling (1.35)   Kueue, LWS            All-or-nothing placement
DRA (1.34 GA)            NVIDIA GPU Operator   Flexible device allocation
Workload API (1.35)      Volcano, YuniKorn     Native workload support

As Kubernetes continues to add AI-friendly features, we expect more projects to adopt them, creating a virtuous cycle of improvement.

Getting Started

Prerequisites

  • Kubernetes 1.34+ (for KEP-5307)
  • Kubernetes 1.35+ (for KEP-5532 Pod-level restarts)
  • A JobSet version that supports in-place restarts (check the latest release)

Enable Feature Gates

# Enable KEP-5307 (Container Restart Policy, 1.34+) on kubelet
--feature-gates=ContainerRestartPolicy=true

# Enable KEP-5532 (Restart All Containers, 1.35+) on kubelet
--feature-gates=RestartAllContainersOnContainerExits=true
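
If your kubelets are managed through a configuration file rather than command-line flags, the same gates can be set in the KubeletConfiguration. The gate names below are simply the ones from the flags above; confirm them against the release notes for your Kubernetes version.

# KubeletConfiguration equivalent (file path varies by distribution)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ContainerRestartPolicy: true                  # KEP-5307, 1.34+
  RestartAllContainersOnContainerExits: true    # KEP-5532, 1.35+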

Test In-Place Restarts

  1. Deploy a JobSet with restartPolicy: Always
  2. Trigger a container restart (e.g., kubectl exec ... -- kill -TERM 1)
  3. Observe the restart time compared to Pod recreation

Future Roadmap

In-place restart capabilities continue to evolve:

  • KEP-5307 graduation: Moving toward Beta/GA
  • KEP-5532 enhancements: More robust Pod-level restart control
  • JobSet integration: Native support for in-place restart policies
  • Observability: Better visibility into restart events
  • Kueue integration: Workload-aware restart handling

Conclusion

The JobSet in-place restart optimization showcases the power of co-evolution in the Kubernetes ecosystem. By adopting upstream Kubernetes capabilities, projects can achieve significant performance gains:

  • 92% faster restarts (2 min 10 sec → 10 sec)
  • Zero scheduling overhead
  • Preserved Pod identity and networking
  • Reduced API server load

This is just one example of how the Kubernetes community and downstream projects collaborate to improve AI workload efficiency. As more AI-related features land in Kubernetes, we can expect JobSet, Ray, LWS, and others to deliver even more optimizations.

The future of AI infrastructure is co-evolution—and it’s already happening.
