
Kubernetes x JobSet: How Co-evolution Makes AI Job Restarts 10× Faster

In the fast-moving world of AI infrastructure, a powerful synergy is emerging: the Kubernetes community develops core capabilities, while downstream projects such as JobSet, Ray, and LeaderWorkerSet (LWS) adopt these features to deliver dramatic efficiency gains. We call this co-evolution—the entire ecosystem moving forward together.

Kubernetes has recently introduced a growing set of AI-related capabilities. However, to fully unlock their potential for AI workloads, other projects must adapt to them. Today, we explore a representative example:

JobSet cuts restart time by 92% by leveraging Kubernetes in-place container restarts.

The Problem: Slow JobSet Restarts

When a distributed training job running on JobSet needs to restart—due to transient failures, configuration updates, or checkpoint recovery—the traditional approach involves:

  1. Deleting all Pods in the JobSet
  2. Waiting for Pod termination to complete
  3. Re-scheduling all Pods via the Kubernetes scheduler
  4. Waiting for Pods to start (including image pulls, init containers, etc.)

In a large-scale cluster with 5,000 nodes, this process takes about 2 minutes and 10 seconds. For AI/ML workloads where fast recovery is critical, this overhead is significant.

The Solution: In-Place Container Restarts

Kubernetes has introduced capabilities that allow containers to restart without recreating the Pod:

KEP-5307: Container Restart Policy (Kubernetes 1.34)

KEP-5307 introduces fine-grained control over restart behavior for individual containers within a Pod. This enables:

  • Specifying restart policies per container (not just per Pod)
  • Triggering container restarts without affecting the entire Pod
  • Preserving Pod identity, IP, and volumes during restarts
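
As a concrete illustration, here is a minimal sketch of a per-container restart policy. It assumes the alpha API shape described in KEP-5307 for Kubernetes 1.34 (a restartPolicy field allowed on regular containers, behind a feature gate); exact field names and permitted values may differ in your cluster version, and the KEP's exit-code-based restart rules are not shown.

apiVersion: v1
kind: Pod
metadata:
  name: per-container-restart-demo
spec:
  restartPolicy: Never              # Pod-level policy: the Pod object is never recreated on failure
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    command: ["python", "train.py"]   # placeholder command
    # KEP-5307 (alpha in 1.34): per-container override of the Pod-level policy.
    # The kubelet restarts this container in place, preserving the Pod's
    # identity, IP, and volumes. On clusters without the feature, this field
    # is only accepted on init containers (native sidecars).
    restartPolicy: Always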

KEP-5532: Restart All Containers on Container Exit (Kubernetes 1.35)

KEP-5532 extends this capability to coordinated restarts:

  • Restarting all containers in a Pod when a specific container exits
  • Restarting init containers and sidecars as part of the Pod lifecycle
  • Enabling Pod-level restart coordination without Pod recreation
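
The KEP-5532 API surface is still settling, so rather than guessing at field names, the sketch below only shows the kind of Pod it is designed for: a trainer plus a metrics sidecar that should restart as a unit. It uses only fields that exist today; the sidecar image name and command are placeholders.

# Without KEP-5532: if "trainer" exits, the kubelet restarts only "trainer",
# and the sidecar keeps running with stale per-run state.
# With KEP-5532: the Pod can be configured so that a trainer exit restarts
# every container, including the sidecar, without recreating the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: coordinated-restart-demo
spec:
  restartPolicy: Always
  initContainers:
  - name: metrics-sidecar                       # native sidecar: init container with restartPolicy Always
    image: registry.example.com/metrics:latest  # placeholder image
    restartPolicy: Always
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    command: ["python", "train.py"]             # placeholder command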

Real-World Results: JobSet In-Place Restarts

The JobSet team developed an in-place restart prototype that demonstrates dramatic performance improvements:

Metric                 Traditional Restart   In-Place Restart   Improvement
Restart time           2 min 10 sec          10 sec             92% faster
Test scale             5,000 nodes           5,000 nodes        Same
Scheduling overhead    High                  None               Eliminated
Pod recreation         Required              Not required       Avoided

For detailed design information, see the JobSet in-place restart design document.

Why This Matters for AI Workloads

1. Distributed Training Recovery

Large-scale distributed training jobs (PyTorch DDP, TensorFlow MultiWorkerMirroredStrategy) are especially sensitive to restart latency:

  • Checkpoint recovery: After a failure, all workers must restart from the latest checkpoint. In-place restarts make worker recovery more than 10× faster (2 min 10 sec → 10 sec in the 5,000-node test).
  • Gradient synchronization: Training can only proceed when all workers are running. Faster restarts mean less wasted GPU time.
  • Cost savings: On expensive GPU clusters ($2–10 per GPU-hour), saving 2 minutes per restart quickly adds up.
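
To make the cost point concrete, here is a back-of-the-envelope calculation with assumed numbers: a 512-GPU training job at $4 per GPU-hour, using the restart times measured above.

Traditional restart: 512 GPUs × 130 s ≈ 18.5 idle GPU-hours ≈ $74 per restart
In-place restart:    512 GPUs × 10 s  ≈ 1.4 idle GPU-hours  ≈ $6 per restart
At 5 restarts per day, that is roughly $340 saved per day, or about $10,000 per month.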

2. Job Dependencies

Many AI pipelines have complex job dependencies. When a job restarts:

  • Downstream jobs wait for upstream completion
  • Gang scheduling constraints require all workers to be present
  • Network connections must be preserved for collective operations

In-place restarts preserve Pod identity and network connections, minimizing disruption to the overall pipeline.

3. Resource Efficiency

Traditional restarts involve:

  • Scheduler load: Finding nodes for potentially thousands of Pods
  • API server load: Creating and deleting Pod objects
  • Node preparation: Image pulls, volume mounts, init containers

In-place restarts eliminate all of this overhead, reserving resources for actual workloads.
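
As a rough illustration at the scale of the test above, assuming one Pod per node across 5,000 nodes:

Traditional restart: ~5,000 Pod deletions, ~5,000 Pod creations, and ~5,000 scheduling and binding decisions, plus per-node image pulls and volume mounts
In-place restart:    zero Pod object writes and zero scheduling decisions; each kubelet simply restarts its local container processes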

How It Works

Before: Traditional Restart Flow

Trigger job restart
Delete all Pods → wait for termination (30s+)
Create new Pods → wait for scheduling (30s+)
Pull images (if needed) → start containers (60s+)
Total: ~2 min 10 sec

After: In-Place Restart Flow

Trigger job restart
Send container exit signal → containers restart in place (10s)
Total: ~10 sec

Key differences:

  1. No Pod deletion: Pod objects are preserved, maintaining identity
  2. No re-scheduling: Pods remain on their current nodes
  3. No image pulls: Images are already cached on the node
  4. Immediate restart: Container processes restart directly

Implementation Considerations

When to Use In-Place Restarts

  • Transient failures: Container crashes, OOM kills, network timeouts
  • Configuration updates: Restarting to pick up new environment variables
  • Checkpoint recovery: Resuming training from saved state
  • Rolling restarts: Gracefully restarting workers in sequence

When Traditional Restarts Are Required

  • Node failures: Pods must move to healthy nodes
  • Resource changes: Pods need more or fewer resources (consider VPA)
  • Image updates: A new container image is required
  • Topology changes: Pods need different placement

Integrating with JobSet

JobSet can leverage in-place restarts along the lines of the snippet below; the exact fields are still being finalized in the JobSet in-place restart design:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: distributed-training
spec:
  replicatedJobs:
  - name: workers
    replicas: 8
    template:
      spec:
        template:
          spec:
            restartPolicy: Always  # Enable in-place restarts
            containers:
            - name: trainer
              image: pytorch/pytorch:latest

The Broader Co-evolution Pattern

This JobSet improvement is a classic example of co-evolution in cloud-native AI:

Kubernetes Capability    Project Adoption      Benefit
In-place restart         JobSet                92% faster recovery
Gang scheduling (1.35)   Kueue, LWS            All-or-nothing placement
DRA (1.34 GA)            NVIDIA GPU Operator   Flexible device allocation
Workload API (1.35)      Volcano, YuniKorn     Native workload support

As Kubernetes continues to add AI-friendly features, we expect more projects to adopt them, creating a virtuous cycle of improvement.

Getting Started

Prerequisites

  • Kubernetes 1.34+ (for KEP-5307)
  • Kubernetes 1.35+ (for KEP-5532 Pod-level restarts)
  • A JobSet version that supports in-place restarts (check the latest release)

Enable Feature Gates

# Enable KEP-5307 (Container Restart Policy, 1.34+) on kubelet
--feature-gates=ContainerRestartPolicy=true

# Enable KEP-5532 (Restart All Containers, 1.35+) on kubelet
--feature-gates=RestartAllContainersOnContainerExits=true
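
If your kubelets are managed through a configuration file rather than command-line flags, the same gates can be set in the KubeletConfiguration. The gate names below are simply the ones from the flags above; confirm them against the release notes for your Kubernetes version.

# KubeletConfiguration equivalent (file path varies by distribution)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ContainerRestartPolicy: true                  # KEP-5307, 1.34+
  RestartAllContainersOnContainerExits: true    # KEP-5532, 1.35+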

Test In-Place Restarts

  1. Deploy a JobSet with restartPolicy: Always
  2. Trigger a container restart (e.g., kubectl exec ... -- kill -TERM 1)
  3. Observe the restart time compared to Pod recreation

Future Roadmap

In-place restart capabilities continue to evolve:

  • KEP-5307 graduation: Moving toward Beta/GA
  • KEP-5532 enhancements: More robust Pod-level restart control
  • JobSet integration: Native support for in-place restart policies
  • Observability: Better visibility into restart events
  • Kueue integration: Workload-aware restart handling

Conclusion

The JobSet in-place restart optimization showcases the power of co-evolution in the Kubernetes ecosystem. By adopting upstream Kubernetes capabilities, projects can achieve significant performance gains:

  • 92% faster restarts (2 min 10 sec → 10 sec)
  • Zero scheduling overhead
  • Preserved Pod identity and networking
  • Reduced API server load

This is just one example of how the Kubernetes community and downstream projects collaborate to improve AI workload efficiency. As more AI-related features land in Kubernetes, we can expect JobSet, Ray, LWS, and others to deliver even more optimizations.

The future of AI infrastructure is co-evolution—and it’s already happening.
