NUMA Affinity Scheduling
A NUMA node is the basic building block of a Non-Uniform Memory Access (NUMA) architecture, and a single Kubernetes Node may contain multiple NUMA nodes. Accessing memory across NUMA nodes introduces extra latency, so aligning task scheduling and memory allocation with the NUMA topology improves memory access efficiency and overall performance.
Use Cases
NUMA affinity scheduling is commonly used for compute-intensive jobs sensitive to CPU parameters or scheduling latency, such as scientific computing, video decoding, animation rendering, and big data offline processing.
Scheduling Policies
Pod scheduling can adopt the following NUMA placement policies; see the Pod scheduling behavior documentation for details on each policy's effect (a minimal sketch of how a policy is selected follows the list):
- `single-numa-node`: Pods are scheduled only onto nodes whose topology manager policy is set to `single-numa-node`, and CPUs are allocated within a single NUMA node. If no node meets this condition, the Pod is not scheduled.
- `restricted`: Pods are scheduled onto nodes whose topology manager policy is set to `restricted`, and CPUs are allocated within the same set of NUMA nodes. If no node satisfies this, scheduling fails.
- `best-effort`: Pods are scheduled onto nodes whose topology manager policy is set to `best-effort`, and the scheduler tries to place CPUs within a single NUMA node. If no node fully meets this, the best available node is chosen.
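These policies are selected per workload. For a Volcano Job the value goes in the task-level `topologyPolicy` field (shown in full in Example 2 below); for plain Kubernetes workloads it goes in the `volcano.sh/numa-topology-policy` Pod annotation (Example 1). A minimal sketch of the task-level form, with only the relevant fields shown:

```yaml
# Minimal sketch of a Volcano Job task selecting a NUMA placement policy;
# see Example 2 below for a complete Job manifest.
tasks:
  - name: "test"
    replicas: 1
    topologyPolicy: single-numa-node   # or: best-effort, restricted, none
```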
Scheduling Principle
When a Pod specifies a topology policy, Volcano filters nodes in two steps, first by policy and then by CPU topology (a short sketch of the Step 1 match requirement follows the table):
1. Filter nodes based on the Pod's Volcano topology policy.
2. From the remaining nodes, select those whose CPU topology satisfies the policy, and schedule accordingly.
Pod Topology Policy | Step 1: Filter Nodes by Topology Policy | Step 2: CPU Topology Filter and Scheduling Behavior |
---|---|---|
none | All nodes allowed: none, best-effort, restricted, single-numa-node | No CPU topology filtering |
best-effort | Only nodes with best-effort policy allowed | Prefer single NUMA node allocation; if unavailable, allow multiple NUMA nodes to satisfy CPU request |
restricted | Only nodes with restricted policy allowed | Strict: if the request fits within a single NUMA node, CPUs must be allocated from that NUMA node only, and the Pod is unschedulable if that allocation cannot be made; only when no single NUMA node can hold the request may multiple NUMA nodes be used |
single-numa-node | Only nodes with single-numa-node policy allowed | CPU allocation strictly within a single NUMA node |
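In other words, for any policy other than `none`, the value set on the Pod must match the topology manager policy configured on the node's kubelet, or the node is filtered out in Step 1. A minimal sketch of the two settings that have to agree (kubelet configuration itself is covered in the next section):

```yaml
# Pod side: the Volcano topology policy annotation (only the relevant fields shown).
metadata:
  annotations:
    volcano.sh/numa-topology-policy: restricted
# Node side: the kubelet must run with --topology-manager-policy=restricted;
# nodes configured with any other policy fail Step 1 for this Pod.
```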
Configuring NUMA Affinity Scheduling
- Set the topology policy in the workload spec (the `volcano.sh/numa-topology-policy` Pod annotation) or in the Volcano Job task spec (the `topologyPolicy` field), as shown in the Usage Examples below.
- Configure the kubelet's topology manager policy by setting the `--topology-manager-policy` parameter (a configuration sketch follows this list). Supported values are:
  - `none` (default)
  - `best-effort`
  - `restricted`
  - `single-numa-node`
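The same policy can also be set through the kubelet configuration file instead of the command-line flag. The sketch below assumes the common kubeadm layout (`/var/lib/kubelet/config.yaml`); it also sets `cpuManagerPolicy: static`, which exclusive CPU pinning requires and which matches the `"policyName":"static"` shown in the cpu_manager_state output later on this page. Treat the path and chosen values as illustrative for your environment:

```yaml
# Illustrative kubelet configuration fragment (often /var/lib/kubelet/config.yaml).
# Restart the kubelet after changing it; switching cpuManagerPolicy usually also
# requires removing the old /var/lib/kubelet/cpu_manager_state file first.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                  # required for exclusive CPU pinning
topologyManagerPolicy: single-numa-node   # none | best-effort | restricted | single-numa-node
```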
Usage Examples
- Example 1: Configure NUMA affinity in a stateless workload.

```yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  name: numa-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: numa-test
  template:
    metadata:
      labels:
        app: numa-test
      annotations:
        volcano.sh/numa-topology-policy: single-numa-node # set topology policy
    spec:
      containers:
        - name: container-1
          image: nginx:alpine
          resources:
            requests:
              cpu: 2 # must be an integer and match limits
              memory: 2048Mi
            limits:
              cpu: 2 # must be an integer and match requests
              memory: 2048Mi
      imagePullSecrets:
        - name: default-secret
```
- Example 2: Create a Volcano Job using NUMA affinity (commands for applying both examples follow the manifest).

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vj-test
spec:
  schedulerName: volcano
  minAvailable: 1
  tasks:
    - replicas: 1
      name: "test"
      topologyPolicy: best-effort # set topology policy
      template:
        spec:
          containers:
            - image: alpine
              command: ["/bin/sh", "-c", "sleep 1000"]
              imagePullPolicy: IfNotPresent
              name: running
              resources:
                limits:
                  cpu: 20
                  memory: "100Mi"
          restartPolicy: OnFailure
```
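Both manifests are applied in the usual way; the file names below are illustrative:

```bash
# Apply the Deployment from Example 1 and the Volcano Job from Example 2
# (save them locally first; the file names are illustrative).
kubectl apply -f numa-test.yaml
kubectl apply -f vj-test.yaml
kubectl get pod          # confirm the Pods reach the Running state
```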
NUMA Scheduling Analysis
Assuming the following NUMA node setup:
Node | Node Topology Manager Policy | Allocatable CPUs on NUMA node 0 | Allocatable CPUs on NUMA node 1 |
---|---|---|---|
node-1 | single-numa-node | 16 CPUs | 16 CPUs |
node-2 | best-effort | 16 CPUs | 16 CPUs |
node-3 | best-effort | 20 CPUs | 20 CPUs |
- In Example 1, the Pod requests 2 CPUs and sets the policy to `single-numa-node`, so it will be scheduled on `node-1`, the only node whose topology manager policy is `single-numa-node`.
- In Example 2, the Pod requests 20 CPUs and sets the policy to `best-effort`. It will be scheduled on `node-3`, because `node-3` can fulfill the 20-CPU request within a single NUMA node, while `node-2` would need to allocate across multiple NUMA nodes (a placement check follows this list).
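The predicted placement can be confirmed from the NODE column of the Pod listing; a minimal check, assuming the example workloads from this page are running:

```bash
kubectl get pod -o wide   # the NODE column should show node-1 for Example 1 and node-3 for Example 2
```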
Check Current Node CPU Info
Use `lscpu` to view CPU and NUMA node information:
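On a node laid out like `node-1` or `node-2` above (two NUMA nodes with 16 CPUs each), the NUMA-related lines would look roughly like this; the exact values depend on the hardware:

```bash
lscpu | grep -i numa
# NUMA node(s):        2
# NUMA node0 CPU(s):   0-15
# NUMA node1 CPU(s):   16-31
```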
Check Current CPU Allocation
Check CPU allocation on the node:

```bash
cat /var/lib/kubelet/cpu_manager_state
```

```json
{"policyName":"static","defaultCpuSet":"0,10-15,25-31","entries":{"777870b5-c64f-42f5-9296-688b9dc212ba":{"container-1":"16-24"},"fb15e10a-b6a5-4aaa-8fcd-76c1aa64e6fd":{"container-1":"1-9"}},"checksum":318470969}
```
In this example, two containers are running on the node: one using CPUs 1-9 on NUMA node 0, and the other using CPUs 16-24 on NUMA node 1.
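If needed, the CPU-to-NUMA-node mapping can be cross-checked against sysfs (or against the `lscpu` output above); a quick check on a node laid out like this example:

```bash
cat /sys/devices/system/node/node0/cpulist   # e.g. 0-15  -> CPUs 1-9 belong to NUMA node 0
cat /sys/devices/system/node/node1/cpulist   # e.g. 16-31 -> CPUs 16-24 belong to NUMA node 1
```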