Skip to content

Kubernetes 1.28 Shockingly Released, Sidecar Containers are Coming

Announcing the release of Kubernetes v1.28 Planternetes, the second release of 2023!

This release consists of 45 enhancements. Of those enhancements, 19 are entering Alpha, 14 have graduated to Beta, and 12 have graduated to Stable.

Kubernetes v1.28: Planternetes

The theme for Kubernetes v1.28 is Planternetes.

planternetes

Each Kubernetes release is the culmination of the hard work of thousands of individuals from our community. The people behind this release come from a wide range of backgrounds, some of us industry veterans, parents, others students and newcomers to open-source. We combine our unique experience to create a collective artifact with global impact.

Much like a garden, our release has ever-changing growth, challenges and opportunities. This theme celebrates the meticulous care, intention and efforts to get the release to where we are today. Harmoniously together, we grow better.

What's New (Major Themes)

Changes to supported skew between control plane and node versions

This enables testing and expanding the supported skew between core node and control plane components by one version from n-2 to n-3, so that node components (kubelet and kube-proxy) for the oldest supported minor version work with control plane components (kube-apiserver, kube-scheduler, kube-controller-manager, cloud-controller-manager) for the newest supported minor version.

This is valuable for end users as control plane upgrade will be a little faster than node upgrade, which are almost always going to be the longer with running workloads.

The Kubernetes yearly support period already makes annual upgrades possible. Users can upgrade to the latest patch versions to pick up security fixes and do 3 sequential minor version upgrades once a year to "catch up" to the latest supported minor version.

However, since the tested/supported skew between nodes and control planes is currently limited to 2 versions, a 3-version upgrade would have to update nodes twice to stay within the supported skew.

Generally available: recovery from non-graceful node shutdown

If a node shuts down unexpectedly or ends up in a non-recoverable state (perhaps due to hardware failure or unresponsive OS), Kubernetes allows you to clean up afterwards and allow StatefulSets to restart on a different node. For Kubernetes v1.28, that's now a stable feature.

This allows StatefulSets to failover to a different node successfully after the original node is shut down or in a non-recoverable state, such as the hardware failure or broken OS.

Versions of Kubernetes earlier than v1.20 lacked handling for node shutdown on Linux, the kubelet integrates with systemd and implements graceful node shutdown (beta, and enabled by default). However, even an intentional shutdown might not get handled well that could be because:

  • the node runs Windows
  • the node runs Linux, but uses a different init (not systemd )
  • the shutdown does not trigger the system inhibitor locks mechanism
  • because of a node-level configuration error (such as not setting appropriate values for shutdownGracePeriod and shutdownGracePeriodCriticalPods ).

When a node shutdowns or fails, and that shutdown was not detected by the kubelet, the pods that are part of a StatefulSet will be stuck in terminating status on the shutdown node. If the stopped node restarts, the kubelet on that node can clean up ( DELETE ) the Pods that the Kubernetes API still sees as bound to that node. However, if the node stays stopped - or if the kubelet isn't able to start after a reboot - then Kubernetes may not be able to create replacement Pods. When the kubelet on the shut-down node is not available to delete the old pods, an associated StatefulSet cannot create a new pod (which would have the same name).

There's also a problem with storage. If there are volumes used by the pods, existing VolumeAttachments will not be disassociated from the original - and now shut down - node so the PersistentVolumes used by these pods cannot be attached to a different, healthy node. As a result, an application running on an affected StatefulSet may not be able to function properly. If the original, shut down node does come up, then their pods will be deleted by its kubelet and new pods can be created on a different running node. If the original node does not come up (common with an immutable infrastructure design), those pods would be stuck in a Terminating status on the shut-down node forever.

For more information on how to trigger cleanup after a non-graceful node shutdown, read non-graceful node shutdown.

Improvements to CustomResourceDefinition validation rules

The Common Expression Language (CEL) can be used to validate custom resources. The primary goal is to allow the majority of the validation use cases that might once have needed you, as a CustomResourceDefinition (CRD) author, to design and implement a webhook. Instead, and as a beta feature, you can add validation expressions directly into the schema of a CRD.

CRDs need direct support for non-trivial validation. While admission webhooks do support CRDs validation, they significantly complicate the development and operability of CRDs.

For more information, read validation rules in the CRD documentation.

ValidatingAdmissionPolicies graduate to beta

Common Expression language for admission control is customizable, in-process validation of requests to the Kubernetes API server as an alternative to validating admission webhooks.

This builds on the capabilities of the CRD Validation Rules feature that graduated to beta in 1.25 but with a focus on the policy enforcement capabilities of validating admission control.

This will lower the infrastructure barrier to enforcing customizable policies as well as providing primitives that help the community establish and adhere to the best practices of both K8s and its extensions.

To use ValidatingAdmissionPolicies, you need to enable the admissionregistration.k8s.io/v1beta1 API group in your cluster's control plane.

Match conditions for admission webhooks

Kubernetes v1.27 lets you specify match conditions for admission webhooks, which lets you narrow the scope of when Kubernetes makes a remote HTTP call at admission time. The matchCondition field for ValidatingWebhookConfiguration and MutatingWebhookConfiguration is a CEL expression that must evaluate to true for the admission request to be sent to the webhook.

In Kubernetes v1.28, that field moved to beta, and it's enabled by default.

To learn more, see matchConditions in the Kubernetes documentation.

Beta support for enabling swap space on Linux

This adds swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap.

There are two distinct types of users for swap, who may overlap:

  • Node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbor issues.

  • Application developers, who have written applications that would benefit from using swap memory.

Mixed version proxy (alpha)

When a cluster has multiple API servers at mixed versions (such as during an upgrade/downgrade or when runtime-config changes and a rollout happens), not every apiserver can serve every resource at every version.

For Kubernetes v1.28, you can enable the mixed version proxy within the API server's aggregation layer. The mixed version proxy finds requests that the local API server doesn't recognize but another API server inside the control plan is able to support. Having found a suitable peer, the aggregation layer proxies the request to a compatible API server; this is transparent from the client's perspective.

When an upgrade or downgrade is performed on a cluster, for some period of time the API servers within the control plane may be at differing versions; when that happens, different subsets of the API servers are able to serve different sets of built-in resources (different groups, versions, and resources are all possible). This new alpha mechanism lets you hide that skew from clients.

Source code reorganization for control plane components

Kubernetes contributors have begun to reorganize the code for the kube-apiserver to build on a new staging repository that consumes k/apiserver but has a bigger, carefully chosen subset of the functionality of kube-apiserver such that it is reusable.

This is a gradual reorganization; eventually there will be a new git repository with generic functionality abstracted from Kubernetes' API server.

Support for CDI injection into containers (alpha)

CDI provides a standardized way of injecting complex devices into a container (i.e. devices that logically require more than just a single /dev node to be injected for them to work). This new feature enables plugin developers to utilize the CDIDevices field added to the CRI in 1.27 to pass CDI devices directly to CDI enabled runtimes (of which containerd and crio-o are in recent releases).

API awareness of sidecar containers (alpha)

Kubernetes 1.28 introduces an alpha restartPolicy field for init containers, and uses that to indicate when an init container is also a sidecar container. The will start init containers with restartPolicy: Always in the order they are defined, along with other init containers. Instead of waiting for that sidecar container to complete before starting the main container(s) for the Pod, the kubelet only waits for the sidecar init container to have started.

The condition for startup completion will be that the startup probe succeeded (or if no startup probe is defined) and postStart handler is completed. This condition is represented with the field Started of ContainerStatus type. See the section "Pod startup completed condition" for considerations on picking this signal.

For init containers, you can either omit the restartPolicy field, or set it to Always . Omitting the field means that you want a true init container that runs to completion before application startup.

Sidecar containers do not block Pod completion: if all regular containers are complete, sidecar containers in that Pod will be terminated.

For sidecar containers, the restart behavior is more complex than for init containers. In a Pod with restartPolicy set to Never , a sidecar container that fails during Pod startup will not be restarted and the whole Pod is treated as having failed. If the Pod's restartPolicy is Always or OnFailure , a sidecar that fails to start will be retried.

Once the sidecar container has started (process running, postStart was successful, and any configured startup probe is passing), and then there's a failure, that sidecar container will be restarted even when the Pod's overall restartPolicy is Never or OnFailure . Furthermore, sidecar containers will be restarted (on failure or on normal exit) even during Pod termination.

To learn more, read API for sidecar containers.

Automatic, retroactive assignment of a default StorageClass graduates to stable

This feature makes it easier to change the default StorageClass by allowing the default storage class assignment to be retroactive for existing unbound persistent volume claims without any storage class assigned.

This changes the behavior of default storage class assignment to be retroactive for existing unbound persistent volume claims without any storage class assigned. This changes the existing Kubernetes behavior slightly, which is further described in the sections below.

Pod replacement policy for Jobs (alpha)

Kubernetes 1.28 adds a new field for the Job API that allows you to specify if you want the control plane to make new Pods as soon as the previous Pods begin termination (existing behavior), or only once the existing pods are fully terminated (new, optional behavior).

Many common machine learning frameworks, such as Tensorflow and JAX, require unique pods per index. With the older behaviour, if a pod that belongs to an Indexed Job enters a terminating state (due to preemption, eviction or other external factors), a replacement pod is created but then immediately fails to start due to the clash with the old pod that has not yet shut down.

Having a replacement Pod appear before the previous one fully terminates can also cause problems in clusters with scarce resources or with tight budgets. These resources can be difficult to obtain so pods may only be able to find nodes once the existing pods have been terminated. If cluster autoscaler is enabled, early creation of replacement Pods might produce undesired scale-ups.

To learn more, read Delayed creation of replacement pods in the Job documentation.

Job retry backoff limit, per index (alpha)

This extends the Job API to support indexed jobs where the backoff limit is per index, and the Job can continue execution despite some of its indexes failing.

Currently, the indexes of an indexed job share a single backoff limit. When the job reaches this shared backoff limit, the job controller marks the entire job as failed, and the resources are cleaned up, including indexes that have yet to run to completion.

As a result, the existing implementation did not cover the situation where the workload is truly embarrassingly parallel: each index is fully independent of other indexes.

For instance, if indexed jobs were used as the basis for a suite of long-running integration tests, then each test run would only be able to find a single test failure.

For more information, read Handling Pod and container failures in the Kubernetes documentation.

CRI container and pod statistics without cAdvisor

This encompasses two related pieces of work (changes to the kubelet's /metrics/cadvisor endpoint and improvements to the replacement summary API).

There are two main APIs that consumers use to gather stats about running containers and pods: summary API and /metrics/cadvisor . The Kubelet is responsible for implementing the summary API, and cadvisor is responsible for fulfilling /metrics/cadvisor .

This enhances CRI implementations to be able to fulfill all the stats needs of Kubernetes. At a high level, there are two pieces of this:

  • It enhances the CRI API with enough metrics to supplement the pod and container fields in the summary API directly from CRI.

  • It enhances the CRI implementations to broadcast the required metrics to fulfill the pod and container fields in the /metrics/cadvisor endpoint.

Feature graduations and deprecations in Kubernetes v1.28

Graduations to stable

This release includes a total of 12 enhancements promoted to Stable:

Deprecations and removals

Removals:

Deprecations:

Release Notes

The complete details of the Kubernetes v1.28 release are available in our release notes.

Availability

Kubernetes v1.28 is available for download on GitHub. To get started with Kubernetes, you can run local Kubernetes clusters using minikube, kind, etc. You can also easily install v1.28 using kubeadm.

DaoCloud Community Contributions

In this release, DaoCloud made significant contributions to sig-node, sig-scheduling, sig-storage, and kubeadm. Here are the specific functional points:

  • [API] kube-apiserver no longer allows users to re-enable API resources belonging to the deprecated policy/v1beta1 group.
  • [Networking] Added warnings for conflicting ports in the update/patch of Pod container ports.
  • [Networking] Introduced Alpha support for the "status.hostIPs" field in the Pod API, along with downward API support. To use either of these, the Alpha feature gate "PodHostIPs" needs to be enabled.
  • [Scheduling] When the scheduler executes the Reserve plugin and the plugin's scheduling result is not schedulable, the scheduler records the name of the plugin in the list of failed plugins. This allows for re-evaluation when relevant resource events are observed to determine whether the Pod can be placed in the schedulable queue.
  • [Kubeadm] crictl pull should use the -i flag to set the endpoint of the container image service.
  • [Apps] When the Cronjob controller fails to coordinate the creation of Job resources due to a Terminating namespace status, an error is returned instead of retrying.
  • [Apps] Fixed the issue where the kubectl create job test-job --from=cronjob/a-cronjob command doesn't update the Cronjob status.lastSuccessfulTime .
  • [Apps] Counting metrics for all pod deletion actions with force_delete_pods_total and force_delete_pod_errors_total indicators.
  • [Storage] Deprecated support for CSI migration of Ceph RBD volumes.
  • [Instrumentation] Implemented structured and contextual logging in kube-controller-manager and kube-scheduler.

During the v1.28 release, DaoCloud made contributions to nearly a hundred issue fixes and feature developments, with around 98 commits from 15 contributors out of over 200 contributors in total. For more details, please refer to the contribution list.

DaoCloud's developers achieved significant milestones during the Kubernetes v1.28 release cycle:

  • Li Xin translated many recent official website blogs and documents and became the maintainer of SIG-Docs-Zh.
  • DaoCloud has two other Chinese documentation maintainers, windsonsea and Mengjiao.Liu.
  • Yin Na became an Approver for SIG-scheduler.
  • Xu Junjie became an Approver for the kubeadm project.
  • Cai Wei became a Reviewer for the Containerd community and also organized and hosted KCD Beijing 2023. At KCD Beijing, Zhang Shiming presented on the community hot topic project KWOK (Kubernetes WithOut Kubelet) with a focus on "Low-Cost Cluster Simulation". Jiang Xingyan, as a co-speaker, shared the theme of "Karmada Cross-Cluster Scaling Scenarios and Implementation Analysis" in another session.

Kubecon 2023

At the upcoming KubeCon China 2023 global summit in Shanghai, DaoCloud will present over a dozen topics related to cloud native, including several maintainer sessions, cloud native practices in AI, and implementations in multi-cloud and financial automotive industries. Stay tuned for more details.

Liu Qijun, Pan Yuanhang, and Xu Junjie were selected as members of the 2023 KubeCon Program Committee. Meanwhile, Han Xiaopeng was also invited to participate in the IstioCon Program Committee.

Open Source Booth

Clusterpedia (https://clusterpedia.io)

The name is inspired by Wikipedia, and it embodies the core concept of Clusterpedia as a multi-cluster encyclopedia. By aggregating resources from multiple clusters and providing more powerful search capabilities on top of the Kubernetes OpenAPI compatibility, users can quickly and conveniently access any desired resource in a multi-cluster environment. Of course, Clusterpedia's capabilities go beyond just search and view. In the future, it will also support simple resource control, just like how wiki supports editing entries.

Merbridge (https://merbridge.io)

Merbridge is designed for service mesh and uses eBPF to replace traditional iptables for intercepting and forwarding traffic, making the traffic interception and forwarding capabilities of service mesh more efficient.

Hwameistor (https://hwameistor.io)

Hwameistor is a Kubernetes-native Container Attached Storage (CAS) solution that manages HDD, SSD, and NVMe disks as a unified local storage resource pool. It provides distributed local data volume services using the CSI architecture, enabling data persistence for stateful cloud native applications or components.

Booths for Containerd, Kubespray, Piraeus, SIG-Scheduling, SIG-Node, and more will also have DaoCloud members on-site to answer questions and engage in discussions. We welcome everyone to come and exchange ideas.

Keynote Speeches

  • On 27th September at 11:00 AM, Kay Yan, Kante Yin, Alex Zheng, and XingYan Jiang will deliver the following keynote speeches:

    • "SIG Cluster Lifecycle: What's New in Kubespray" by Kay Yan from DaoCloud and Peng Liu from Jinan Inspur Data Technology Co, Ltd.\ Topic details: Link\ Room: Maintainer Track | 3M5A Meeting Room on the 3rd floor.

    • "Sailing Ray Workloads with KubeRay and Kueue in Kubernetes" by Jason Hu from Volcano Engine and Kante Yin from DaoCloud\ Topic details: Link\ Room: 3F ROOM 307

    • "Best Practice for Cloud Native Storage in Finance Industry" by Yang Cao and Jie Chu from Shanghai Pudong Development Bank, and Alex Zheng from DaoCloud\ Topic details: Link\ Room: 2F ROOM 2

    • "Break Through Cluster Boundaries to Autoscale Workloads Across Them on a Large Scale" by Wei Jiang from Huawei Cloud and XingYan Jiang from DaoCloud\ Topic details: Link\ Room: 3F ROOM 305A

  • On 27th September at 2:45 PM, Iceber Gu will deliver the following keynote speech:

    • "Project Update and Deep Dive: Containerd" by Wei Fu from Microsoft and Iceber Gu from DaoCloud\ Topic details: Link\ Room: Maintainer Track | 3M3 Meeting Room on the 3rd floor.
  • On 27th September from 2:55 to 3:00 PM, Peter Pan will deliver the following lightning talk:

    • "Lightning Talk: Sailing Kubernetes Operation with AI Power" by Peter Pan from DaoCloud\ Topic details: Link\ Room: 3F ROOM 302
  • On 27th September at 3:50 PM, Mengjiao Liu and Kante Yin will deliver the following keynote speeches:

    • "SIG Instrumentation: Intro, Deep Dive and Recent Developments" by Mengjiao Liu from DaoCloud and Shivanshu Raj Shrivastava from Tetrate\ Topic details: Link\ Room: Maintainer Track | 3M3 Meeting Room on the 3rd floor.

    • "SIG-Scheduling Intro & Deep Dive" by Qingcan Wang from Shopee and Kante Yin from DaoCloud\ Topic details: Link\ Room: 3F ROOM 307

  • On 28th September at 11:00 AM, Kante Yin will deliver the following keynote speech:

    • "Intro and Deep Dive into SIG Network" by Kante Yin from DaoCloud\ Topic details: Link\ Room: Maintainer Track | 3M5B Meeting Room on the 3rd floor.
  • On 28th September at 2:45 PM, Yubin Yao will deliver the following keynote speech:

    • "Using Pulsar to Build Event-driven Microservices in Kubernetes" by Yubin Yao from DaoCloud\ Topic details: Link\ Room: 3F ROOM 305A

Comments