
FAQs

This page lists some frequently asked questions about container management to help you troubleshoot common issues.

  1. Helm application installation failed with "OOMKilled" error message

    failure case

    As shown in the figure, the container management module automatically creates and launches a Job that installs the specified application. In v0.6.0, unreasonable resource settings on this Job caused OOM and broke application installation. This bug was fixed in v0.6.1, but after upgrading, the fix only applies to newly created or newly connected clusters; existing clusters must be adjusted manually.

    Click to see how to adjust the settings
    • Run the following commands in the global service cluster.
    • Find the corresponding cluster (this article uses skoala-dev as an example) and get its skoala-dev-setting configmap.
    • The change takes effect once the configmap is updated.

      kubectl get cm -n kpanda-system skoala-dev-setting -o yaml

      apiVersion: v1
      data:
        clusterSetting: '{"plugins":[{"name":1,"intelligent_detection":true},{"name":2,"enabled":true,"intelligent_detection":true},{"name":3},{"name":6,"intelligent_detection":true},{"name":7,"intelligent_detection":true},{"name":8,"intelligent_detection":true},{"name":9,"intelligent_detection":true}],"network":[{"name":4,"enabled":true,"intelligent_detection":true},{"name":5,"intelligent_detection":true},{"name":10},{"name":11}],"addon_setting":{"helm_operation_history_limit":100,"helm_repo_refresh_interval":600,"helm_operation_base_image":"release-ci.daocloud.io/kpanda/kpanda-shell:v0.0.6","helm_operation_job_template_resources":{"limits":{"cpu":"50m","memory":"120Mi"},"requests":{"cpu":"50m","memory":"120Mi"}}},"clusterlcm_setting":{"enable_deletion_protection":true},"etcd_backup_restore_setting":{"base_image":"release.daocloud.io/kpanda/etcdbrctl:v0.22.0"}}'
      kind: ConfigMap
      metadata:
        labels:
          kpanda.io/cluster-plugins: ""
        name: skoala-dev-setting
        namespace: kpanda-system
        ownerReferences:
        - apiVersion: cluster.kpanda.io/v1alpha1
          blockOwnerDeletion: true
          controller: true
          kind: Cluster
          name: skoala-dev
          uid: f916e461-8b6d-47e4-906e-5e807bfe63d4
        uid: 8a25dfa9-ef32-46b4-bc36-b37b775a9632

      Modify clusterSetting -> helm_operation_job_template_resources to an appropriate value; the value used in v0.6.1 is cpu: 100m, memory: 400Mi.
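
      For example, you can edit the configmap in place (a sketch; skoala-dev is the example cluster name used above, and kubectl must point at the global service cluster):

      kubectl edit cm -n kpanda-system skoala-dev-setting
      # In the clusterSetting JSON, change helm_operation_job_template_resources to:
      #   "limits":{"cpu":"100m","memory":"400Mi"},"requests":{"cpu":"100m","memory":"400Mi"}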

  2. Permission issues between the container management module and the global management module

    Users often ask why they can or cannot see a particular cluster, and how to troubleshoot the related permission issues. Permissions in the container management module are divided into cluster permissions and namespace permissions; a user who is bound to either can view the corresponding cluster and its resources. For details, refer to the Cluster Permission Explanation.

    Container Management Permissions

    In the global management module, check user authorization: log in with the admin account, go to Global Management -> Users and Access Control -> Users, and find the corresponding user. In the Authorize User Group tab, if the user has a role with container management permissions, such as Admin or Kpanda Owner, they can see all clusters even if no cluster or namespace permissions are bound in container management. For more information, refer to the User Authorization documentation.

    Global Management User Authorization

    In the global management module, check workspace binding: log in with your own account, go to Global Management -> Workspaces and Hierarchies, and you will see the workspaces you are authorized to use. Click a workspace name.

    1. If the workspace was authorized to your account directly, you will see your account in the Authorization tab. Then check the Resource Group and Shared Resources tabs: if a resource group is bound to a namespace, or a shared resource is bound to a cluster, your account can see the corresponding cluster.

    2. If you were granted a global management-related role instead, your account will not appear in the Authorization tab, and you will not be able to see the cluster resources bound to the workspace in the container management module.

    Global Management Workspace Binding

  3. When installing applications with Helm, the kpanda-shell image cannot be pulled

    After an offline installation, connected clusters often fail to pull the kpanda-shell image, as shown in the image:

    Image Pull Failure

    In this case, go to the cluster management -> cluster settings page, open the advanced configuration tab, and change the Helm base image to a kpanda-shell image address that the cluster can pull.

    Modify Image
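
    If the page is not available, the same setting can also be changed in the <cluster name>-setting configmap shown in question 1, whose clusterSetting JSON contains addon_setting.helm_operation_base_image; a sketch, assuming the cluster is named skoala-dev:

      kubectl edit cm -n kpanda-system skoala-dev-setting
      # In the clusterSetting JSON, point addon_setting.helm_operation_base_image
      # at a kpanda-shell image address that the cluster can pull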

  4. Helm Chart interface does not display the latest Chart uploaded to the corresponding Helm Repo, as shown in the image:

    Template

    In this case, simply refresh the corresponding Helm repository.

    Refresh Repository

  5. A Helm application installation fails and gets stuck in the installing state, and the application cannot be deleted for reinstallation, as shown in the image:

    Deletion Failure

    In this case, simply go to the custom resources page, find the helmreleases.helm.kpanda.io CRD, and then delete the corresponding helmreleases CR.

    Find CR

    Delete CR
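
    If you prefer the command line, a sketch of the equivalent cleanup (the release name and namespace are placeholders):

      # List the HelmRelease custom resources to find the one stuck in installation
      kubectl get helmreleases.helm.kpanda.io -A
      # Delete the CR of the failed application, then reinstall it
      kubectl delete helmreleases.helm.kpanda.io <release-name> -n <namespace>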

  6. After removing node affinity and other scheduling policies from a workload, scheduling becomes abnormal, as shown in the image:

    Scheduling Abnormality

    In this case, the policies may not have been completely removed. Click Edit and delete all remaining policies.

    Edit

    Delete

    Normal Scheduling

  7. What is the logic behind Kcoral checking the Velero status of a working cluster?

    Detection

    • The working cluster has the standard Velero components installed in the velero namespace.
    • The velero control plane, specifically the velero deployment, is in a running state and meets the expected replica count.
    • The velero data plane, specifically the node agent, is in a running state and meets the expected replica count.
    • Velero successfully connects to the target MinIO (BSL status is Available).
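
    These conditions can also be checked manually with kubectl; a sketch, assuming a recent Velero release where the data plane runs as the node-agent daemonset:

      # Control plane: the deployment should be running with the expected replicas
      kubectl get deploy velero -n velero
      # Data plane: the node agent should be running on the expected nodes
      kubectl get ds node-agent -n velero
      # Storage: the BackupStorageLocation (MinIO) should report Available
      kubectl get backupstoragelocation -n velero
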
  8. When performing cross-cluster backup and restore with Kcoral, how does it determine the available clusters?

    When backing up and restoring applications across clusters with Kcoral, the restore page filters the list of clusters that can perform the cross-cluster restore based on the following logic:

    Filter

    • Clusters that do not have Velero installed are filtered out.
    • Clusters whose Velero status is abnormal are filtered out.
    • The clusters connected to the same MinIO and bucket as the target cluster are returned.

    Therefore, as long as the clusters are connected to the same MinIO and Bucket, and Velero is in a running state, cross-cluster backup (requires write permission) and restore can be performed.
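
    To confirm that two clusters point at the same MinIO and bucket, compare their BackupStorageLocation objects; a sketch, assuming the AWS/MinIO object store plugin, where the endpoint is recorded in spec.config.s3Url:

      # Run on each cluster and compare the bucket and S3 endpoint in the output
      kubectl get backupstoragelocation -n velero -o jsonpath='{range .items[*]}{.spec.objectStorage.bucket}{" "}{.spec.config.s3Url}{"\n"}{end}'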

  9. After uninstalling VPA, HPA, CronHPA, why do the corresponding elastic scaling records still exist?

    Even though the components were uninstalled through the Helm Addon market, the related records in the application elastic scaling interface remain, as shown in the image:

    Edit

    This is a limitation of helm uninstall: it does not remove the corresponding CRDs, so residual data is left behind. In this case, manually delete the corresponding CRDs to complete the cleanup.
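
    A sketch of the manual cleanup; the exact CRD names depend on the addon versions, so check what the first command returns before deleting anything (deleting a CRD also deletes all of its custom resources):

      # Find leftover autoscaling CRDs (the names matched here are the upstream defaults)
      kubectl get crd | grep -i -E 'verticalpodautoscaler|cronhorizontalpodautoscaler'
      # Example deletion for the upstream VPA CRD
      kubectl delete crd verticalpodautoscalers.autoscaling.k8s.io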

  10. Why does the console fail to open on clusters with lower versions?

    In Kubernetes clusters older than v1.18, opening the console fails because the CSR request fails. When the console is opened, the currently logged-in user requests a certificate in the target cluster through a CSR resource. If the cluster version is too low, or this functionality is not enabled in the controller manager, the certificate request fails and the connection to the target cluster cannot be established.

    Refer to the certificate request process.

    Solution:

    • If the cluster version is v1.18 or later, check whether kube-controller-manager has the CSR feature enabled and make sure the following controllers are running normally:

      ttl-after-finished,bootstrapsigner,csrapproving,csrcleaner,csrsigning
      
    • For lower-version clusters, the only solution is to upgrade the cluster.
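
    To check the flag, a sketch assuming a kubeadm-style control plane where kube-controller-manager runs as a static pod:

      # Inspect the --controllers flag on a control plane node
      grep -- '--controllers' /etc/kubernetes/manifests/kube-controller-manager.yaml
      # If the flag lists controllers explicitly, it must include
      # ttl-after-finished, bootstrapsigner, csrapproving, csrcleaner and csrsigning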

  11. How to reset a created cluster?

    Created clusters fall into two categories:

    • Clusters that failed to create: if creation failed because of incorrect parameters, select Retry on the failed installation, adjust the parameters, and create the cluster again.
    • Successfully created clusters: uninstall the cluster first, then recreate it. Cluster protection must be disabled before the cluster can be uninstalled.

    Disable Cluster Protection

    Uninstall Cluster

  12. Failure to install plugins when connecting to a cluster

    For clusters connected in offline environments, before installing plugins you need to configure the container runtime to skip TLS verification for the CRI proxy registry. This must be done on every node.

    1. For nodes using Docker, modify the file /etc/docker/daemon.json

    2. Add "insecure-registries": ["172.30.120.243","temp-registry.daocloud.io"], to the content after modification:

      Modify Configuration

    3. Reload systemd and restart Docker

      systemctl daemon-reload
      systemctl restart docker
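
    For reference, a minimal /etc/docker/daemon.json after the change looks like the following (keep any other keys already present in your file):

      {
        "insecure-registries": ["172.30.120.243", "temp-registry.daocloud.io"]
      }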
      
    1. For nodes using containerd, modify /etc/containerd/config.toml

    2. After modification, the content should be as follows:

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["https://registry-1.docker.io"]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."temp-registry.daocloud.io"]
      endpoint = ["http://temp-registry.daocloud.io"]
      [plugins."io.containerd.grpc.v1.cri".registry.configs."http://temp-registry.daocloud.io".tls]
      insecure_skip_verify = true
      

      Modify Configuration

    3. Pay attention to spaces and line breaks, make sure the configuration is correct, and after modification execute:

      systemctl restart containerd
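
    After the restart, you can verify that the node can pull images through the proxy registry; a sketch (the image path is only an example, use an image that exists in your registry):

      crictl pull temp-registry.daocloud.io/kpanda/kpanda-shell:v0.0.6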
      
  13. When creating a cluster, enabling Kernel Tuning for New Clusters in advanced settings causes cluster creation failure.

    1. Check if the conntrack kernel module is loaded by running the following command:

      lsmod | grep conntrack
      
    2. If the output is empty, the module is not loaded. Load it with the following command:

      modprobe ip_conntrack
      

    Note

    Upgrading the kernel module can also cause cluster creation failures.
