
Migrating Elasticsearch with HwameiStor in Practice

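In Kubernetes, whether a stateful application can be migrated after it has been deployed depends on the capabilities of the underlying CSI driver. Even so, unexpected situations such as unbalanced cluster resources sometimes make it necessary to move a stateful application to another node.

This article uses Elasticsearch (ES) as an example and follows the official HwameiStor migration guide to show how to migrate data-service middleware across nodes when HwameiStor provides the storage.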

Demo environment

The demo environment is described in three parts: the cluster nodes, the ES installation, and the PVCs:

[root@prod-master1 ~]# kubectl get node
NAME           STATUS   ROLES           AGE   VERSION
prod-master1   Ready    control-plane   15h   v1.25.4
prod-master2   Ready    control-plane   15h   v1.25.4
prod-master3   Ready    control-plane   15h   v1.25.4
prod-worker1   Ready    <none>          15h   v1.25.4
prod-worker2   Ready    <none>          15h   v1.25.4
prod-worker3   Ready    <none>          15h   v1.25.4
[root@prod-master1 ~]# kubectl get pods -o wide | grep es-cluster-masters-es-data
mcamel-common-es-cluster-masters-es-data-0 Running prod-worker1
mcamel-common-es-cluster-masters-es-data-1 Running prod-worker3
mcamel-common-es-cluster-masters-es-data-2 Running prod-worker2
[root@prod-master1 ~]# kubectl -n mcamel-system get pvc -l elasticsearch.k8s.elastic.co/statefulset-name=mcamel-common-es-cluster-masters-es-data
NAME                                                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                 AGE
elasticsearch-data-mcamel-common-es-cluster-masters-es-data-0   Bound    pvc-61776435-0df5-448f-abb9-4d06774ec0e8   35Gi       RWO            hwameistor-storage-lvm-hdd   15h
elasticsearch-data-mcamel-common-es-cluster-masters-es-data-1   Bound    pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c   35Gi       RWO            hwameistor-storage-lvm-hdd   15h
elasticsearch-data-mcamel-common-es-cluster-masters-es-data-2   Bound    pvc-955bd221-3e83-4bb5-b842-c11584bced10   35Gi       RWO            hwameistor-storage-lvm-hdd   15h

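Demo goal

Migrate the stateful application mcamel-common-es-cluster-masters-es-data-1 (hereafter "the demo application" or esdata-1), which currently runs on node prod-worker3, to node prod-master3.

Preparation

Identify the PV to migrate

Use the following commands to find the PV that backs the demo application esdata-1, so that it is clear which PV has to be migrated.

  1. Check the PVC bound to the demo application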

    [root@prod-master1 ~]# kubectl -n mcamel-system get pod mcamel-common-es-cluster-masters-es-data-1 -ojson | jq '.spec.volumes[0]'
    {
      "name": "elasticsearch-data",
      "persistentVolumeClaim": {
        "claimName": "elasticsearch-data-mcamel-common-es-cluster-masters-es-data-1"
      }
    }
    
  2. Check the PV bound to that PVC

    [root@prod-master1 ~]# kubectl -n mcamel-system get pvc elasticsearch-data-mcamel-common-es-cluster-masters-es-data-1
    NAME                                                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                 AGE
    elasticsearch-data-mcamel-common-es-cluster-masters-es-data-1   Bound    pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c   35Gi       RWO            hwameistor-storage-lvm-hdd   17h
    
  3. Confirm that the PV is bound to the application to be migrated, that is, the demo application esdata-1

    [root@prod-master1 ~]# kubectl -n mcamel-system get pv pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c
    NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                                         STORAGECLASS                 REASON   AGE
    pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c   35Gi       RWO            Delete           Bound    mcamel-system/elasticsearch-data-mcamel-common-es-cluster-masters-es-data-1   hwameistor-storage-lvm-hdd            17h

The output above confirms that the PV to migrate is pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c.
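
The three lookups above can also be chained into one helper. The following is a minimal sketch rather than part of the original procedure: `pv_for_pod` is a hypothetical function name, and it assumes, as in the demo, that the data volume is the first entry in the Pod's `volumes` list.

```shell
# Print the name of the PV behind a Pod's first volume.
# Assumption: the first volume is the PVC-backed data volume, as in the demo.
pv_for_pod() {
  local ns="$1" pod="$2" claim
  # Step 1: Pod -> PVC name
  claim=$(kubectl -n "$ns" get pod "$pod" \
    -o jsonpath='{.spec.volumes[0].persistentVolumeClaim.claimName}') || return 1
  [ -n "$claim" ] || { echo "no PVC found on first volume of $pod" >&2; return 1; }
  # Step 2: PVC -> PV name
  kubectl -n "$ns" get pvc "$claim" -o jsonpath='{.spec.volumeName}'
}
```

For the demo application this would be called as `pv_for_pod mcamel-system mcamel-common-es-cluster-masters-es-data-1`. Checking the PV's claim reference (step 3 above) is still worth doing by hand before migrating.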

Stop the application to be migrated

  1. Check the applications that are currently running

    [root@prod-master1 ~]# kubectl -n mcamel-system get sts
    NAME                                       READY   AGE
    elastic-operator                           2/2     20h
    mcamel-common-es-cluster-masters-es-data   3/3     20h
    mcamel-common-kpanda-mysql-cluster-mysql   2/2     20h
    mcamel-common-minio-cluster-pool-0         1/1     20h
    mcamel-common-mysql-cluster-mysql          2/2     20h
    mysql-operator                             1/1     20h
    rfr-mcamel-common-redis-cluster            3/3     20h
    
  2. Stop the ES operator

    [root@prod-master1 ~]# kubectl -n mcamel-system scale --replicas=0 sts elastic-operator
    
  3. Stop ES

    [root@prod-master1 ~]# kubectl -n mcamel-system scale --replicas=0 sts mcamel-common-es-cluster-masters-es-data
    # --- wait about 3 mins ----
    
  4. Confirm that ES has stopped

    [root@prod-master1 ~]# kubectl -n mcamel-system get sts
    NAME                                       READY   AGE
    elastic-operator                           0/0     20h
    mcamel-common-es-cluster-masters-es-data   0/0     20h
    mcamel-common-kpanda-mysql-cluster-mysql   2/2     20h
    mcamel-common-minio-cluster-pool-0         1/1     20h
    mcamel-common-mysql-cluster-mysql          2/2     20h
    mysql-operator                             1/1     20h
    rfr-mcamel-common-redis-cluster            3/3     20h
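
Instead of waiting a fixed three minutes, the wait can be made explicit. The sketch below wraps `kubectl wait --for=delete`; `wait_pods_deleted` is a hypothetical helper name, and it assumes the ES Pods carry the same `elasticsearch.k8s.elastic.co/statefulset-name` label that the PVCs in the demo environment have.

```shell
# Block until every Pod matching a label selector is gone, or time out.
# Arguments: namespace, label selector, optional timeout (default 300s).
wait_pods_deleted() {
  local ns="$1" selector="$2" timeout="${3:-300s}"
  kubectl -n "$ns" wait --for=delete pod -l "$selector" --timeout="$timeout"
}
```

A possible call for this demo is `wait_pods_deleted mcamel-system elasticsearch.k8s.elastic.co/statefulset-name=mcamel-common-es-cluster-masters-es-data`. Depending on the kubectl version, the command may report that no matching resources were found if the Pods are already gone, which is also an acceptable outcome here.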
    


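Start the migration

For a detailed description of this process, see the volume migration guide in the official HwameiStor documentation.

  1. Create the migration task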

    [root@prod-master1 ~]# cat migrate.yaml
    apiVersion: hwameistor.io/v1alpha1
    kind: LocalVolumeMigrate
    metadata:
      namespace: hwameistor
      name: migrate-es-pvc # task name
    spec:
      sourceNode: prod-worker3 # source node; node names can be listed with `kubectl get ldn`
      targetNodesSuggested:
      - prod-master3
      volumeName: pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c # name of the PV (volume) to migrate
      migrateAllVols: false
    
  2. Apply the migration task

    [root@prod-master1 ~]# kubectl apply -f migrate.yaml
    

    This creates a pod in the hwameistor namespace that carries out the migration.

  3. Check the migration status

    [root@prod-master1 ~]# kubectl get localvolumemigrates.hwameistor.io  migrate-es-pvc -o yaml
    apiVersion: hwameistor.io/v1alpha1
    kind: LocalVolumeMigrate
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"hwameistor.io/v1alpha1","kind":"LocalVolumeMigrate","metadata":{"annotations":{},"name":"migrate-es-pvc"},"spec":{"migrateAllVols":false,"sourceNode":"prod-worker3","targetNodesSuggested":["prod-master3"],"volumeName":"pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c"}}
      creationTimestamp: "2023-04-30T12:24:17Z"
      generation: 1
      name: migrate-es-pvc
      resourceVersion: "1141529"
      uid: db3c0df0-57b5-42ef-9ec7-d8e6de487767
    spec:
      abort: false
      migrateAllVols: false
      sourceNode: prod-worker3
      targetNodesSuggested:
      - prod-master3
      volumeName: pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c
    status:
      message: 'waiting for the sync job to complete: migrate-es-pvc-datacopy-elasticsearch-data-mcamel'
      originalReplicaNumber: 1
      state: SyncReplica
      targetNode: prod-master3
    
  4. After the migration completes, check the result

    [root@prod-master1 ~]# kubectl get lvr
    NAME                                       CAPACITY      NODE           STATE   SYNCED   DEVICE                                                               AGE
    pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c   37580963840   prod-master3   Ready   true     /dev/LocalStorage_PoolHDD/pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c   129s
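
The check in step 4 can be scripted so that ES is only restarted once the replica is really in place. This is a rough sketch: `lvr_ready_on_node` is a hypothetical helper, and it assumes that `kubectl get lvr` prints the columns in the order shown above (NAME, CAPACITY, NODE, STATE, SYNCED) and that the replica is named after the volume, as in this demo.

```shell
# Succeed only if the volume's replica is on the given node, Ready and synced.
# Assumption: column order NAME CAPACITY NODE STATE SYNCED, as in the demo output.
lvr_ready_on_node() {
  local volume="$1" node="$2"
  kubectl get lvr --no-headers | awk -v v="$volume" -v n="$node" '
    $1 == v && $3 == n && $4 == "Ready" && $5 == "true" { ok = 1 }
    END { exit ok ? 0 : 1 }'
}

# Example polling loop (not executed here):
#   until lvr_ready_on_node pvc-7d4c45c9-49d6-4684-aca2-8b853d0c335c prod-master3; do
#     sleep 10
#   done
```

Polling the replica rather than the migration task keeps the sketch limited to output that appears in this article.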
    

Restore common-es

  1. Start the ES operator

    [root@prod-master1 ~]# kubectl -n mcamel-system scale --replicas=2 sts elastic-operator
    
  2. Start ES

    [root@prod-master1 ~]# kubectl -n mcamel-system scale --replicas=3 sts mcamel-common-es-cluster-masters-es-data
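
After scaling back up, it is worth confirming that esdata-1 actually came up on the target node. The helper below is a sketch under the assumption that the Pod has already been recreated by the StatefulSet; `pod_ready_on_node` is a hypothetical name.

```shell
# Succeed when the Pod becomes Ready and is scheduled on the expected node.
# Assumption: the Pod object already exists when this is called.
pod_ready_on_node() {
  local ns="$1" pod="$2" node="$3" actual
  kubectl -n "$ns" wait --for=condition=Ready "pod/$pod" --timeout=300s >/dev/null || return 1
  actual=$(kubectl -n "$ns" get pod "$pod" -o jsonpath='{.spec.nodeName}')
  [ "$actual" = "$node" ]
}
```

For this demo the call would be `pod_ready_on_node mcamel-system mcamel-common-es-cluster-masters-es-data-1 prod-master3`. A timeout here means the Pod did not become Ready within five minutes and its logs should be inspected.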

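Known issues

HwameiStor uses rclone to migrate PVs, and rclone may lose file permissions during the migration (see rclone#1202 and hwameistor#830). If the permissions are lost, ES fails to start and keeps restarting in a loop.

If you run into this problem, use the following steps to diagnose and fix it.

Confirm the problem

Use the following command to view the Pod logs: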

```bash
kubectl -n mcamel-system logs mcamel-common-es-cluster-masters-es-data-0 -c elasticsearch
```

If the log contains the following error, the problem is caused by lost permissions.

```log
java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
```
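
Searching for this message can be automated when several Pods need checking. The following is a small sketch; `has_node_lock_error` is a hypothetical helper that only looks for the error text quoted above in whatever log is piped into it.

```shell
# Read a log from stdin and succeed if it contains the node-lock error.
has_node_lock_error() {
  grep -q 'failed to obtain node locks'
}

# Example (not executed here): check all three data Pods of the demo cluster.
#   for i in 0 1 2; do
#     pod="mcamel-common-es-cluster-masters-es-data-$i"
#     kubectl -n mcamel-system logs "$pod" -c elasticsearch | has_node_lock_error \
#       && echo "$pod: node lock error found"
#   done
```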

Fix the problem

  1. Run the following command to edit the ES CR

    kubectl -n mcamel-system edit elasticsearches.elasticsearch.k8s.elastic.co mcamel-common-es-cluster-masters
    
  2. Add an init container to the ES Pod with the following content:

            - command:
              - sh
              - -c
              - chown -R elasticsearch:elasticsearch /usr/share/elasticsearch/data
              name: change-permission
              resources: {}
              securityContext:
                privileged: true
    

    The init container goes at the following position in the CR:

    spec:
      ...
      ...
      nodeSets:
      - config:
          node.store.allow_mmap: false
        count: 3
        name: data
        podTemplate:
          metadata: {}
          spec:
            ...
            ...
            initContainers:
            - command:
              - sh
              - -c
              - sysctl -w vm.max_map_count=262144
              name: sysctl
              resources: {}
              securityContext:
                privileged: true
            - command:
              - sh
              - -c
              - chown -R elasticsearch:elasticsearch /usr/share/elasticsearch/data
              name: change-permission
              resources: {}
              securityContext:
                privileged: true
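
Once the Pods have restarted with the extra init container, the ownership of the data directory can be checked directly. This is a sketch: `data_dir_owner` is a hypothetical helper, and it assumes the `elasticsearch` container image ships a GNU-style `stat` that accepts `-c`.

```shell
# Print "user:group" of the ES data directory inside a running Pod.
# Assumption: the image provides GNU stat (the -c format option).
data_dir_owner() {
  local ns="$1" pod="$2"
  kubectl -n "$ns" exec "$pod" -c elasticsearch -- \
    stat -c '%U:%G' /usr/share/elasticsearch/data
}
```

Given the `chown` in the init container, `data_dir_owner mcamel-system mcamel-common-es-cluster-masters-es-data-1` should report `elasticsearch:elasticsearch`; any other owner suggests the init container did not run.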
    
