MySQL MGR Troubleshooting Guide

Common Commands

Get the root Password

In the namespace where the MySQL MGR cluster is deployed, look for the Secret resource whose name ends in -mgr-secret. The example below retrieves the Secret of the kpanda-mgr cluster:

kubectl get secrets/kpanda-mgr-mgr-secret -n mcamel-system --template={{.data.rootPassword}} | base64 -d
root123!
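
If you do not know the exact Secret name, list the candidates first. A minimal sketch, assuming the mcamel-system namespace used throughout this guide:

# list Secrets whose names contain -mgr-secret ("--" keeps grep from treating the pattern as an option)
kubectl get secrets -n mcamel-system | grep -- '-mgr-secret'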

Check the Cluster Status

Check via MySQL Shell (mysqlsh):

mysqlsh -uroot -pPassword -- cluster status
{
    "clusterName": "kpanda_mgr", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "kpanda-mgr-2.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306", 
        "ssl": "REQUIRED", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "kpanda-mgr-0.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306": {
                "address": "kpanda-mgr-0.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }, 
            "kpanda-mgr-1.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306": {
                "address": "kpanda-mgr-1.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }, 
            "kpanda-mgr-2.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306": {
                "address": "kpanda-mgr-2.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "kpanda-mgr-2.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306"
}
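
If mysqlsh is not installed locally, the same check can be run from inside a cluster pod. A sketch, assuming the pod kpanda-mgr-0 and reusing the password lookup shown above (add -c mysql if kubectl asks you to pick a container):

# run cluster.status() inside the pod; the password is resolved locally before exec
kubectl exec -n mcamel-system -it kpanda-mgr-0 -- \
  mysqlsh -uroot -p"$(kubectl get secrets/kpanda-mgr-mgr-secret -n mcamel-system --template={{.data.rootPassword}} | base64 -d)" \
  -- cluster status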

Note

When the cluster is healthy:

  • Every node's status is ONLINE.
  • Exactly one node's memberRole is PRIMARY; all other nodes are SECONDARY.

Check with a SQL statement: SELECT * FROM performance_schema.replication_group_members\G

mysql> SELECT * FROM performance_schema.replication_group_members\G
*************************** 1. row ***************************
              CHANNEL_NAME: group_replication_applier
                 MEMBER_ID: 6f464f4e-ba96-11ee-a028-a225dc125542
               MEMBER_HOST: kpanda-mgr-2.kpanda-mgr-instances.mcamel-system.svc.cluster.local
               MEMBER_PORT: 3306
              MEMBER_STATE: ONLINE
               MEMBER_ROLE: PRIMARY
            MEMBER_VERSION: 8.0.31
MEMBER_COMMUNICATION_STACK: MySQL
*************************** 2. row ***************************
              CHANNEL_NAME: group_replication_applier
                 MEMBER_ID: b3a53102-bfec-11ee-a821-0a8fb9f7d1ce
               MEMBER_HOST: kpanda-mgr-0.kpanda-mgr-instances.mcamel-system.svc.cluster.local
               MEMBER_PORT: 3306
              MEMBER_STATE: ONLINE
               MEMBER_ROLE: SECONDARY
            MEMBER_VERSION: 8.0.31
MEMBER_COMMUNICATION_STACK: MySQL
*************************** 3. row ***************************
              CHANNEL_NAME: group_replication_applier
                 MEMBER_ID: bdddd16f-bfec-11ee-a7a4-324c8edaca40
               MEMBER_HOST: kpanda-mgr-1.kpanda-mgr-instances.mcamel-system.svc.cluster.local
               MEMBER_PORT: 3306
              MEMBER_STATE: ONLINE
               MEMBER_ROLE: SECONDARY
            MEMBER_VERSION: 8.0.31
MEMBER_COMMUNICATION_STACK: MySQL
3 rows in set (0.00 sec)
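
To spot unhealthy members without reading the full output, filter on MEMBER_STATE. A minimal sketch against the same table; an empty result set means every member is ONLINE:

mysql> SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members WHERE MEMBER_STATE <> 'ONLINE';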

Check Member Statistics

Check member statistics: SELECT * FROM performance_schema.replication_group_member_stats\G

mysql> SELECT * FROM performance_schema.replication_group_member_stats\G
*************************** 1. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 17066729271025607:9
                                 MEMBER_ID: 6f464f4e-ba96-11ee-a028-a225dc125542
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 4748638
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 1109
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: 6f464f4e-ba96-11ee-a028-a225dc125542:1-10,
95201c7d-ba96-11ee-a018-bed74fb0bf8d:1-8,
b438e224-ba96-11ee-bc57-bed74fb0bf8d:1-12516339,
b439a7d3-ba96-11ee-bc57-bed74fb0bf8d:1-18
            LAST_CONFLICT_FREE_TRANSACTION: b438e224-ba96-11ee-bc57-bed74fb0bf8d:12519298
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 6
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 4748638
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 2. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 17066729271025607:9
                                 MEMBER_ID: b3a53102-bfec-11ee-a821-0a8fb9f7d1ce
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 4514132
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 1110
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: 6f464f4e-ba96-11ee-a028-a225dc125542:1-10,
95201c7d-ba96-11ee-a018-bed74fb0bf8d:1-8,
b438e224-ba96-11ee-bc57-bed74fb0bf8d:1-12519027,
b439a7d3-ba96-11ee-bc57-bed74fb0bf8d:1-18
            LAST_CONFLICT_FREE_TRANSACTION: b438e224-ba96-11ee-bc57-bed74fb0bf8d:12520590
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 4514129
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 3. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 17066729271025607:9
                                 MEMBER_ID: bdddd16f-bfec-11ee-a7a4-324c8edaca40
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 4658713
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 1093
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: 6f464f4e-ba96-11ee-a028-a225dc125542:1-10,
95201c7d-ba96-11ee-a018-bed74fb0bf8d:1-8,
b438e224-ba96-11ee-bc57-bed74fb0bf8d:1-12519027,
b439a7d3-ba96-11ee-bc57-bed74fb0bf8d:1-18
            LAST_CONFLICT_FREE_TRANSACTION: b438e224-ba96-11ee-bc57-bed74fb0bf8d:12520335
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 4658715
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
3 rows in set (0.00 sec)
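
When diagnosing replication lag, the queue counters are the key columns: a persistently non-zero COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE means a member is receiving transactions faster than it can apply them. A focused query over the same table:

mysql> SELECT MEMBER_ID, COUNT_TRANSACTIONS_IN_QUEUE, COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE FROM performance_schema.replication_group_member_stats;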

Assign Member Roles

  1. Promote a given node to PRIMARY.

    select group_replication_set_as_primary('4697c302-3e52-11ed-8e61-0050568a658a');
    
  2. mysqlsh syntax (the switch can be verified with the query shown after this list)

    JS > var c=dba.getCluster()
    JS > c.status()
    JS > c.setPrimaryInstance('172.30.71.128:3306')
    
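After either method, confirm the switch took effect; the promoted node should now report MEMBER_ROLE = PRIMARY:

mysql> SELECT MEMBER_HOST, MEMBER_ROLE FROM performance_schema.replication_group_members;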

Common Failures

A SECONDARY node is not ONLINE

{
    "clusterName": "mgr0117",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mgr0117-2.mgr0117-instances.m0103.svc.cluster.local:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE_PARTIAL",
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
        "topology": {
            "mgr0117-0.mgr0117-instances.m0103.svc.cluster.local:3306": {
                "address": "mgr0117-0.mgr0117-instances.m0103.svc.cluster.local:3306",
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ],
                "memberRole": "SECONDARY",
                "memberState": "OFFLINE",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.31"
            },
            "mgr0117-1.mgr0117-instances.m0103.svc.cluster.local:3306": {
                "address": "mgr0117-1.mgr0117-instances.m0103.svc.cluster.local:3306",
                "memberRole": "SECONDARY",
                "mode": "R/O",
                "readReplicas": {},
                "replicationLag": "applier_queue_applied",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.31"
            },
            "mgr0117-2.mgr0117-instances.m0103.svc.cluster.local:3306": {
                "address": "mgr0117-2.mgr0117-instances.m0103.svc.cluster.local:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": "applier_queue_applied",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.31"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "mgr0117-2.mgr0117-instances.m0103.svc.cluster.local:3306"
}

The address field of the failed member is mgr0117-0.mgr0117-instances.m0103.svc.cluster.local:3306, so enter the mgr0117-0 pod and run:

mysql> start group_replication;
Query OK, 0 rows affected (5.82 sec)

If the data volume is large, the node will stay in the RECOVERING state for quite a while. You can monitor the transition with the poll shown below.
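
A minimal polling sketch, assuming the m0103 namespace and mgr0117-0 pod from this example, and that ROOT_PWD holds the root password obtained earlier (add -c mysql if kubectl asks you to pick a container):

# check the member state; repeat until mgr0117-0 reports ONLINE
kubectl exec -n m0103 mgr0117-0 -- \
  mysql -uroot -p"$ROOT_PWD" -e \
  "SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;"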

No PRIMARY node and all nodes show OFFLINE

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| CHANNEL_NAME              | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | MEMBER_COMMUNICATION_STACK |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| group_replication_applier |           |             |        NULL | OFFLINE      |             |                |                            |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+----------------------------+
1 row in set (0.00 sec)

At this point, try rebooting the cluster from MySQL Shell:

dba.rebootClusterFromCompleteOutage()

If that still does not resolve the issue, log in to the node that was previously the PRIMARY from the command line and bootstrap group replication on it:

set global group_replication_bootstrap_group=on;
start group_replication;
set global group_replication_bootstrap_group=off;

Warning

On the remaining nodes, run only start group_replication;. Never set group_replication_bootstrap_group=on on more than one node, otherwise each node bootstraps its own group and the cluster splits.
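
If it is unclear which node was the last PRIMARY, bootstrap from the member with the most advanced executed GTID set; compare the value on every node and pick the one whose set is a superset of the others:

mysql> SELECT @@GLOBAL.gtid_executed;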

An MGR Pod is stuck in Terminating

Check the following:

  1. The Kubernetes cluster's components may be unhealthy; check their status, especially etcd.
  2. The MGR Operator may be unhealthy; check the Operator Pod's logs.

In a test environment where the Pod must be removed quickly, you can remove the pod's finalizers:

kubectl edit pod <your-pod-name>

Delete the following lines from the Pod's YAML:

finalizers:
- mysql.oracle.com/membership
- kopf.zalando.org/KopfFinalizerMarker
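
Equivalently, the finalizers can be cleared non-interactively with a patch; a sketch, assuming the pod lives in mcamel-system:

# strip all finalizers so the Pod can be garbage-collected
kubectl patch pod <your-pod-name> -n mcamel-system \
  --type merge -p '{"metadata":{"finalizers":null}}'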
