MySQL MGR Troubleshooting Guide
Common Commands
Get the root Password
In the namespace where the MySQL MGR cluster runs, look for the Secret resource whose name ends with -mgr-secret. The example below retrieves the Secret of the kpanda-mgr cluster:
kubectl get secrets/kpanda-mgr-mgr-secret -n mcamel-system --template={{.data.rootPassword}} | base64 -d
root123!
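With the password in hand, you can open a mysql client inside one of the MGR Pods. A minimal sketch, assuming the Pod is named kpanda-mgr-0 in the mcamel-system namespace and that the default container runs mysqld (add -c <container> otherwise):

# Store the root password retrieved above, then log in to an MGR Pod
ROOT_PASSWORD=$(kubectl get secrets/kpanda-mgr-mgr-secret -n mcamel-system --template={{.data.rootPassword}} | base64 -d)
kubectl exec -it kpanda-mgr-0 -n mcamel-system -- mysql -uroot -p"$ROOT_PASSWORD"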
Check Cluster Status
View it with MySQL Shell (mysqlsh):
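The JSON below is AdminAPI cluster status output. A sketch of producing it with mysqlsh, reusing the $ROOT_PASSWORD variable from above (connection details are assumptions, and mysqlsh is assumed to be available in the Pod):

kubectl exec kpanda-mgr-0 -n mcamel-system -- \
  mysqlsh --uri root@localhost:3306 --password="$ROOT_PASSWORD" --js \
  -e "print(dba.getCluster().status())"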
{
"clusterName": "kpanda_mgr",
"defaultReplicaSet": {
"name": "default",
"primary": "kpanda-mgr-2.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"kpanda-mgr-0.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306": {
"address": "kpanda-mgr-0.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
},
"kpanda-mgr-1.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306": {
"address": "kpanda-mgr-1.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
},
"kpanda-mgr-2.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306": {
"address": "kpanda-mgr-2.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "kpanda-mgr-2.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306"
}
Note
In a healthy cluster:
- Every node's status is ONLINE.
- Exactly one node has memberRole PRIMARY; all other nodes are SECONDARY.
Check with a SQL statement: SELECT * FROM performance_schema.replication_group_members\G
mysql> SELECT * FROM performance_schema.replication_group_members\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
MEMBER_ID: 6f464f4e-ba96-11ee-a028-a225dc125542
MEMBER_HOST: kpanda-mgr-2.kpanda-mgr-instances.mcamel-system.svc.cluster.local
MEMBER_PORT: 3306
MEMBER_STATE: ONLINE
MEMBER_ROLE: PRIMARY
MEMBER_VERSION: 8.0.31
MEMBER_COMMUNICATION_STACK: MySQL
*************************** 2. row ***************************
CHANNEL_NAME: group_replication_applier
MEMBER_ID: b3a53102-bfec-11ee-a821-0a8fb9f7d1ce
MEMBER_HOST: kpanda-mgr-0.kpanda-mgr-instances.mcamel-system.svc.cluster.local
MEMBER_PORT: 3306
MEMBER_STATE: ONLINE
MEMBER_ROLE: SECONDARY
MEMBER_VERSION: 8.0.31
MEMBER_COMMUNICATION_STACK: MySQL
*************************** 3. row ***************************
CHANNEL_NAME: group_replication_applier
MEMBER_ID: bdddd16f-bfec-11ee-a7a4-324c8edaca40
MEMBER_HOST: kpanda-mgr-1.kpanda-mgr-instances.mcamel-system.svc.cluster.local
MEMBER_PORT: 3306
MEMBER_STATE: ONLINE
MEMBER_ROLE: SECONDARY
MEMBER_VERSION: 8.0.31
MEMBER_COMMUNICATION_STACK: MySQL
3 rows in set (0.00 sec)
Check Member Stats
Check the member stats: SELECT * FROM performance_schema.replication_group_member_stats\G
mysql> SELECT * FROM performance_schema.replication_group_member_stats\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 17066729271025607:9
MEMBER_ID: 6f464f4e-ba96-11ee-a028-a225dc125542
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 4748638
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 1109
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 6f464f4e-ba96-11ee-a028-a225dc125542:1-10,
95201c7d-ba96-11ee-a018-bed74fb0bf8d:1-8,
b438e224-ba96-11ee-bc57-bed74fb0bf8d:1-12516339,
b439a7d3-ba96-11ee-bc57-bed74fb0bf8d:1-18
LAST_CONFLICT_FREE_TRANSACTION: b438e224-ba96-11ee-bc57-bed74fb0bf8d:12519298
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 6
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 4748638
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 2. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 17066729271025607:9
MEMBER_ID: b3a53102-bfec-11ee-a821-0a8fb9f7d1ce
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 4514132
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 1110
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 6f464f4e-ba96-11ee-a028-a225dc125542:1-10,
95201c7d-ba96-11ee-a018-bed74fb0bf8d:1-8,
b438e224-ba96-11ee-bc57-bed74fb0bf8d:1-12519027,
b439a7d3-ba96-11ee-bc57-bed74fb0bf8d:1-18
LAST_CONFLICT_FREE_TRANSACTION: b438e224-ba96-11ee-bc57-bed74fb0bf8d:12520590
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 4514129
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 3. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 17066729271025607:9
MEMBER_ID: bdddd16f-bfec-11ee-a7a4-324c8edaca40
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 4658713
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 1093
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 6f464f4e-ba96-11ee-a028-a225dc125542:1-10,
95201c7d-ba96-11ee-a018-bed74fb0bf8d:1-8,
b438e224-ba96-11ee-bc57-bed74fb0bf8d:1-12519027,
b439a7d3-ba96-11ee-bc57-bed74fb0bf8d:1-18
LAST_CONFLICT_FREE_TRANSACTION: b438e224-ba96-11ee-bc57-bed74fb0bf8d:12520335
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 4658715
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
3 rows in set (0.00 sec)
Designate a Member Role

Designate a specific node as the PRIMARY, either with a SQL statement or with the mysqlsh syntax, as sketched below.
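A minimal sketch of both options; the member UUID and instance addresses are taken from the kpanda-mgr example above, and mysqlsh is assumed to be available in the Pod, so adjust to your cluster:

# Option 1: SQL, promote a member by its MEMBER_ID (UUID) from replication_group_members
kubectl exec kpanda-mgr-2 -n mcamel-system -- mysql -uroot -p"$ROOT_PASSWORD" \
  -e "SELECT group_replication_set_as_primary('b3a53102-bfec-11ee-a821-0a8fb9f7d1ce');"

# Option 2: mysqlsh AdminAPI, promote a member by its address
kubectl exec kpanda-mgr-2 -n mcamel-system -- \
  mysqlsh --uri root@localhost:3306 --password="$ROOT_PASSWORD" --js \
  -e "dba.getCluster().setPrimaryInstance('kpanda-mgr-0.kpanda-mgr-instances.mcamel-system.svc.cluster.local:3306')"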
Common Failures
A SECONDARY node is not in the ONLINE state
{
"clusterName": "mgr0117",
"defaultReplicaSet": {
"name": "default",
"primary": "mgr0117-2.mgr0117-instances.m0103.svc.cluster.local:3306",
"ssl": "REQUIRED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
"topology": {
"mgr0117-0.mgr0117-instances.m0103.svc.cluster.local:3306": {
"address": "mgr0117-0.mgr0117-instances.m0103.svc.cluster.local:3306",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.0.31"
},
"mgr0117-1.mgr0117-instances.m0103.svc.cluster.local:3306": {
"address": "mgr0117-1.mgr0117-instances.m0103.svc.cluster.local:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
},
"mgr0117-2.mgr0117-instances.m0103.svc.cluster.local:3306": {
"address": "mgr0117-2.mgr0117-instances.m0103.svc.cluster.local:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "mgr0117-2.mgr0117-instances.m0103.svc.cluster.local:3306"
}
Here the address field of the problem member is mgr0117-0.mgr0117-instances.m0103.svc.cluster.local:3306. Exec into the mgr0117-0 Pod and restart group replication on that member, as sketched below.
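A sketch, assuming the Pod is named mgr0117-0 in the m0103 namespace and that $ROOT_PASSWORD holds this cluster's root password (retrieved from its -mgr-secret as shown earlier):

kubectl exec mgr0117-0 -n m0103 -- mysql -uroot -p"$ROOT_PASSWORD" \
  -e "STOP GROUP_REPLICATION; START GROUP_REPLICATION;"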
If the dataset is large, the node will stay in the RECOVERING state for quite a while.
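To track progress, you can poll the member state until it returns to ONLINE (same assumptions as above):

kubectl exec mgr0117-0 -n m0103 -- mysql -uroot -p"$ROOT_PASSWORD" \
  -e "SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;"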
No PRIMARY node; every member shows OFFLINE
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | MEMBER_COMMUNICATION_STACK |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+----------------------------+
| group_replication_applier | | | NULL | OFFLINE | | | |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+----------------------------+
1 row in set (0.00 sec)
In this case, try rebooting the cluster from MySQL Shell:
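A sketch using dba.rebootClusterFromCompleteOutage(), run against the node that was PRIMARY most recently (mgr0117-2 in this example); the Pod name, credentials, and availability of mysqlsh in the Pod are assumptions:

kubectl exec mgr0117-2 -n m0103 -- \
  mysqlsh --uri root@localhost:3306 --password="$ROOT_PASSWORD" --js \
  -e "dba.rebootClusterFromCompleteOutage()"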
If that still does not resolve the issue, log in to the former PRIMARY node via the mysql command line and bootstrap group replication on that node:
set global group_replication_bootstrap_group=on;
start group_replication;
set global group_replication_bootstrap_group=off;
Warning
On the remaining nodes, run only start group_replication;, one node at a time. Do not set group_replication_bootstrap_group=on again, otherwise you risk bootstrapping a second, split group.
An MGR Pod is stuck in the Terminating state
Check the following:
- The components of the Kubernetes cluster itself may be unhealthy; check their status, especially etcd.
- Whether the MGR Operator is healthy; check the Operator Pod's logs (see the sketch after this list).
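A sketch for locating the Operator and reading its logs; the grep pattern and placeholders are assumptions, since the Operator's Pod name depends on how it was installed:

kubectl get pods -A | grep -i mysql-operator    # locate the Operator Pod
kubectl logs -n <operator-namespace> <operator-pod> --tail=200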
If this is a test environment and you need to delete the Pod quickly, you can remove the finalizers from the affected Pod:
Delete the finalizers entries from the Pod's metadata in its YAML, or patch them away directly as sketched below.
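A sketch that clears the finalizers with kubectl patch; the Pod name and namespace are examples, and this should only be done in test environments:

kubectl patch pod kpanda-mgr-0 -n mcamel-system --type merge -p '{"metadata":{"finalizers":null}}'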