Optimizing Failover Delay Sensitivity

Multicloud support enables automatic cross-cluster failover for applications, which keeps applications deployed across multiple clusters stable. How quickly failover reacts is determined mainly by metrics in the following two dimensions, which must be configured together to achieve the desired delay sensitivity.

Cluster dimension: the duration for marking a cluster as unhealthy, and the cluster eviction tolerance duration
Workload dimension: the cluster taint tolerance duration

Introduction to Failover Features

After enabling failover in DCE 5.0 Multicloud Management, the following configuration options are provided:

Parameter	Description	Field Name	Default Value
ClusterMonitorPeriod	Interval between cluster status checks	Check Interval	60s
ClusterMonitorGracePeriod	At runtime, if the cluster health status is not obtained within this duration, the cluster is marked as unhealthy	The runtime marks the duration of an unhealthy check	40s
ClusterStartupGracePeriod	At startup, if the cluster health status is not obtained within this duration, the cluster is marked as unhealthy	Mark health check duration at startup	600s
FailoverEvictionTimeout	After a cluster is marked as unhealthy, if this duration is exceeded the cluster is tainted for eviction and enters the eviction state	Eviction tolerance time	30s
ClusterTaintEvictionRetryFrequency	Maximum waiting duration after entering the graceful eviction queue; once it is exceeded, deletion happens immediately	Graceful ejection timeout duration	5s

Timeline for Workload Eviction

The timeline diagram can be summarized as follows. Suppose the cluster API is called every 10 seconds to record the health status of the cluster, and the cluster is considered healthy when all four recorded results are healthy. If the TCP connection between DCE and the cluster API server is then disconnected for 10-20 seconds, so that no health status can be obtained, the cluster is considered abnormal. If the cluster does not recover within the configured time, it is marked as unhealthy and tainted with NoSchedule. If the eviction tolerance duration is also exceeded, the cluster is tainted with NoExecute and its workloads are eventually evicted.
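The sequence above can be turned into rough arithmetic. The following Python sketch is a simplified model, not the product's actual scheduling logic: it assumes the three stages simply add up, and the names `FailoverSettings` and `failover_timeline` are hypothetical. The first two defaults are taken from the table above; the workload's taint tolerance has no documented default here, so it must be supplied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverSettings:
    """Durations in seconds. Field names are illustrative, not a real API."""

    # Workload dimension: how long the workload tolerates the NoExecute taint.
    # No default is documented, so the caller must choose a value.
    cluster_taint_toleration_s: int
    # Cluster dimension: time without a health status before the cluster
    # is marked unhealthy (ClusterMonitorGracePeriod, default 40s).
    cluster_monitor_grace_period_s: int = 40
    # Cluster dimension: time an unhealthy cluster is tolerated before it
    # is tainted for eviction (FailoverEvictionTimeout, default 30s).
    failover_eviction_timeout_s: int = 30


def failover_timeline(settings: FailoverSettings) -> dict:
    """Estimate seconds after the last successful health check for each stage.

    Assumption: the stages are strictly sequential and additive. Real timing
    also depends on check intervals and controller processing delays.
    """
    no_schedule = settings.cluster_monitor_grace_period_s
    no_execute = no_schedule + settings.failover_eviction_timeout_s
    evicted = no_execute + settings.cluster_taint_toleration_s
    return {
        "no_schedule": no_schedule,
        "no_execute": no_execute,
        "evicted": evicted,
    }


if __name__ == "__main__":
    # Example with a hypothetical 30-second workload toleration.
    print(failover_timeline(FailoverSettings(cluster_taint_toleration_s=30)))
```

Under this model, shortening any one of the three durations reduces the total delay, which is why the cluster and workload dimensions need to be tuned together.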

Optimization Configuration for Multicloud Instances

In a multicloud instance, go to Advanced Settings -> Failover. Use the eviction timeline described above as a guide when filling in the parameters.

Figure: Failover

Configuration Optimization for Multicloud Workloads

Optimizing the configuration of a multicloud workload mainly concerns its propagation policy (PP): set an appropriate cluster taint tolerance duration in the propagation policy.

Figure: Propagation Policies
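As an illustration of where the taint tolerance duration lives, the Python sketch below builds a propagation policy manifest as a plain dictionary. It assumes the multicloud instance exposes a Karmada-style `PropagationPolicy` API, in which `spec.placement.clusterTolerations` takes Kubernetes-style tolerations; this page does not confirm that, so treat the API group, taint keys, and the helper name `propagation_policy_with_toleration` as assumptions to verify against your environment.

```python
import json


def propagation_policy_with_toleration(
    name: str, namespace: str, deployment: str, toleration_seconds: int
) -> dict:
    """Build a PropagationPolicy manifest with a cluster taint toleration.

    Assumption: a Karmada-style API (policy.karmada.io/v1alpha1) where
    NoExecute taints on a cluster are tolerated for `tolerationSeconds`.
    """
    # Taint keys assumed to mark not-ready / unreachable member clusters.
    taint_keys = ["cluster.karmada.io/not-ready", "cluster.karmada.io/unreachable"]
    tolerations = [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            # Workload is evicted from the cluster after this many seconds.
            "tolerationSeconds": toleration_seconds,
        }
        for key in taint_keys
    ]
    return {
        "apiVersion": "policy.karmada.io/v1alpha1",
        "kind": "PropagationPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "resourceSelectors": [
                {"apiVersion": "apps/v1", "kind": "Deployment", "name": deployment}
            ],
            "placement": {"clusterTolerations": tolerations},
        },
    }


if __name__ == "__main__":
    # Hypothetical names; JSON output is also valid YAML.
    policy = propagation_policy_with_toleration("nginx-pp", "default", "nginx", 30)
    print(json.dumps(policy, indent=2))
```

A smaller `tolerationSeconds` makes the workload leave an unhealthy cluster sooner, at the cost of reacting to short network interruptions.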
