EgressGateway Failover¶
Controller Failover¶
EgressGateway control plane failover is managed by specifying the controller.replicas
parameter during installation. If one of the multiple controller replicas fails, the system automatically selects another replica as the primary controller to ensure continuous service availability.
Datapath Failover¶
For datapath failover, you can designate a set of nodes as Egress Nodes when creating an EgressGateway
by using nodeSelector
. The Egress IP will be bound to one of these nodes. If a node fails or its Egress Agent encounters an issue, the Egress IP will automatically transfer to another available node, ensuring service continuity and reliability.
apiVersion: egressgateway.spidernet.io/v1beta1
kind: EgressGateway
metadata:
name: egw1
spec:
clusterDefault: true
ippools:
ipv4:
- 10.6.1.55
- 10.6.1.56
ipv4DefaultEIP: 10.6.1.56
ipv6:
- fd00::55
- fd00::56
ipv6DefaultEIP: fd00::55
nodeSelector:
selector:
matchLabels:
egress: "true"
status:
nodeList:
- name: node1
status: Ready
eips:
- ipv4: 10.6.1.56
ipv6: fd00::55
policies:
- name: policy1
namespace: default
- name: node2
status: Ready
In the above EgressGateway
definition, by setting egress: "true"
, both node1
and node2
are designated as Egress Nodes. node1
is the active node, and its assigned Egress IPs can be viewed in the status
field. If node1
fails, node2
will take over as the failover node.
Configuring Failover Timing¶
You can fine-tune the status check and Egress IP failover timing using Helm values:
feature.tunnelMonitorPeriod
: The interval (in seconds) at which the Egress Controller checks the last update status of theEgressTunnel
. Default:5
.feature.tunnelUpdatePeriod
: The interval (in seconds) at which the Egress Agent updates theEgressTunnel
status. Default:5
.feature.eipEvictionTimeout
: If the last update time of anEgressTunnel
exceeds this timeout, the Egress IP will be moved to another available node. Default:5
seconds.
apiVersion: egressgateway.spidernet.io/v1beta1
kind: EgressTunnel
metadata:
name: workstation1
spec: {}
status:
lastHeartbeatTime: "2023-11-27T12:04:56Z"
mark: "0x26d9b723"
phase: Ready
The EgressGateway
Agent periodically updates the status.lastHeartbeatTime
field based on feature.tunnelUpdatePeriod
, while the EgressGateway
Controller periodically checks all EgressTunnel
instances based on feature.tunnelMonitorPeriod
. If the sum of status.lastHeartbeatTime
and feature.eipEvictionTimeout
exceeds the current time, the Egress IP will be reassigned to another available node.
Troubleshooting Datapath Failover Issues¶
If issues arise with failover, follow these steps:
-
Review the
values.yaml
file used for deploying EgressGateway to ensure that failover-related parameters are correctly set. Specifically, make sure that theeipEvictionTimeout
value is greater than the sum oftunnelMonitorPeriod
andtunnelUpdatePeriod
. -
Run the following command to monitor the status of
EgressTunnel
instances. Check whether the selected node is in aHeartbeatTimeout
state and verify if other nodes are in aReady
state. -
To check if there was an IP switch due to
HeartbeatTimeout
, search for logs containingupdate tunnel status to HeartbeatTimeout
in the controller container.