1. Symptom
While deploying a k8s cluster, after initializing the k8s master node, we tried to add the current worker node to the cluster with the following kubeadm join command:
kubeadm join xxx:6443 --token xxx.xxx \
> --discovery-token-ca-cert-hash sha256:xxxx
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
According to the log output, the node has joined the cluster, and kubectl get nodes shows it in the Ready state. However, kubectl get pods -A shows a pod stuck in the CrashLoopBackOff state:
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system kube-flannel-ds-j69g6 0/1 CrashLoopBackOff 3 18m
Querying the pod log with kubectl logs reveals a pod cidr not assigned error:
$ kubectl logs kube-flannel-ds-j69g6 -n kube-system
I0218 06:23:21.796296 1 main.go:518] Determining IP address of default interface
I0218 06:23:21.796512 1 main.go:531] Using interface with name eth0 and address 10.250.41.77
I0218 06:23:21.796525 1 main.go:548] Defaulting external address to interface address (10.250.41.77)
W0218 06:23:21.796537 1 client_config.go:517] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0218 06:23:21.906396 1 kube.go:119] Waiting 10m0s for node controller to sync
I0218 06:23:21.906791 1 kube.go:306] Starting kube subnet manager
I0218 06:23:22.906882 1 kube.go:126] Node controller sync successful
I0218 06:23:22.906912 1 main.go:246] Created subnet manager: Kubernetes Subnet Manager - worker-0001
I0218 06:23:22.906918 1 main.go:249] Installing signal handlers
I0218 06:23:22.906963 1 main.go:390] Found network config - Backend type: vxlan
I0218 06:23:22.907016 1 vxlan.go:121] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
E0218 06:23:22.907246 1 main.go:291] Error registering network: failed to acquire lease: node "worker-0001" pod cidr not assigned
I0218 06:23:22.907272 1 main.go:370] Stopping shutdownHandler...
The affected k8s version is 1.20.0, with flannel configured as the CNI network plugin.
2. Quick fix: manually assign the podCIDR
This problem occurs because the flannel component on the worker node cannot obtain a podCIDR definition. A quick workaround is to assign a podCIDR to the affected worker node with the following command:
# Note: the SUBNET must be different for each worker node; otherwise pod-to-pod traffic in k8s will break.
kubectl patch node <NODE_NAME> -p '{"spec":{"podCIDR":"<SUBNET>"}}'
Then check the node information again:
# The subnet below is a sub-range of the configured cluster-cidr=172.18.0.0/16
$ kubectl patch node worker-0001 -p '{"spec":{"podCIDR":"172.18.1.0/24"}}'
$ kubectl describe node worker-0001
......
PodCIDR: 172.18.1.0/24
......
After a while, check the kube-flannel-ds-j69g6 pod again: it now starts normally and its status is Running.
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system kube-flannel-ds-j69g6 1/1 Running 3 18m
This workaround manually assigns the allocatable pod IP range for a worker node, but it is not the best solution. It is better to find the root cause and let flannel configure each worker node's podCIDR automatically.
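When several workers need the manual fix, the patch commands can be generated in a loop. Below is a minimal sketch using hypothetical node names; it only prints the commands so the subnet plan can be reviewed before anything is changed:

```shell
# Hypothetical worker names; replace with your own `kubectl get nodes` output.
NODES="worker-0001 worker-0002 worker-0003"

# Give each worker a distinct /24 inside cluster-cidr=172.18.0.0/16,
# since overlapping podCIDRs break pod-to-pod traffic.
i=1
cmds=""
for node in $NODES; do
  cmds="${cmds}kubectl patch node $node -p '{\"spec\":{\"podCIDR\":\"172.18.$i.0/24\"}}'
"
  i=$((i + 1))
done

# Review the generated commands first; pipe them to sh to actually apply.
printf '%s' "$cmds"
```

After checking that each node gets a unique subnet, the printed commands can be executed by appending `| sh`.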
3. Root cause and automatic podCIDR allocation
When initializing the master node with kubeadm init during k8s master deployment, the flannel network plugin requires that the init command set the podCIDR via the following flag:
- --pod-network-cidr=172.18.0.0/16
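For reference, a complete init invocation with this flag might look as follows; the advertise address is a placeholder, not a value taken from the environment above:

```shell
# <MASTER_IP> is a placeholder; use the master node's real IP.
# --pod-network-cidr must cover all the per-node subnets flannel will allocate.
kubeadm init \
  --apiserver-advertise-address=<MASTER_IP> \
  --pod-network-cidr=172.18.0.0/16
```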
After initialization, the following settings can be seen in the config file /etc/kubernetes/manifests/kube-controller-manager.yaml:
- --allocate-node-cidrs=true
- --cluster-cidr=172.18.0.0/16
Also, when installing the flannel CNI plugin with kubectl apply -f kube-flannel.yml, the Network parameter in the kube-flannel config file must match pod-network-cidr:
net-conf.json: |
{
"Network": "172.18.0.0/16",
"Backend": {
"Type": "vxlan"
}
}
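A quick way to verify that the two values line up, assuming kube-flannel.yml is in the current directory (both commands should print the same CIDR, 172.18.0.0/16 here):

```shell
# The cluster-cidr the controller manager was started with:
grep -- '--cluster-cidr' /etc/kubernetes/manifests/kube-controller-manager.yaml
# The Network value flannel will be installed with:
grep '"Network"' kube-flannel.yml
```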
If either of the following happened during the initialization steps above,
- one or more master hosts were initialized without the podCIDR parameter, i.e. the --pod-network-cidr flag was missing from the kubeadm init command;
- the Network value used when installing the flannel plugin via kubectl apply -f kube-flannel.yml does not match --pod-network-cidr;
then flannel cannot automatically recognize and allocate pod CIDRs, and the pod cidr not assigned error appears.
If this problem occurs, carefully check the k8s cluster configuration described above. Once the configuration is correct, flannel on each node will configure that node's podCIDR automatically. Below are the automatically allocated subnets as queried on three nodes:
# master-0001
# 172.18.0.1/24
$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=172.18.0.0/16
FLANNEL_SUBNET=172.18.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
# worker-0002
# 172.18.1.1/24
$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=172.18.0.0/16
FLANNEL_SUBNET=172.18.1.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
# worker-0003
# 172.18.2.1/24
$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=172.18.0.0/16
FLANNEL_SUBNET=172.18.2.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
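Finally, the allocations can be confirmed cluster-wide straight from the API; an empty PODCIDR column indicates a node that was never assigned one:

```shell
# List every node with the podCIDR the controller manager allocated to it.
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,PODCIDR:.spec.podCIDR'
```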