First published in SYSZUX
Installing Kubernetes 1.11 with kubeadm

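Background

Kubernetes 1.11 was released recently. Compared with gemfield's earlier article, "Creating a K8s Cluster with kubeadm", 1.11 brings some changes (for example CoreDNS and the NVIDIA device plugin). gemfield originally meant to update the earlier article, but a bug in Zhihu's system made it impossible to edit it again, so this new article was written instead.

Many of the steps for installing 1.11 are the same as in "Creating a K8s Cluster with kubeadm". gemfield will not repeat those details here, but each step is still listed. In other words, if a step below has no detailed instructions, refer to gemfield's earlier column article, "Creating a K8s Cluster with kubeadm".

Preparation on every machine

Note: on every machine!

0. If K8s was installed before, uninstall it: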

#drain node
gemfield@master:~$ kubectl drain ml --delete-local-data --force --ignore-daemonsets
node/ml cordoned
WARNING: Ignoring DaemonSet-managed pods: calico-node-tcdqs, kube-proxy-kbv6z, nvidia-device-plugin-daemonset-pl92p; Deleting pods with local storage: kubernetes-dashboard-6948bdb78-m4fkr
pod/coredns-78fcdf6894-jkfxn evicted
pod/gemfield-cuda-786f8b7c8c-8knj2 evicted
pod/gemfield-nginx-77f7597875-kxl78 evicted
pod/default-http-backend-846b65fb5f-zbnbz evicted
pod/gemfield-ubuntu-6cc5cc5fcd-q27g2 evicted
pod/kubernetes-dashboard-6948bdb78-m4fkr evicted
pod/nginx-ingress-controller-d658896cd-shngg evicted
pod/hello-world-86cddf59d5-4lppl evicted
pod/curl-87b54756-cpd66 evicted
pod/hello-world-86cddf59d5-wz7qb evicted
pod/hello-world-86cddf59d5-d9wxm evicted
pod/hello-world-86cddf59d5-4qlmp evicted
pod/hello-world-86cddf59d5-fdmzv evicted

#delete node
gemfield@master:~$ kubectl delete node ml
node "ml" deleted

#run reset on every node
gemfield@ML:~$ sudo kubeadm reset
[sudo] password for gemfield: 
[reset] WARNING: changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] are you sure you want to proceed? [y/N]: y
[preflight] running pre-flight checks
[reset] stopping the kubelet service
[reset] unmounting mounted directories in "/var/lib/kubelet"
[reset] removing kubernetes-managed containers
[reset] cleaning up running containers using crictl with socket /var/run/dockershim.sock
[reset] failed to list running pods using crictl: exit status 1. Trying to use docker instead[reset] no etcd manifest found in "/etc/kubernetes/manifests/etcd.yaml". Assuming external etcd
[reset] deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes]
[reset] deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

#repeat the three operations above for every node:
......

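1. Change the hostname to the name you want.

2. Disable the swap partition.

3. Check that the machine's MAC address and product_uuid can be detected.

4. Make sure the ports K8s needs are not already in use.

5. Install Docker.

6. If the machine is to become a GPU node, install nvidia-docker2 and set Docker's default runtime to the nvidia runtime that it provides.

7. The cgroup driver

When Docker is used, kubeadm automatically detects the cgroup driver for the kubelet and, at runtime, records the value in the file /var/lib/kubelet/kubeadm-flags.env: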

gemfield@master:~$ cat /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --cni-bin-dir=/opt/cni/bin --cni-conf-dir=/etc/cni/net.d --network-plugin=cni --resolv-conf=/run/systemd/resolve/resolv.conf

If you use a different CRI (note: you only need to do this when the cgroup driver of your CRI is not cgroupfs), you must change the cgroup-driver value in the file /etc/default/kubelet:

KUBELET_KUBEADM_EXTRA_ARGS=--cgroup-driver=<value>

The /etc/default/kubelet file is used by the following commands:

kubeadm init

kubeadm join

If you modify /etc/default/kubelet, you must restart the kubelet service:

systemctl daemon-reload
systemctl restart kubelet
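
Before editing /etc/default/kubelet, it can help to confirm that the two drivers really differ. The following is only a sketch: the function name kubelet_cgroup_driver is made up for illustration, and it assumes the kubeadm-flags.env format shown above.

```shell
# kubelet_cgroup_driver [FILE]: print the --cgroup-driver value recorded by kubeadm.
# FILE defaults to the path shown above; the function name is hypothetical.
kubelet_cgroup_driver() {
  sed -n 's/.*--cgroup-driver=\([^ "]*\).*/\1/p' "${1:-/var/lib/kubelet/kubeadm-flags.env}"
}

# Usage on a node (Docker reports its own driver through `docker info`):
#   docker info --format '{{.CgroupDriver}}'
#   kubelet_cgroup_driver
# The two values should be identical.
```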


8. Install kubeadm, kubelet, and kubectl.
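
Steps 2 to 4 above can be spot-checked from a shell. The helpers below are a minimal sketch, not the procedure from the earlier article: the function names are hypothetical, and ports 6443 and 10250 are used only as examples of ports that K8s needs.

```shell
# swap_active [FILE]: succeed if FILE (default /proc/swaps) lists an active swap area.
swap_active() {
  # The first line of /proc/swaps is a header; every further line is a swap area.
  [ "$(tail -n +2 "${1:-/proc/swaps}" | grep -c .)" -gt 0 ]
}

# port_listed PORT: read `ss -ltn` output on stdin, succeed if PORT has a listener.
port_listed() {
  awk -v p="$1" 'NR > 1 { n = split($4, a, ":"); if (a[n] == p) found = 1 }
                 END { exit !found }'
}

# Usage on a node:
#   swap_active && sudo swapoff -a              # step 2 (lasts until the next reboot)
#   sudo cat /sys/class/dmi/id/product_uuid     # step 3 (MAC addresses: ip link)
#   ss -ltn | port_listed 6443 && echo "port 6443 is already in use"    # step 4
```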


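Initializing the master node

Now it is time to pick one machine as the master of your K8s cluster.

1. Use kubeadm init to initialize the master node

Note that you may need to set an HTTP proxy, because kubeadm init needs to access dl.k8s.io to fetch package information. You may also need to configure a proxy for the Docker daemon, because kubeadm init pulls some images from k8s.gcr.io (Google Cloud's container registry). You also need to set the no_proxy environment variable to the master's IP. Otherwise you will get the error: "Unable to update cni config: No networks found in /etc/cni/net.d".

Pay attention to the --pod-network-cidr=172.16.0.0/16 parameter: the CIDR you choose must not conflict with your local network.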
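
One way to check for such a conflict before running kubeadm init is to compare the two ranges directly. This is a minimal sketch in POSIX shell: ip_to_int and cidr_overlap are hypothetical helper names, and the LAN range 192.168.1.0/24 in the usage line is an assumption, so substitute your own.

```shell
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() (
  IFS=.            # split the dotted quad on "."; the subshell keeps IFS local
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# cidr_overlap CIDR1 CIDR2: succeed if the two IPv4 ranges overlap.
# Two ranges overlap exactly when they agree on the shorter of the two prefixes.
cidr_overlap() (
  ip1=${1%/*}; len1=${1#*/}
  ip2=${2%/*}; len2=${2#*/}
  len=$len1
  [ "$len2" -lt "$len" ] && len=$len2
  if [ "$len" -eq 0 ]; then
    mask=0
  else
    mask=$(( (4294967295 << (32 - $len)) & 4294967295 ))
  fi
  [ $(( $(ip_to_int "$ip1") & $mask )) -eq $(( $(ip_to_int "$ip2") & $mask )) ]
)

# Usage (pod CIDR from the text, LAN range assumed):
#   cidr_overlap 172.16.0.0/16 192.168.1.0/24 && echo "conflict: pick another pod CIDR"
```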

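2. Remove http_proxy from the Kubernetes configuration files:

These ENV variables were inherited from the current session (the proxy was set in the earlier step to get past the firewall). Delete these values: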

root@master:~# find /etc/kubernetes/ -type f -exec grep -n 17030 {} \+
/etc/kubernetes/manifests/kube-controller-manager.yaml:30:      value: http://localhost:17030
/etc/kubernetes/manifests/kube-controller-manager.yaml:32:      value: http://localhost:17030
/etc/kubernetes/manifests/kube-apiserver.yaml:45:      value: http://localhost:17030
/etc/kubernetes/manifests/kube-apiserver.yaml:47:      value: http://localhost:17030
/etc/kubernetes/manifests/kube-scheduler.yaml:21:      value: http://localhost:17030
/etc/kubernetes/manifests/kube-scheduler.yaml:23:      value: http://localhost:17030

Then restart the kubelet service:

gemfield@master:~$ sudo systemctl restart kubelet.service
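
The manual deletion above could also be scripted. The sketch below assumes that each proxy entry in a manifest is a "- name: ..._proxy" line directly followed by its "value:" line (matching the line numbers in the grep output), and strip_proxy_env is a hypothetical helper name. It prints the cleaned manifest to stdout and does not modify the file.

```shell
# strip_proxy_env FILE: print FILE without http_proxy/https_proxy/no_proxy env
# entries (upper or lower case). Assumes each entry is a name line followed
# directly by its value line.
strip_proxy_env() {
  awk '
    skip && /^[ \t]*value:/ { skip = 0; next }   # drop the value of a removed entry
    { skip = 0 }
    tolower($0) ~ /^[ \t]*- name: (http|https|no)_proxy[ \t]*$/ { skip = 1; next }
    { print }
  ' "$1"
}

# Usage: write the result outside /etc/kubernetes/manifests (the kubelet treats
# every file in that directory as a manifest), review it, then copy it back:
#   strip_proxy_env /etc/kubernetes/manifests/kube-apiserver.yaml > /tmp/kube-apiserver.yaml
#   diff /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml
#   sudo cp /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
```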

3. Do some configuration as a non-root user

This makes it convenient to run kubectl commands.
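
This configuration is the copy-and-chown sequence that kubeadm init prints when it finishes. Wrapped as a function it might look like the sketch below; install_kubeconfig and the SUDO override are illustrative additions, not part of kubeadm.

```shell
# install_kubeconfig [SRC] [DEST]: copy the admin kubeconfig to the current user
# and make that user the owner. Hypothetical helper; set SUDO to another wrapper
# (for example SUDO=env) when root is not needed.
install_kubeconfig() (
  src="${1:-/etc/kubernetes/admin.conf}"
  dest="${2:-$HOME/.kube/config}"
  mkdir -p "$(dirname "$dest")"
  ${SUDO:-sudo} cp "$src" "$dest"
  ${SUDO:-sudo} chown "$(id -u):$(id -g)" "$dest"
)

# Usage on the master, as the non-root user:
#   install_kubeconfig
#   kubectl get nodes
```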

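4. Install the Calico pod network

Remember to change the value of CALICO_IPV4POOL_CIDR in the official calico.yaml so that it does not conflict with the LAN segment of the host (gemfield changed the original 192.168.0.0/16 to 172.16.0.0/16).

5. Wait until all pods are in the Running state.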
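
To see at a glance which pods are still not up, the kubectl listing can be filtered. This is a small sketch: pods_not_ready is a hypothetical name, and it assumes the column order NAMESPACE, NAME, READY, STATUS that kubectl prints with --all-namespaces.

```shell
# pods_not_ready: read `kubectl get pods --all-namespaces --no-headers` on stdin
# and print every pod whose STATUS column is neither Running nor Completed.
pods_not_ready() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage: empty output means every pod is up.
#   kubectl get pods --all-namespaces --no-headers | pods_not_ready
```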

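Adding worker nodes

1. Add a worker node

Log in to the worker node you want to add to the cluster and run kubeadm join.

2. How does a node become a GPU node?

GPUs are what people doing machine learning care about most. In the earlier steps we already saw this instruction: if the machine is to become a GPU node, install nvidia-docker2 and set Docker's default runtime to the nvidia runtime that it provides.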
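
Setting the default runtime is done in Docker's daemon configuration. The sketch below writes the configuration given in the NVIDIA device plugin README. The helper name write_nvidia_daemon_json is hypothetical, and it overwrites its target, so merge by hand if you already have a daemon.json.

```shell
# write_nvidia_daemon_json FILE: write a Docker daemon config that makes "nvidia"
# the default runtime. Hypothetical helper; FILE is overwritten.
write_nvidia_daemon_json() {
  cat > "$1" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Usage on a GPU node:
#   write_nvidia_daemon_json /tmp/daemon.json
#   sudo cp /tmp/daemon.json /etc/docker/daemon.json
#   sudo systemctl restart docker
```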

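On top of that, the DevicePlugins feature must be enabled on all GPU nodes. (The DevicePlugins feature gate has been enabled by default since Kubernetes 1.10, so on 1.11 this step is only needed if the gate was turned off.) If your system uses systemd, then:

2.1. Open the kubelet's systemd drop-in file installed by kubeadm, /etc/systemd/system/kubelet.service.d/10-kubeadm.conf, and add the following parameter: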

Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"

2.2. Restart the kubelet service:

$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet

2.3. Enable GPU support for the cluster

Once the feature above is enabled on all GPU nodes, you can turn on GPU support for the k8s cluster by deploying the following DaemonSet:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
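
After the DaemonSet is running, each GPU node should advertise the nvidia.com/gpu resource. `kubectl describe node ml` shows this under Capacity and Allocatable. The sketch below filters a two-column listing instead. gpu_nodes is a hypothetical name, and the custom-columns expression in the usage line (with the dot in nvidia.com escaped) is an assumption that you should verify against your kubectl version.

```shell
# gpu_nodes: read a "NAME GPU" table (with a header row) on stdin and print the
# names of the nodes that advertise at least one GPU.
gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1 }'
}

# Usage:
#   kubectl get nodes -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' | gpu_nodes
```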

2.4. An example that uses GPU resources:

NVIDIA GPUs can now be consumed via container level resource requirements using the resource name nvidia.com/gpu:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container      
      image: nvidia/cuda:9.0-devel      
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvidia/digits:6.0      
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs

Adding the Dashboard

Install the dashboard component

gemfield@master:~$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
secret/kubernetes-dashboard-certs created
serviceaccount/kubernetes-dashboard created
role.rbac.authorization.k8s.io/kubernetes-dashboard-minimal created
rolebinding.rbac.authorization.k8s.io/kubernetes-dashboard-minimal created
deployment.apps/kubernetes-dashboard created
service/kubernetes-dashboard created

Then access it as follows:

#for security, it must be proxied to localhost
kubectl proxy

#then visit this address
http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/

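After you enter this URL in a browser, you will see a login page that asks you to choose between kubeconfig and token. In Kubernetes 1.7 and later releases, the dashboard no longer has admin privileges by default after login. You can use the token method or, like Gemfield, simply change the default permissions to admin (risky).

1. Token method: the Kubernetes installation creates many service accounts by default, each with different access permissions. To find the corresponding token, you can do the following: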

gemfield@master:~/test$ kubectl -n kube-system get secret
NAME                                             TYPE                                  DATA      AGE
attachdetach-controller-token-b4tms              kubernetes.io/service-account-token   3         2d
......

gemfield@master:~/test$ kubectl -n kube-system describe secret replicaset-controller-token-57zzz
Name:         replicaset-controller-token-57zzz
Namespace:    kube-system
Labels:       <none>
Annotations:  kubernetes.io/service-account.name=replicaset-controller
              kubernetes.io/service-account.uid=ec80d18e-9801-11e8-9c81-3863bbb80fbf

Type:  kubernetes.io/service-account-token

Data
====
ca.crt:     1025 bytes
namespace:  11 bytes
token:      eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJyZXBsaWNhc2V0LWNvbnRyb2xsZXItdG9rZW4tNTd6enoiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicmVwbGljYXNldC1jb250cm9sbGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiZWM4MGQxOGUtOTgwMS0xMWU4LTljODEtMzg2M2JiYjgwZmJmIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Omt1YmUtc3lzdGVtOnJlcGxpY2FzZXQtY29udHJvbGxlciJ9.EgX-79M1E93Icr4CnMYfX90g-ljgv704x7uUt2fMN7vcP6wIy3VcDdILiv1M_I2QjSSNgo_bVD9onEnaM9MyQL3pkeHGEjEtm3iJZD47MtBPleKjUdt1NPXS-HNKHWEBbSgaVoX3C5pU9a-nut8X6Q71tbJhcvzOnEbXYrxmWiRCYGPyU0h6cLYHzqrQMVyGVdoNpEphi2JtxePLrQ0yAhQcDqBROyHsM4DqfFJTa2tHJx0s8hPE-Dr2FdVtTxtH6H7JvL0fm97VA-2DgYIGca555Eaz0iOCgx2_DuMg5J5urIJJxnYO4Tog_Xs_ON5y62VBg6xyqXPcXzGXfS2lLQ

The token at the end is what you need.
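
To copy just the token without the surrounding fields, the describe output can be filtered. This is a small sketch: secret_token is a hypothetical helper name, and the secret name in the usage line is the one from the listing above.

```shell
# secret_token: read `kubectl describe secret` output on stdin and print only
# the value of the "token:" field.
secret_token() {
  awk '$1 == "token:" { print $2 }'
}

# Usage:
#   kubectl -n kube-system describe secret replicaset-controller-token-57zzz | secret_token
```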

2. The default-admin approach:

#create the dashboard-gemfield-admin.yaml file
gemfield@master:~/test$ cat dashboard-gemfield-admin.yaml 
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: kubernetes-dashboard
  labels:
    k8s-app: kubernetes-dashboard
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: kubernetes-dashboard
  namespace: kube-system

#create the ClusterRoleBinding
gemfield@master:~/test$ kubectl create -f dashboard-gemfield-admin.yaml

Then simply click SKIP on the login page; once you are in, you have admin privileges.

Errors you may encounter

1. Unable to connect to the server: Access violation

gemfield@master:~$ kubectl get pods --all-namespaces -o wide
Unable to connect to the server: Access violation

This happens because the http_proxy/https_proxy proxy variables are set; unset them and the error goes away.
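
If the proxy is still needed for other commands in the same session, the variables can be dropped for kubectl alone. This is a sketch: without_proxy is a hypothetical helper, and it relies on `env -u`, which GNU coreutils and BusyBox both provide.

```shell
# without_proxy CMD...: run CMD with the common proxy variables removed from
# its environment; the current shell session is left unchanged.
without_proxy() {
  env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY "$@"
}

# Usage:
#   without_proxy kubectl get pods --all-namespaces -o wide
```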


2. dial tcp 10.96.0.1:443: i/o timeout

#the dashboard component reports the following error:
gemfield@master:~$ kubectl -n kube-system logs -f kubernetes-dashboard-6948bdb78-m4fkr
......
2018/08/04 12:59:18 Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get https://10.96.0.1:443/version: dial tcp 10.96.0.1:443: i/o timeout

#the DNS component reports the following error:
gemfield@master:~$ kubectl -n kube-system logs -f coredns-78fcdf6894-jkfxn
......
E0804 12:57:16.794093       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0804 12:57:16.795146       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:320: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0804 12:57:16.796230       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout

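This happens because the IP_POOL of your network component (for example Calico) conflicts with the IP range of the LAN that the host is on.


3. kubeadm join fails with "Unauthorized" when adding a new worker node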

gemfield@master:~$ kubeadm join 192.168.1.196:6443 --token 5q8yfk.kzf7tw2qrskufw40 --discovery-token-ca-cert-hash sha256:3967c70711c30a458b38df253cc5d60bf2c615cd5868d34348c05b8103429fbb
[discovery] Successfully established connection with API Server "192.168.1.196:6443"
I0809 16:59:14.179997   16530 join.go:260] [join] writing bootstrap kubelet config file at /etc/kubernetes/bootstrap-kubelet.conf
I0809 16:59:14.286026   16530 join.go:283] Stopping the kubelet
[kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.11" ConfigMap in the kube-system namespace
Unauthorized

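This happens because the token you used has expired. By default, a token generated by kubeadm init is valid for 24 hours, so you most likely ran kubeadm join more than a day later. You can generate a new token with the following command: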

gemfield@master:~/test/kubernetes-kafka/web$ kubeadm token create --print-join-command
kubeadm join 192.168.1.196:6443 --token 5q8yfk.kzf7tw2qrskufw40 --discovery-token-ca-cert-hash sha256:3967c70711c30a458b38df253cc5d60bf2c615cd5868d34348c05b8103429fbb

Edited on 2018-08-09
