Kubeflow 1.0 Installation Tutorial

Kubeflow is a machine-learning toolkit for Kubernetes, originally open-sourced by Google. It defines custom resource types such as TFJob, so that running a distributed TensorFlow training job becomes as simple as deploying an ordinary application. The initial design combined Kubernetes and TensorFlow to support distributed TensorFlow training, but supporting TensorFlow alone was far from enough, so the Kubeflow community has gradually added support for other deep-learning frameworks, e.g. MXNet, Caffe2, and PyTorch. The goal is that ML engineers only need to care about implementing their algorithms, while model training and serving are handled by the platform, freeing them to focus on what they do best.

At the moment Kubeflow can only be deployed on Kubernetes v1.15.11 and below; v1.16 and later have compatibility problems (see this article for details). In what follows we deploy Kubeflow v1.0 on MicroK8s (channel 1.15/stable).
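For reference, here is a minimal sketch of getting MicroK8s onto the right channel (the add-ons enabled below are an assumption; adjust for your cluster):

# install MicroK8s pinned to the 1.15 track (requires snapd)
sudo snap install microk8s --classic --channel=1.15/stable
# enable the basic add-ons a Kubeflow deployment relies on
microk8s.enable dns storage
# confirm the node is Ready
microk8s.kubectl get nodes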

Installing Kubeflow

Download the kfctl binary from the Kubeflow releases page.
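For example (the URL below is an assumption based on the release asset name used in the next step; check the releases page for the exact link):

wget https://github.com/kubeflow/kfctl/releases/download/v1.0.1/kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz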

Extract the archive and copy the binary onto the PATH:

tar -xvf kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz
sudo cp kfctl /usr/bin
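To confirm the binary is on the PATH, print its version:

kfctl version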

Set up the environment:

export BASE_DIR=/data/
export KF_NAME=my-kubeflow
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.1.yaml"

Deploying Kubeflow

mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}
kubectl -n kubeflow get all

The installation has succeeded when kfctl apply completes without errors.

Check the status of the current Kubernetes pods:

kubectl get pods --namespace kubeflow

You will find that most of the Kubeflow pods fail to start. The cause is, once again, network access: the images have to be pulled from gcr.io. I used to pull them through the gcr.azk8s.cn mirror, but that address recently stopped working, so the only option is to rebuild the images yourself and push them to an Aliyun or Docker Hub registry. You can use the images I have already built by running the script below, or build your own; see this article for the build process.
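For reference, mirroring a single gcr.io image through your own registry is just a pull/tag/push; a sketch, where <your-repo> is a placeholder and the machine running it must be able to reach gcr.io:

# pull from gcr.io, re-tag, and push to your own registry
docker pull gcr.io/ml-pipeline/api-server:0.2.0
docker tag gcr.io/ml-pipeline/api-server:0.2.0 <your-repo>/api-server:0.2.0
docker push <your-repo>/api-server:0.2.0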

Run the following script to pull the images locally and import them into MicroK8s:

#!/usr/bin/env bash

echo ""
echo "=========================================================="
echo "pull kubeflow v1.0 images from dockerhub ..."
echo "=========================================================="
echo ""

# registry.cn-hangzhou.aliyuncs.com/smartliby

gcr_imgs=(
"smartliby/kfserving-controller:0.2.2,gcr.io/kfserving/kfserving-controller:0.2.2"
"smartliby/api-server:0.2.0,gcr.io/ml-pipeline/api-server:0.2.0"
"smartliby/kfam:v1.0.0-gf3e09203,gcr.io/kubeflow-images-public/kfam:v1.0.0-gf3e09203"
"smartliby/ingress-setup:latest,gcr.io/kubeflow-images-public/ingress-setup:latest"
"smartliby/application:1.0-beta,gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta"
"smartliby/centraldashboard:v1.0.0-g3ec0de71,gcr.io/kubeflow-images-public/centraldashboard:v1.0.0-g3ec0de71"
"smartliby/jupyter-web-app:v1.0.0-g2bd63238,gcr.io/kubeflow-images-public/jupyter-web-app:v1.0.0-g2bd63238"
"smartliby/katib-controller:v0.8.0,gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:v0.8.0"
"smartliby/katib-db-manager:v0.8.0,gcr.io/kubeflow-images-public/katib/v1alpha3/katib-db-manager:v0.8.0"
"smartliby/katib-ui:v0.8.0,gcr.io/kubeflow-images-public/katib/v1alpha3/katib-ui:v0.8.0"
"smartliby/kube-rbac-proxy:v0.4.0,gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0"
"smartliby/metacontroller:v0.3.0,gcr.io/metacontroller/metacontroller:v0.3.0"
"smartliby/metadata:v0.1.11,gcr.io/kubeflow-images-public/metadata:v0.1.11"
"smartliby/envoy:metadata-grpc,gcr.io/ml-pipeline/envoy:metadata-grpc"
"smartliby/ml_metadata_store_server:v0.21.1,gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1"
"smartliby/metadata-frontend:v0.1.8,gcr.io/kubeflow-images-public/metadata-frontend:v0.1.8"
"smartliby/visualization-server:0.2.0,gcr.io/ml-pipeline/visualization-server:0.2.0"
"smartliby/persistenceagent:0.2.0,gcr.io/ml-pipeline/persistenceagent:0.2.0"
"smartliby/scheduledworkflow:0.2.0,gcr.io/ml-pipeline/scheduledworkflow:0.2.0"
"smartliby/frontend:0.2.0,gcr.io/ml-pipeline/frontend:0.2.0"
"smartliby/viewer-crd-controller:0.2.0,gcr.io/ml-pipeline/viewer-crd-controller:0.2.0"
"smartliby/notebook-controller:v1.0.0-gcd65ce25,gcr.io/kubeflow-images-public/notebook-controller:v1.0.0-gcd65ce25"
"smartliby/profile-controller:v1.0.0-ge50a8531,gcr.io/kubeflow-images-public/profile-controller:v1.0.0-ge50a8531"
"smartliby/pytorch-operator:v1.0.0-g047cf0f,gcr.io/kubeflow-images-public/pytorch-operator:v1.0.0-g047cf0f"
"smartliby/spark-operator:v1beta2-1.0.0-2.4.4,gcr.io/spark-operator/spark-operator:v1beta2-1.0.0-2.4.4"
"smartliby/spartakus-amd64:v1.1.0,gcr.io/google_containers/spartakus-amd64:v1.1.0"
"smartliby/tf_operator:v1.0.0-g92389064,gcr.io/kubeflow-images-public/tf_operator:v1.0.0-g92389064"
"smartliby/admission-webhook:v1.0.0-gaf96e4e3,gcr.io/kubeflow-images-public/admission-webhook:v1.0.0-gaf96e4e3"
"smartliby/kfam:v1.0.0-gf3e09203,gcr.io/kubeflow-images-public/kfam:v1.0.0-gf3e09203"
"smartliby/api-server:0.2.0,gcr.io/ml-pipeline/api-server:0.2.0"
)

mkdir -p /data/k8s_img/kubeflow

for img in "${gcr_imgs[@]}"
do
    img_array=(${img//,/ })
    # pull the mirrored image from Docker Hub
    docker pull ${img_array[0]}
    # re-tag it with its original gcr.io name (dropping any @sha256 digest)
    image_name=${img_array[1]}
    image_name=${image_name%@*}
    docker tag ${img_array[0]} ${image_name}
    # export the image to a tar archive
    docker save ${image_name} > /data/k8s_img/kubeflow/${image_name##*/}.tar
    # import the archive into the MicroK8s containerd image store
    microk8s.ctr --namespace k8s.io image import /data/k8s_img/kubeflow/${image_name##*/}.tar
    # remove the local tags to free disk space
    docker rmi ${img_array[0]} ${image_name}
done

echo ""
echo "=========================================================="
echo "pull kubeflow v1.0 images from dockerhub finished."
echo "=========================================================="
echo ""

After running the script you will find that knative-serving still does not start. The reason is that the knative images are pinned by sha256 digest rather than by version tag, and the digest changed when the images were re-pushed to Aliyun, so the original references can no longer be pulled. The only fix is to edit the deployment manifest and change the image addresses:

vim /data/my-kubeflow/kustomize/knative-install/base/deployment.yaml

The image references look like this:

gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:8e606671215cc029683e8cd633ec5de9eabeaa6e9a4392ff289883304be1f418
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler-hpa@sha256:5e0fadf574e66fb1c893806b5c5e5f19139cc476ebf1dff9860789fe4ac5f545
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:ef1f01b5fb3886d4c488a219687aac72d28e72f808691132f658259e4e02bb27
gcr.io/knative-releases/knative.dev/serving/cmd/networking/istio@sha256:727a623ccb17676fae8058cb1691207a9658a8d71bc7603d701e23b1a6037e6c
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:1ef3328282f31704b5802c1136bd117e8598fd9f437df8209ca87366c5ce9fcb
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:5ca13e5b3ce5e2819c4567b75c0984650a57272ece44bc1dabf930f9fe1e19a1

Change them to:

smartliby/activator:latest
smartliby/autoscaler-hpa:latest
smartliby/autoscaler:latest
smartliby/istio:latest
smartliby/webhook:latest
smartliby/controller:latest
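The same substitution can also be scripted with sed instead of hand-editing; a sketch, assuming the file path and mirrored image names from the steps above:

cd /data/my-kubeflow/kustomize/knative-install/base
# swap each digest-pinned knative image for its mirrored tag
for c in activator autoscaler-hpa autoscaler webhook controller; do
    sed -i "s#gcr.io/knative-releases/knative.dev/serving/cmd/${c}@sha256:[a-f0-9]*#smartliby/${c}:latest#" deployment.yaml
done
# the istio networking image lives under cmd/networking/
sed -i 's#gcr.io/knative-releases/knative.dev/serving/cmd/networking/istio@sha256:[a-f0-9]*#smartliby/istio:latest#' deployment.yaml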

Finally, run kfctl apply -V -f ${CONFIG_URI} again to redo the installation.

If any pods still fail to start, run kubectl describe pod <pod-name> -n kubeflow to find the cause.

If the cause is an image that cannot be pulled, add that image to the script above and download it the same way.

If the image pull policy makes Kubernetes re-download the image on every restart, change Always to IfNotPresent, either through the kubernetes-dashboard or with the command below. Note that imagePullPolicy cannot be changed on a running Pod, so edit the owning Deployment (or StatefulSet) instead:

kubectl edit deployment <deployment-name> -n kubeflow
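Equivalently, a one-line patch avoids opening an editor; a sketch, where tf-job-operator is just an example deployment name and index 0 assumes the policy to change sits on the first container:

kubectl -n kubeflow patch deployment tf-job-operator --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"IfNotPresent"}]'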

Run kubectl get pods --namespace kubeflow again and confirm that all of the Kubeflow pods are up:

NAME                                                           READY   STATUS      RESTARTS   AGE
admission-webhook-bootstrap-stateful-set-0                     1/1     Running     6          2d4h
admission-webhook-deployment-569558c8b6-n8b7k                  1/1     Running     0          13h
application-controller-stateful-set-0                          1/1     Running     3          2d4h
argo-ui-7ffb9b6577-w8pb7                                       1/1     Running     7          3d7h
centraldashboard-659bd78c-fxgqd                                1/1     Running     3          3d7h
jupyter-web-app-deployment-878f9c988-xgh82                     1/1     Running     3          2d5h
katib-controller-7f58569f7d-8bw7z                              1/1     Running     4          3d7h
katib-db-manager-54b66f9f9d-ngqw9                              1/1     Running     3          3d7h
katib-mysql-dcf7dcbd5-7wbck                                    1/1     Running     12         4d1h
katib-ui-6f97756598-4mtjs                                      1/1     Running     3          3d7h
kfserving-controller-manager-0                                 2/2     Running     7          2d5h
metacontroller-0                                               1/1     Running     5          3d7h
metadata-db-65fb5b695d-wq8vh                                   1/1     Running     12         4d1h
metadata-deployment-65ccddfd4c-vwfd2                           1/1     Running     3          3d7h
metadata-envoy-deployment-7754f56bff-svtz2                     1/1     Running     3          3d7h
metadata-grpc-deployment-75f9888cbf-zj4sn                      1/1     Running     5          3d7h
metadata-ui-7c85545947-v68l7                                   1/1     Running     3          3d7h
minio-69b4676bb7-w96xk                                         1/1     Running     12         4d1h
ml-pipeline-5cddb75848-bsc48                                   1/1     Running     3          2d6h
ml-pipeline-ml-pipeline-visualizationserver-7f6fcb68c8-vxjj7   1/1     Running     3          2d7h
ml-pipeline-persistenceagent-6ff9fb86dc-dvxx4                  1/1     Running     5          2d6h
ml-pipeline-scheduledworkflow-7f84b54646-ndxcb                 1/1     Running     3          2d7h
ml-pipeline-ui-6758f58868-gqvlp                                1/1     Running     3          2d6h
ml-pipeline-viewer-controller-deployment-685874bc58-jljw8      1/1     Running     3          2d5h
mysql-6bcbfbb6b8-xmphz                                         1/1     Running     12         4d1h
notebook-controller-deployment-7db7c8589d-mlgb4                1/1     Running     3          2d5h
profiles-deployment-56b7c6788f-kk8kh                           2/2     Running     6          2d7h
pytorch-operator-cf8c5c497-nmfnv                               1/1     Running     7          3d7h
seldon-controller-manager-6b4b969447-qp7l4                     1/1     Running     20         4d1h
spark-operatorcrd-cleanup-rrpxd                                0/2     Completed   0          3d7h
spark-operatorsparkoperator-76dd5f5688-kn28n                   1/1     Running     3          3d7h
spartakus-volunteer-5dc96f4447-xjclm                           1/1     Running     3          3d7h
tensorboard-5f685f9d79-9x549                                   1/1     Running     12         4d1h
tf-job-operator-5fb85c5fb7-lqvrg                               1/1     Running     6          3d7h
workflow-controller-689d6c8846-znvt9                           1/1     Running     12         4d1h

Run the following command to port-forward the Kubeflow UI:

nohup kubectl port-forward -n istio-system svc/istio-ingressgateway 8088:80 &
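Before opening a browser you can check that the tunnel is up; the command below just prints the HTTP status code:

curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8088/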

Then open http://127.0.0.1:8088/ in a browser.

Create a Jupyter notebook server.

After clicking Connect, you can start training models.

Testing Jupyter

Create a Python 3 notebook and run the following code:

# load the MNIST dataset (TensorFlow 1.x tutorial helper)
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

import tensorflow as tf

# placeholder for flattened 28x28 input images
x = tf.placeholder(tf.float32, [None, 784])

# weights and bias of a single softmax layer
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# predicted class probabilities
y = tf.nn.softmax(tf.matmul(x, W) + b)

# cross-entropy loss against the one-hot labels
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# train for 1000 steps with mini-batches of 100 images
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# evaluate accuracy on the test set
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print("Accuracy: ", sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

The output should look like this:

Accuracy:  0.9012

Testing pipelines
