Deploying the Multi-Cluster Federated Learning Platform FATE with KubeFATE

Author: 亨利笔记 (Henry's Notes)
Published: May 27, 2020
[Cover image: photo taken at the Beijing Olympic Center]



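Goal

Earlier articles in this series showed how to deploy a FATE cluster with KubeFATE and how to use MiniKube to set up a multi-party federated learning environment for testing. Running two FATE instances on a single server, however, is inefficient and does not reflect real-world conditions. In a real production environment each party should have its own independent environment, and each of those environments should be a complete Kubernetes cluster. This article walks through a more realistic, production-style example: using KubeFATE to deploy two interconnected FATE instances on two Kubernetes clusters. Together the two instances can run all kinds of federated learning tasks.

When the deployment is complete, it has the structure shown in the figure below:

[Figure: architecture of the two-cluster FATE deployment]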





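Prerequisites

  1. Two independent Kubernetes clusters, v1.16.0 or later;

  2. Both clusters can reach the internet and can reach each other;

  3. Two deployment machines, each able to run kubectl commands against its own cluster;

  4. An ingress controller is already deployed in both Kubernetes clusters.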
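
As a quick way to check the first prerequisite, the sketch below compares a version string (for example the VERSION column of `kubectl get node`) against the v1.16.0 minimum. The helper name `version_ge` is hypothetical, and the sketch assumes GNU `sort -V` is available on the deployment machine.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed if version A >= version B (a leading "v" is ignored).
# Hypothetical helper; relies on GNU sort's -V (version sort) option.
version_ge() {
  local a="${1#v}" b="${2#v}"
  # The smaller version sorts first, so A >= B exactly when B is the first line.
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$b" ]
}

# Example: v1.17.3 is the node version reported by the clusters in this tutorial.
if version_ge "v1.17.3" "v1.16.0"; then
  echo "Kubernetes version OK"
else
  echo "Kubernetes version too old"
fi
```

The same function can be fed the version of each node in turn; it only compares strings and does not contact the cluster itself.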



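In this tutorial the two Kubernetes clusters are referred to as "Cluster A" and "Cluster B".

The two FATE instances deployed on them are referred to as "PartyA" and "PartyB".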



Getting started

Checking the clusters

Cluster A information



[deploy-A]$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master Ready master 191d v1.17.3 192.168.10.1 <none> CentOS Linux 7 (Core) 3.10.0-1062.12.1.el7.x86_64 docker://19.3.5
node-0 Ready <none> 191d v1.17.3 192.168.10.2 <none> CentOS Linux 7 (Core) 3.10.0-1062.12.1.el7.x86_64 docker://19.3.3
node-1 Ready <none> 191d v1.17.3 192.168.10.3 <none> CentOS Linux 7 (Core) 3.10.0-1062.12.1.el7.x86_64 docker://19.3.3



Cluster B information



[deploy-B]$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master Ready master 191d v1.17.3 192.168.9.1 <none> CentOS Linux 7 (Core) 3.10.0-1062.12.1.el7.x86_64 docker://19.3.5
node-0 Ready <none> 191d v1.17.3 192.168.9.2 <none> CentOS Linux 7 (Core) 3.10.0-1062.12.1.el7.x86_64 docker://19.3.5
node-1 Ready <none> 191d v1.17.3 192.168.9.3 <none> CentOS Linux 7 (Core) 3.10.0-1062.12.1.el7.x86_64 docker://19.3.5



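Downloading KubeFATE

KubeFATE is open source at https://github.com/FederatedAI/KubeFATE

Open the releases page: https://github.com/FederatedAI/KubeFATE/releases

KubeFATE is a best practice for containerized FATE deployment. The project is updated frequently, so using the latest release is recommended.

Download "The KubeFATE release pack" to obtain the file kubefate-k8s-v1.3.0-a.tar.gz.

Extract the downloaded file to get one executable and four YAML files: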



# Extract on Cluster A
[deploy-A]$ tar -zxvf kubefate-k8s-v1.3.0-a.tar.gz
kubefate
cluster.yaml
config.yaml
kubefate.yaml
rbac-config.yaml



# Extract on Cluster B
[deploy-B]$ tar -zxvf kubefate-k8s-v1.3.0-a.tar.gz
kubefate
cluster.yaml
config.yaml
kubefate.yaml
rbac-config.yaml



Installing the KubeFATE server on Kubernetes

Creating the namespace and RBAC permissions for the KubeFATE server

Apply rbac-config.yaml on both Cluster A and Cluster B:

# Cluster A
[deploy-A]$ kubectl apply -f ./rbac-config.yaml
namespace/kube-fate created
serviceaccount/kubefate-admin created
clusterrolebinding.rbac.authorization.k8s.io/kubefate created



# Cluster B
[deploy-B]$ kubectl apply -f ./rbac-config.yaml
namespace/kube-fate created
serviceaccount/kubefate-admin created
clusterrolebinding.rbac.authorization.k8s.io/kubefate created



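[^Note]: rbac-config.yaml contains the namespace and permission configuration. By default it binds the cluster-admin role; cluster administrators can configure a role of their own instead.

Deploying the KubeFATE server

Apply kubefate.yaml on both Cluster A and Cluster B:

# Cluster A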
[deploy-A]$ kubectl apply -f ./kubefate.yaml
deployment.apps/kubefate created
deployment.apps/mongo created
service/mongo created
service/kubefate created
ingress.extensions/kubefate created



# Cluster B
[deploy-B]$ kubectl apply -f ./kubefate.yaml
deployment.apps/kubefate created
deployment.apps/mongo created
service/mongo created
service/kubefate created
ingress.extensions/kubefate created



[^Note]: kubefate.yaml contains some basic configuration for the KubeFATE server; modify it yourself if necessary.

Checking the deployment result

# Cluster A
[deploy-A]$ kubectl get pod,ingress -n kube-fate
NAME                            READY   STATUS    RESTARTS   AGE
pod/kubefate-5465d889bb-4hwz6   1/1     Running   0          3m54s
pod/mongo-7b66bf7d87-hdl5c      1/1     Running   0          3m54s

NAME                          HOSTS          ADDRESS          PORTS   AGE
ingress.extensions/kubefate   kubefate.net   10.184.104.151   80      3m53s



# Cluster B
[deploy-B]$ kubectl get pod,ingress -n kube-fate
NAME                            READY   STATUS    RESTARTS   AGE
pod/kubefate-64d57cb855-lks2f   1/1     Running   0          3m54s
pod/mongo-56684d6c86-96s96      1/1     Running   0          3m54s

NAME                          HOSTS          ADDRESS          PORTS   AGE
ingress.extensions/kubefate   kubefate.net   10.184.104.151   80      3m53s

If both pods are in the Running state, the KubeFATE server has been deployed successfully.
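
The manual check above can also be scripted. The sketch below reads `kubectl get pod --no-headers` style output on standard input and succeeds only if every pod's STATUS column is Running. The function name `all_pods_running` is hypothetical, and the column order is assumed to match the output shown above.

```shell
#!/usr/bin/env bash
# all_pods_running: read pod lines (NAME READY STATUS RESTARTS AGE) on stdin;
# succeed only if at least one pod is listed and every STATUS is "Running".
# Hypothetical helper; assumes the default column order of `kubectl get pod`.
all_pods_running() {
  awk 'NF >= 3 { n++; if ($3 != "Running") bad++ }
       END { exit (n > 0 && bad == 0) ? 0 : 1 }'
}

# Intended use on a deployment machine (needs access to the cluster):
#   kubectl get pod -n kube-fate --no-headers | all_pods_running && echo "KubeFATE server is up"

# Offline demonstration with the Cluster A pod lines from this article:
printf '%s\n' \
  'kubefate-5465d889bb-4hwz6 1/1 Running 0 3m54s' \
  'mongo-7b66bf7d87-hdl5c 1/1 Running 0 3m54s' | all_pods_running && echo "all pods Running"
```

An alternative that avoids parsing text is `kubectl wait --for=condition=Ready pod --all -n kube-fate --timeout=120s`, which blocks until the pods report Ready or the timeout expires.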



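Configuring the ingress host

On each of the two deployment machines, A and B, configure name resolution for the kubefate.net ingress.

[^Note]: This tutorial uses ingress-nginx as the ingress controller and exposes its ports through the host network; the setup is described [here](https://kubernetes.github.io/ingress-nginx/deploy/baremetal/#via-the-host-network). The ingress address therefore resolves to the IP of the node on which the ingress-nginx pod is running.

# Cluster A
# Find the name of the node where the ingress pod is running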
[deploy-A]$ kubectl get pod -A -o wide | grep ingress
kube-system nginx-ingress-controller-6fc5bcc8c9-l2vnx 1/1 Running 3 41h 172.17.0.2 node-0 <none> <none>



# Cluster B
# Find the name of the node where the ingress pod is running
[deploy-B]$ kubectl get pod -A -o wide | grep ingress
kube-system nginx-ingress-controller-6d747b878f-9cdxt 1/1 Running 2 40h 172.17.0.2 node-0 <none> <none>



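On both sides the ingress pod runs on node-0. From the cluster information checked earlier:

  • The IP of node-0 in Cluster A is 192.168.10.2

  • The IP of node-0 in Cluster B is 192.168.9.2

Set the corresponding host entry on each side: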



# Cluster A (writing to /etc/hosts requires root privileges)
[deploy-A]$ echo "192.168.10.2 kubefate.net" | sudo tee -a /etc/hosts



# Cluster B (writing to /etc/hosts requires root privileges)
[deploy-B]$ echo "192.168.9.2 kubefate.net" | sudo tee -a /etc/hosts
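
Appending to the hosts file repeatedly leaves duplicate lines behind. As a minimal sketch, the hypothetical helper `add_host_entry` below appends an entry only when the hostname is not already mapped. It is demonstrated on a temporary file; writing to the real /etc/hosts would need root privileges.

```shell
#!/usr/bin/env bash
# add_host_entry IP NAME [FILE]: append "IP NAME" to FILE (default /etc/hosts)
# unless NAME is already mapped there. Hypothetical helper; when FILE is
# /etc/hosts it must run as root.
add_host_entry() {
  local ip="$1" name="$2" file="${3:-/etc/hosts}"
  # Skip comment lines; look for NAME among the hostnames of any entry.
  if awk -v n="$name" '$1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                       END { exit found ? 0 : 1 }' "$file"; then
    echo "$name already present in $file"
  else
    printf '%s %s\n' "$ip" "$name" >> "$file"
    echo "added $ip $name to $file"
  fi
}

# Demonstration on a temporary file instead of the real /etc/hosts:
tmp_hosts="$(mktemp)"
add_host_entry 192.168.10.2 kubefate.net "$tmp_hosts"
add_host_entry 192.168.10.2 kubefate.net "$tmp_hosts"   # second call changes nothing
cat "$tmp_hosts"
rm -f "$tmp_hosts"
```

The check matches on the hostname only, so an existing entry that points kubefate.net at a different IP is left untouched and has to be corrected by hand.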



Testing connectivity between the KubeFATE command-line tool and the KubeFATE server

Install the kubefate command-line tool

# Cluster A
[deploy-A]$ chmod +x ./kubefate && sudo mv ./kubefate /usr/local/bin/kubefate



# Cluster B
[deploy-B]$ chmod +x ./kubefate && sudo mv ./kubefate /usr/local/bin/kubefate



Check the connection

# Cluster A
[deploy-A]$ kubefate version
* kubefate service version=v1.0.2 # this line means the connection to the KubeFATE server succeeded
* kubefate commandLine version=v1.0.2



# Cluster B
[deploy-B]$ kubefate version
* kubefate service version=v1.0.2 # this line means the connection to the KubeFATE server succeeded
* kubefate commandLine version=v1.0.2



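Editing the configuration files for PartyA and PartyB

cluster.yaml is the deployment configuration file that kubefate uses to deploy a FATE instance. It contains the basic information about the instance; a detailed description is available in the KubeFATE project documentation.

PartyA's configuration file, cluster-A.yaml:

# Cluster A
name: fate-10000
namespace: fate-10000
chartName: fate
chartVersion: v1.3.0-a
partyId: 10000
modules:
  - proxy
  - egg
  - federation
  - metaService
  - mysql
  - redis
  - roll
  - python
proxy:
  type: NodePort
  nodePort: 30010
  partyList:
    - partyId: 9999
      partyIp: 192.168.9.3
      partyPort: 30009
egg:
  count: 3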



PartyB's configuration file, cluster-B.yaml:

# Cluster B
name: fate-9999
namespace: fate-9999
chartName: fate
chartVersion: v1.3.0-a
partyId: 9999
modules:
  - proxy
  - egg
  - federation
  - metaService
  - mysql
  - redis
  - roll
  - python
proxy:
  type: NodePort
  nodePort: 30009
  partyList:
    - partyId: 10000
      partyIp: 192.168.10.3
      partyPort: 30010
egg:
  count: 3



[^Note]: Each side's partyList points at the other party's proxy entry point.
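
Because the two files mirror each other, a mistyped port is easy to miss. The sketch below cross-checks that each side's partyPort equals the other side's nodePort. The helpers `yaml_value` and `check_peer_ports` are hypothetical; the lookup is a naive line match rather than a YAML parser, and it assumes nodePort and partyPort each appear once per file, as in the two files above.

```shell
#!/usr/bin/env bash
# yaml_value KEY FILE: print the value of the first "KEY: value" line in FILE.
# Deliberately naive (not a YAML parser); hypothetical helper.
yaml_value() {
  awk -v k="$1:" '$1 == k { print $2; exit }' "$2"
}

# check_peer_ports A.yaml B.yaml: succeed if A's partyPort is B's nodePort
# and B's partyPort is A's nodePort.
check_peer_ports() {
  local a_node a_peer b_node b_peer
  a_node="$(yaml_value nodePort "$1")"; a_peer="$(yaml_value partyPort "$1")"
  b_node="$(yaml_value nodePort "$2")"; b_peer="$(yaml_value partyPort "$2")"
  [ -n "$a_node" ] && [ -n "$b_node" ] &&
    [ "$a_peer" = "$b_node" ] && [ "$b_peer" = "$a_node" ]
}

# Demonstration with trimmed-down copies of the two proxy sections:
dir="$(mktemp -d)"
printf 'proxy:\n  nodePort: 30010\n  partyList:\n    - partyId: 9999\n      partyPort: 30009\n' > "$dir/cluster-A.yaml"
printf 'proxy:\n  nodePort: 30009\n  partyList:\n    - partyId: 10000\n      partyPort: 30010\n' > "$dir/cluster-B.yaml"
check_peer_ports "$dir/cluster-A.yaml" "$dir/cluster-B.yaml" && echo "proxy ports match"
rm -rf "$dir"
```

The check covers ports only. Each partyIp still has to be the address of a node in the other cluster that is reachable from this one.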



Creating the required namespaces

# Cluster A
[deploy-A]$ kubectl create namespace fate-10000
namespace/fate-10000 created



# Cluster B
[deploy-B]$ kubectl create namespace fate-9999
namespace/fate-9999 created



Deploying FATE

Use the KubeFATE command-line tool to deploy FATE:

# Cluster A
[deploy-A]$ kubefate cluster install -f ./cluster-A.yaml
create job success, job id=5b37a5a3-f33b-4967-ae0c-73bea5770c49



# Cluster B
[deploy-B]$ kubefate cluster install -f ./cluster-B.yaml
create job success, job id=5b1dfdc9-7744-4911-a2e4-009e0fdfcb49



Checking the deployment status

Deploying FATE creates a deployment job; checking the job's status shows whether FATE was deployed successfully.

# Cluster A
[deploy-A]$ kubefate job describe 5b37a5a3-f33b-4967-ae0c-73bea5770c49
UUID 5b37a5a3-f33b-4967-ae0c-73bea5770c49
StartTime 2020-04-23 07:06:19
EndTime 2020-04-23 07:06:22
Status Success
Creator admin
ClusterId 0e41ff79-d3be-416d-a53c-af13a7bfdf58
Result Cluster install success
SubJobs []



# Cluster B
[deploy-B]$ kubefate job describe 5b1dfdc9-7744-4911-a2e4-009e0fdfcb49
UUID 5b1dfdc9-7744-4911-a2e4-009e0fdfcb49
StartTime 2020-04-24 02:58:10
EndTime 2020-04-24 02:58:13
Status Success
Creator admin
ClusterId d8a1f86e-ac9b-4218-b569-b9093b34c911
Result Cluster install success
SubJobs []
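
Rather than re-running `kubefate job describe` by hand, the status can be polled. The sketch below is one possible approach: `job_status` extracts the Status field from the describe output, and `wait_for_job` retries until it reads Success. Both names are hypothetical, and the parsing assumes the "Status <value>" layout shown above. Only the Success value is taken from this article, so every other value is treated as "not finished yet".

```shell
#!/usr/bin/env bash
# job_status: read `kubefate job describe` output on stdin and print the
# value of its Status field. Assumes the "Status <value>" layout shown above.
job_status() {
  awk '$1 == "Status" { print $2; exit }'
}

# wait_for_job JOB_ID [TRIES] [DELAY]: poll until the job reports Success.
# Hypothetical helper; gives up (non-zero exit) after TRIES attempts.
wait_for_job() {
  local id="$1" tries="${2:-30}" delay="${3:-10}" status
  while [ "$tries" -gt 0 ]; do
    status="$(kubefate job describe "$id" | job_status)"
    echo "job $id status: ${status:-unknown}"
    [ "$status" = "Success" ] && return 0
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}

# Usage on a deployment machine:
#   wait_for_job 5b37a5a3-f33b-4967-ae0c-73bea5770c49

# Offline demonstration of the parsing step:
printf 'UUID 5b37a5a3\nStatus Success\nCreator admin\n' | job_status
```

With the defaults the loop waits up to about five minutes (30 attempts, 10 seconds apart), which can be adjusted through the second and third arguments.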



View the FATE cluster information to check its running status:

# Cluster A
[deploy-A]$ kubefate cluster describe 0e41ff79-d3be-416d-a53c-af13a7bfdf58
UUID 0e41ff79-d3be-416d-a53c-af13a7bfdf58
Name fate-10000
NameSpace fate-10000
ChartName fate
ChartVersion v1.3.0-c
Revision 1
Status Running
Values {"chartName":"fate","chartVersion":"v1.3.0-c","egg":{"count":3},"modules":["proxy","egg","federation","metaService","mysql","redis","roll","python","client"],"name":"fate-10000","namespace":"fate-10000","partyId":10000,"proxy":{"nodePort":30010,"partyList":[{"partyId":9999,"partyIp":"192.168.9.3","partyPort":30009}],"type":"NodePort"}}
Config map[chartName:fate chartVersion:v1.3.0-c egg:map[count:3] modules:[proxy egg federation metaService mysql redis roll python client] name:fate-10000 namespace:fate-10000 partyId:10000 proxy:map[nodePort:30010 partyList:[map[partyId:9999 partyIp:192.168.9.3 partyPort:30009]] type:NodePort]]
Info map[dashboard:10000.fateboard.kubefate.net ip:192.168.10.3 modules:[client-6bdb7cd59d-whhx8 egg0-5b44548fbd-dwv7g egg1-685b57d7f5-ftpzg egg2-6687f8486b-f6nwn federation-6d799b5cfd-hrtnp meta-service-54db9f8fbc-6xv9q mysql-6bc77fc46c-fdlhp proxy-8d758c997-v6kq6 python-77bb96fd78-xnjgp redis-9546f56b-jbmxw roll-77dfbb54dc-2q2mb] port:30010]



# Cluster B
[deploy-B]$ kubefate cluster describe d8a1f86e-ac9b-4218-b569-b9093b34c911
UUID d8a1f86e-ac9b-4218-b569-b9093b34c911
Name fate-9999
NameSpace fate-9999
ChartName fate
ChartVersion v1.3.0-c
Revision 1
Status Running
Values {"chartName":"fate","chartVersion":"v1.3.0-c","egg":{"count":3},"modules":["proxy","egg","federation","metaService","mysql","redis","roll","python","client"],"name":"fate-9999","namespace":"fate-9999","partyId":9999,"proxy":{"nodePort":30009,"partyList":[{"partyId":10000,"partyIp":"192.168.10.3","partyPort":30010}],"type":"NodePort"}}
Config map[chartName:fate chartVersion:v1.3.0-c egg:map[count:3] modules:[proxy egg federation metaService mysql redis roll python client] name:fate-9999 namespace:fate-9999 partyId:9999 proxy:map[nodePort:30009 partyList:[map[partyId:10000 partyIp:192.168.10.3 partyPort:30010]] type:NodePort]]
Info map[dashboard:9999.fateboard.kubefate.net ip:192.168.9.3 modules:[client-7db8b9fb45-7scr7 egg0-79768fbffb-ndmq2 egg1-6bd6b965cf-vrkmz egg2-67d896f78-b7dz8 federation-547dfd8654-xmhcx meta-service-9c74f597f-t9zlw mysql-7788cc95-b24vz proxy-6687bffc77-hjj5x python-68df558bb6-bhxr7 redis-864f95f74-p86pv roll-bf7cf74c9-52qqj] port:30009]



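[^Note]: A Status of Running means FATE is running normally.

Configuring the FATE-Board host

Configure the hosts for fateboard and notebook on each side. As with kubefate.net, adding one line to the hosts file is enough to view and use fateboard and notebook in a browser.

# Cluster A (writing to /etc/hosts requires root privileges)
[deploy-A]$ echo "192.168.10.2 10000.fateboard.kubefate.net" | sudo tee -a /etc/hosts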



# Cluster B (writing to /etc/hosts requires root privileges)
[deploy-B]$ echo "192.168.9.2 9999.fateboard.kubefate.net" | sudo tee -a /etc/hosts



Once this is configured, open the fateboard URLs in a browser:

10000.fateboard.kubefate.net

9999.fateboard.kubefate.net

The page looks similar to the following:

[Figure: FATE-Board page]



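Checking connectivity between PartyA and PartyB

Two parties that can reach each other can carry out all kinds of federated learning tasks. Next we use toy_example to test connectivity between the two sides.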
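
Before running toy_example it can help to confirm that the peer's proxy NodePort accepts TCP connections at all. The sketch below is a minimal probe using bash's /dev/tcp pseudo-device and the coreutils `timeout` command. The helper name `port_open` is hypothetical, and the address and port in the usage comment are the partyIp and partyPort from cluster-A.yaml.

```shell
#!/usr/bin/env bash
# port_open HOST PORT [TIMEOUT]: succeed if a TCP connection to HOST:PORT
# can be opened within TIMEOUT seconds (default 3). Hypothetical helper
# built on bash's /dev/tcp redirection and coreutils `timeout`.
port_open() {
  local host="$1" port="$2" secs="${3:-3}"
  # Inside the child bash, $0 is HOST and $1 is PORT.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage from a machine in Cluster A, probing PartyB's proxy:
#   if port_open 192.168.9.3 30009; then
#     echo "PartyB proxy port reachable"
#   else
#     echo "PartyB proxy port NOT reachable"
#   fi
```

A successful probe only shows that the port is reachable over the network; whether the two FATE instances can actually exchange data is what toy_example verifies.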



Open a shell in PartyA's python container:

# Cluster A
[deploy-A]$ kubectl exec -it svc/fateflow -c python -n fate-10000 -- bash
(venv) [root@python-77bb96fd78-xnjgp python]#



Run the toy_example script:



(venv) [root@python-77bb96fd78-xnjgp python]# cd examples/toy_example/
(venv) [root@python-77bb96fd78-xnjgp toy_example]# python run_toy_example.py 10000 9999 1
stdout:{
"data": {
"board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202004270647307032491&role=guest&party_id=10000",
"job_dsl_path": "/data/projects/fate/python/jobs/202004270647307032491/job_dsl.json",
"job_runtime_conf_path": "/data/projects/fate/python/jobs/202004270647307032491/job_runtime_conf.json",
"logs_directory": "/data/projects/fate/python/logs/202004270647307032491",
"model_info": {
"model_id": "guest-10000#host-9999#model",
"model_version": "202004270647307032491"
}
},
"jobId": "202004270647307032491",
"retcode": 0,
"retmsg": "success"
}
job status is running
job status is running
job status is running
...



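If the result below appears, the two parties have connected successfully:

[Figure: toy_example result]

Open PartyA's FATE-Board (10000.fateboard.kubefate.net) for more detailed information about the task.

This completes the deployment of a FATE instance on each of the two Kubernetes clusters.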



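About the author: WeChat official account 亨利笔记. The author is a dedicated computer technology enthusiast, founder of the Harbor image registry (the first CNCF open-source project originating in China), a TSC member of the FATE federated learning open-source project, and a co-author of 《Harbor权威指南》 and 《区块链技术指南》. Interests include cloud native, artificial intelligence, and blockchain.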