kubeflow install
Basic Kubeflow installation
- Set environment variables
export KF_NAME=handson-kubeflow
export BASE_DIR=/home/${USER}
export KF_DIR=${BASE_DIR}/${KF_NAME}
# Candidate config URIs (only the last export takes effect; pick the one you need)
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml"
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/master/kfdef/kfctl_k8s_istio.v1.0.1.yaml"
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/master/kfdef/kfctl_k8s_istio.v1.0.2.yaml"
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/master/kfdef/kfctl_istio_dex.v1.0.2.yaml"
# Actual values used in this walkthrough (these override the values set above)
export KF_NAME=kf-sh
export BASE_DIR=/home/sh
export KF_DIR=${BASE_DIR}/${KF_NAME}
mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl build -V -f ${CONFIG_URI}
# The file name must match the config that kfctl build actually generated above
export CONFIG_FILE=${KF_DIR}/kfctl_k8s_istio.v1.0.0.yaml
kfctl apply -V -f ${CONFIG_FILE}
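`kfctl apply` returns before everything is running, so the components take a while to settle; a minimal progress check (assuming kubectl already points at the target cluster):

```bash
# Watch the Kubeflow pods come up; Ctrl-C to stop watching
kubectl get pods -n kubeflow -w
```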
- Install kubectl
# Install kubectl
curl -LO https://storage.googleapis.com/kubernetes-release/release/`curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
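A quick sanity check that the binary is installed and on the PATH (not part of the original notes):

```bash
# Print only the client version to confirm kubectl runs
kubectl version --client
```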
# Fetch kfctl for the Kubeflow install.
# Download the desired version from https://github.com/kubeflow/kfctl/releases
wget https://github.com/kubeflow/kfctl/releases/download/v1.0.2/kfctl_v1.0.2-0-ga476281_linux.tar.gz
# Extract the archive
tar -xvf kfctl_v1.0.2-0-ga476281_linux.tar.gz
# Put kfctl on the PATH - add the directory it was extracted into via an environment variable
export PATH=$PATH:$(pwd)
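kfctl can be checked the same way before moving on (a minimal check, assuming the tarball was extracted into the current directory):

```bash
# Confirm kfctl is on the PATH and print its version
kfctl version
```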
- Building the kustomize packages
# Add the two lines below to /etc/kubernetes/manifests/kube-apiserver.yaml
- --service-account-issuer=kubernetes.default.svc
- --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
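These flags enable service account token volume projection, which this Kubeflow/Istio setup relies on. A rough sanity check (a sketch, assuming a kubeadm cluster where kube-apiserver runs as a static pod and restarts automatically after the manifest edit):

```bash
# Verify the running kube-apiserver picked up the new flags
kubectl -n kube-system get pod -l component=kube-apiserver -o yaml \
  | grep -E "service-account-(issuer|signing-key-file)"
```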
-> An error occurs
WARN[0122] Encountered error applying application cert-manager: (kubeflow.error): Code 500 with messageError error when creating "/tmp/kout023299073": Internal error occurred: failed calling webhook "webhook.io": the server is currently unable to handle the request filename="kustomize/kustomize.go:202"
WARN[0122] Will retry in 21 seconds.
- Fix 1: suspected cause is istio-system
# Check
kubectl -n istio-system get pods -> many of the pods are down
kubectl describe namespace istio-system -> check the namespace status
(1)
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml"
kfctl apply -V -f ${CONFIG_URI}
(2) Run these back to back
wget https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml
kfctl apply -V -f ./kfctl_k8s_istio.v1.0.0.yaml
What is Istio?
Control-plane components such as Pilot, Mixer, Galley, Citadel, and the Gateway are installed as pods.
Istio is an open-source control-plane solution that uses the Envoy proxy as the main data-plane proxy and controls it.
kubectl -n istio-system get svc istio-sidecar-injector
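To see which namespaces currently have sidecar injection turned on or off (an extra check, not in the original notes):

```bash
# List every namespace with its istio-injection label
kubectl get namespace -L istio-injection
```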
- Fix 2: delete istio-system and try again
# Delete
kubectl delete namespaces istio-system
kubectl delete apiservice v1beta1.webhook.cert-manager.io
kubectl delete namespace cert-manager
kubectl label namespace your-namespace istio-injection=disabled
-> The deletion does not complete; the removal procedure is described further below
# Create the namespace
kubectl create namespace kubeflow-anonymous
# Try applying again
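One way to retry, reusing the variables set earlier (a sketch; assumes ${CONFIG_FILE} still points at the config that was built):

```bash
# Re-apply the Kubeflow config, then watch cert-manager come up
kfctl apply -V -f ${CONFIG_FILE}
kubectl get pods -n cert-manager -w
```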
- Fix 3
https://docs.projectcalico.org/getting-started/kubernetes/flannel/flannel
Follow the steps from the site above, then retry the installation
- **Deleting istio-system**
# The command below hangs without returning
kubectl delete ns istio-system
# Removing a namespace stuck in the Terminating state
-> Final approach
NAMESPACE=istio-system
kubectl get ns $NAMESPACE -o json > ${NAMESPACE}.json
# Open the JSON file and empty everything inside finalizers
vi ${NAMESPACE}.json
kubectl replace --raw "/api/v1/namespaces/istio-system/finalize" -f ./istio-system.json
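The same finalizer removal can be done without the manual vi edit (a sketch, assuming jq is installed):

```bash
# Strip the finalizers and feed the result straight into the namespace finalize API
kubectl get ns istio-system -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/istio-system/finalize" -f -
```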
- Installing with snap and trying it out
Ended up not using snap - confirmed it was a version problem
sudo yum install snapd
sudo systemctl start snapd.service
sudo ln -s /var/lib/snapd/snap /snap
sudo snap install microk8s --classic
snap refresh microk8s --beta
microk8s.enable dns storage dashboard
If a GPU is present:
microk8s.enable gpu
Enable Kubeflow:
microk8s.enable kubeflow
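After enabling the add-ons, microk8s can confirm that the node is ready (a quick check, assuming the snap commands above succeeded):

```bash
# Block until microk8s reports ready, then list every pod
microk8s.status --wait-ready
microk8s.kubectl get pods --all-namespaces
```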
- Port forwarding
export NAMESPACE=istio-system
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80
kubectl port-forward -n istio-system service/istio-ingressgateway 8080:80 --address=0.0.0.0
-> Error occurred:
an error occurred forwarding 8081 -> 443: error forwarding port 443 to pod 959e20b90486ab491d4dec86c25c4756bf0ead30f81f2bcd9a6d0b02aa0181b5, uid : exit status 1: 2020/06/25 16:31:35 socat[17328] E connect(5, AF=2 127.0.0.1:443, 16): Connection refused
Fix:
kubectl create deployment nginx --image=nginx
kubectl create service nodeport nginx --tcp=80:80
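As an alternative to keeping a port-forward session open, the ingress gateway service can be switched to NodePort (a sketch, not from the original notes; this permanently changes the service type):

```bash
# Expose istio-ingressgateway through a NodePort instead of port-forwarding
kubectl patch svc istio-ingressgateway -n istio-system -p '{"spec":{"type":"NodePort"}}'
# Print the node port that maps to port 80
kubectl get svc istio-ingressgateway -n istio-system \
  -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}'
```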
Issue -> the pipeline pods are down after a crash
kubeadm init --feature-gates CoreDNS=true
# Check
kubectl -n kubeflow get pods --selector=app=ml-pipeline
kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent
kubectl logs -n kubeflow ml-pipeline-persistenceagent-645cb66874-qmj9l --previous
- Verify the installation
kubectl -n kubeflow get all
kubectl get pods -n istio-system
kubectl get service -n istio-system
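Instead of re-running the get commands, kubectl can block until everything is Ready (an extra check, not in the original notes):

```bash
# Wait up to 10 minutes for every Kubeflow pod to become Ready
kubectl wait --for=condition=Ready pods --all -n kubeflow --timeout=600s
```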
- Other checks
kubectl get po -n cert-manager
kubectl describe po ~~~ -n cert-manager
kubectl get po -n istio-system
sudo vi /etc/environment -> check the no_proxy setting
# Open the proxy
kubectl proxy --address 0.0.0.0 --accept-hosts '.*'
- Resolving the Kubeflow installation errors one by one
- When accessing the dashboard: UNAVAILABLE:upstream connect error or disconnect/reset before headers. reset reason: connection failure
- Check
kubectl get pod -n kubeflow
NAME READY STATUS RESTARTS AGE
admission-webhook-bootstrap-stateful-set-0 1/1 Running 0 11m
admission-webhook-deployment-569558c8b6-gnl4k 1/1 Running 0 11m
application-controller-stateful-set-0 1/1 Running 0 12m
argo-ui-7ffb9b6577-gtv5x 1/1 Running 0 11m
centraldashboard-659bd78c-hn7zr 1/1 Running 0 11m
jupyter-web-app-deployment-679d5f5dc4-j7l89 1/1 Running 0 11m
katib-controller-7f58569f7d-jqxb2 1/1 Running 1 11m
katib-db-manager-54b66f9f9d-mnkld 0/1 CrashLoopBackOff 6 11m
katib-mysql-dcf7dcbd5-dqldn 0/1 Pending 0 11m
katib-ui-6f97756598-8fk75 1/1 Running 0 11m
kfserving-controller-manager-0 2/2 Running 1 11m
metacontroller-0 1/1 Running 0 11m
metadata-db-65fb5b695d-fxpd4 0/1 Pending 0 11m
metadata-deployment-65ccddfd4c-zjxr6 0/1 Running 0 11m
metadata-envoy-deployment-7754f56bff-5bf7q 1/1 Running 0 11m
metadata-grpc-deployment-75f9888cbf-ksq77 1/1 Running 4 11m
metadata-ui-7c85545947-gwn9g 1/1 Running 0 11m
minio-69b4676bb7-h9vgf 0/1 Pending 0 11m
ml-pipeline-5cddb75848-rf2xk 1/1 Running 1 11m
ml-pipeline-ml-pipeline-visualizationserver-7f6fcb68c8-x96sw 1/1 Running 0 11m
ml-pipeline-persistenceagent-6ff9fb86dc-cqqr5 0/1 CrashLoopBackOff 3 11m
ml-pipeline-scheduledworkflow-7f84b54646-7vhjm 1/1 Running 0 11m
ml-pipeline-ui-6758f58868-tsz6h 1/1 Running 0 11m
ml-pipeline-viewer-controller-deployment-745dbb444d-js4xw 1/1 Running 0 11m
mysql-6bcbfbb6b8-kxddt 0/1 Pending 0 11m
notebook-controller-deployment-5c55f5845b-4w2tl 1/1 Running 0 11m
profiles-deployment-78f694bffb-brkgw 2/2 Running 0 11m
pytorch-operator-cf8c5c497-9mdcd 1/1 Running 0 11m
seldon-controller-manager-6b4b969447-hrv97 1/1 Running 0 11m
spark-operatorcrd-cleanup-nch68 0/2 Completed 0 11m
spark-operatorsparkoperator-76dd5f5688-bht8b 1/1 Running 0 11m
spartakus-volunteer-5dc96f4447-l882f 1/1 Running 0 11m
tensorboard-5f685f9d79-6mnbc 1/1 Running 0 11m
tf-job-operator-5fb85c5fb7-sz7k2 1/1 Running 0 11m
workflow-controller-689d6c8846-p56g7 1/1 Running 0 11m
1. Check the spark error first
kubectl describe pod -n kubeflow spark-operatorcrd-cleanup-nch68
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12m default-scheduler Successfully assigned kubeflow/spark-operatorcrd-cleanup-nch68 to master-node-40
Normal Pulled 12m kubelet, master-node-40 Container image "gcr.io/spark-operator/spark-operator:v1beta2-1.0.0-2.4.4" already present on machine
Normal Created 12m kubelet, master-node-40 Created container delete-sparkapp-crd
Normal Started 12m kubelet, master-node-40 Started container delete-sparkapp-crd
Normal Pulled 12m kubelet, master-node-40 Container image "gcr.io/spark-operator/spark-operator:v1beta2-1.0.0-2.4.4" already present on machine
Normal Created 12m kubelet, master-node-40 Created container delete-scheduledsparkapp-crd
Normal Started 12m kubelet, master-node-40 Started container delete-scheduledsparkapp-crd
Normal SandboxChanged 12m kubelet, master-node-40 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 12m kubelet, master-node-40 Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "spark-operatorcrd-cleanup-nch68": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:315: copying bootstrap data to pipe caused \"write init-p: broken pipe\"": unknown
-> A kernel update is required: https://myksb1223.github.io/develop_diary/2018/08/01/Centos-kernel-update.html
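A minimal sketch of the CentOS 7 kernel update the linked post describes, assuming the ELRepo repository is used (verify against the post before running; a reboot is required):

```bash
# Check the current kernel version
uname -r
# Install the ELRepo repository and a mainline kernel
sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
sudo yum install -y https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm
sudo yum --enablerepo=elrepo-kernel install -y kernel-ml
# Boot into the newly installed kernel
sudo grub2-set-default 0
sudo reboot
```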
References
- Istio explanation
  https://gruuuuu.github.io/cloud/service-mesh-istio/#
- Istio management and issue handling
  https://github.com/istio/istio/issues/21058
  https://github.com/kubeflow/kubeflow/issues/4762
  https://success.docker.com/article/kubernetes-namespace-stuck-in-terminating
  https://github.com/kubeflow/kubeflow/issues/4856
- Installing on GCP - Istio documentation
  https://www.kubeflow.org/docs/started/k8s/kfctl-k8s-istio/
  https://www.kubeflow.org/docs/started/k8s/kfctl-istio-dex/#notes-on-the-configuration-file
- Recommended installation using snap
  https://github.com/kubeflow/kubeflow/issues/4198
  https://ubuntu.com/kubeflow/install