Easy way for :Distributed Tensorflow

tensorflow/k8s : Tools for ML/Tensorflow on Kubernetes.

Architecture:

  • Run a job operator by Helm
  • Post your job just like a simple yaml file
  • Job Operator will help you to distributed to each node for computing result
    • PS
    • Worker

Sample Job Kubernetes YAML file

apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: PS

Sample Distributed Tensorflow Code

Refer here.

Troubleshooting: Could not enable default namespace on Helm.

kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'      
helm init --service-account tiller --upgrade

GKE: Enable GPU on GKE

gcloud alpha container clusters create gpu-test \
    --project $PROJECT_ID \
    --zone $ZONE \
    --enable-kubernetes-alpha \
    --enable-cloud-logging \
    --enable-cloud-monitoring \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --machine-type n1-standard-1 \
    --cluster-version=1.8.4-gke.1 \
    --image-type $IMAGE_TYPE \
    --num-nodes 1 \
    --quiet

Trobleshooting:

  1. K8S alpha don’t support master version upgrade, so you need define k8s 1.8 when you create it. (default: 1.7.8)
  2. How to get GKE current support versions?
    1. gcloud container get-server-config

Evan

Attitude is everything