Installation

Make sure to read Prerequisites before installing mlbench.

All guides assume you have checked out the mlbench GitHub repository and have a terminal open in the checked-out mlbench directory.
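
For example (the repository URL below assumes the main mlbench repository; adjust it if you use a fork):

$ git clone https://github.com/mlbench/mlbench.git
$ cd mlbench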

Helm Chart values

Since every Kubernetes cluster is different, there are no reasonable defaults for some values, so the following properties have to be set. You can save them in a YAML file of your choosing; this guide assumes you saved them in myvalues.yaml. For a reference of all configurable values, you can copy the charts/mlbench/values.yaml file to myvalues.yaml. A filled-in example follows the list of properties below.
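
For example, to start from the reference file:

$ cp charts/mlbench/values.yaml myvalues.yaml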

limits:
  workers:
  cpu:
  bandwidth:
  gpu:

gcePersistentDisk:
  enabled:
  pdName:

  • limits.workers is the maximum number of worker nodes available to mlbench. This sets the maximum number of nodes that can be selected for an experiment in the UI. By default, mlbench starts 2 workers on startup.
  • limits.cpu is the maximum number of CPUs (cores) available on each worker node, in Kubernetes notation (8 or 8000m for 8 CPUs/cores). This is also the maximum number of cores that can be selected for an experiment in the UI.
  • limits.bandwidth is the maximum network bandwidth available between workers, in Mbit per second. This is the default bandwidth used and the maximum value selectable in the UI.
  • limits.gpu is the number of GPUs requested by each worker pod.
  • gcePersistentDisk.enabled creates the resources related to the NFS persistentVolume and persistentVolumeClaim.
  • gcePersistentDisk.pdName is the name of an existing persistent disk in GKE.
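
As a concrete example, a myvalues.yaml for a small cluster could look like the following. The numbers and the disk name are placeholders to adjust to your cluster; set gcePersistentDisk.enabled to false if you are not on GKE or have not created a persistent disk (see the notes below):

limits:
  workers: 2
  cpu: 4000m
  bandwidth: 1000
  gpu: 0

gcePersistentDisk:
  enabled: true
  pdName: my-pd-name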

Caution

If you set workers, cpu or gpu higher than what is available in your cluster, Kubernetes will not be able to allocate nodes to mlbench and the deployment will hang indefinitely, without throwing an error. Kubernetes will simply wait until nodes that fit the requirements become available, so make sure your cluster actually has the resources you requested.
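
To check what each node can actually allocate before setting these limits, you can inspect the node resources with kubectl (the grep is just a convenience filter):

$ kubectl describe nodes | grep -A 5 Allocatable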

Note

To use GPUs in the cluster, the NVIDIA device plugin has to be installed. See Plugins for details.
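
If neither your cloud provider nor the chart installs it for you, the upstream plugin can be deployed manually. The manifest URL and version below are an assumption; check the NVIDIA k8s-device-plugin repository for the release matching your Kubernetes version:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml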

Note

Use a command like the following to create the persistent disk:

$ gcloud compute disks create --size=10G --zone=europe-west1-b my-pd-name

Note

The GCE persistent disk will be mounted to the /datasets/ directory on each worker.
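
To verify the mount once the deployment is running, you can list the directory from inside a worker pod (the pod name is a placeholder; use whatever kubectl get pods reports for a worker):

$ kubectl get pods
$ kubectl exec -it <worker-pod-name> -- ls /datasets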

Basic Install

Set the Helm Chart values

Use helm to install the mlbench chart (Replace ${RELEASE_NAME} with a name of your choice):

$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install ${RELEASE_NAME} charts/mlbench

Follow the instructions at the end of the helm install to get the dashboard URL. E.g.:

$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install rel charts/mlbench
  [...]
  NOTES:
  1. Get the application URL by running these commands:
     export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services rel-mlbench-master)
     export NODE_IP=$(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}")
     echo http://$NODE_IP:$NODE_PORT

Running these commands prints the URL at which the dashboard is accessible.

Plugins

In myvalues.yaml, Kubernetes plugins can optionally be installed by turning the corresponding flags on or off; see the reference charts/mlbench/values.yaml for the available options.

Google Cloud / Google Kubernetes Engine

Set the Helm Chart values

Important

Make sure to read the prerequisites for Google Cloud.

Please make sure that kubectl is configured correctly.

Caution

Google installs several pods on each node by default, which reduces the available CPU; this overhead can take up to 0.5 CPU cores per node. Make sure to provision VMs that have at least 1 more core than the number of cores you want to use for your mlbench experiment. See here for further details on node limits.
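
As an illustration, the following provisions a GKE cluster of n1-standard-4 machines (4 vCPUs each), leaving roughly 3.5 cores per node for mlbench; the cluster name, zone, machine type and node count are placeholders:

$ gcloud container clusters create my-mlbench-cluster \
      --zone europe-west1-b \
      --machine-type n1-standard-4 \
      --num-nodes 3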

Install mlbench (Replace ${RELEASE_NAME} with a name of your choice):

$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install ${RELEASE_NAME} charts/mlbench

To access mlbench, run the following commands and open the URL that is returned (note: the default instructions printed by helm on the command line return the internal cluster IP only):

$ export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services ${RELEASE_NAME}-mlbench-master)
$ export NODE_IP=$(gcloud compute instances list|grep $(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}") |awk '{print $5}')
$ gcloud compute firewall-rules create --quiet mlbench --allow tcp:$NODE_PORT
$ echo http://$NODE_IP:$NODE_PORT

Danger

The gcloud compute firewall-rules create command above opens a port to the public internet on Google Cloud. Make sure to delete the rule once it is no longer needed:

$ gcloud compute firewall-rules delete --quiet mlbench

Hint

If you want to build the Docker images yourself and host them in the Google Container Registry (GCR), follow these steps:

Authenticate with GCR:

$ gcloud auth configure-docker

Build and push the Docker images (replace <gcloud project name> with the name of your project):

$ make publish-docker component=master docker_registry=gcr.io/<gcloud project name>
$ make publish-docker component=worker docker_registry=gcr.io/<gcloud project name>

Use the following settings for your myvalues.yaml file when installing with helm:

master:

  image:
    repository: gcr.io/<gcloud project name>/mlbench_master
    tag: latest
    pullPolicy: Always


worker:

  image:
    repository: gcr.io/<gcloud project name>/mlbench_worker
    tag: latest
    pullPolicy: Always

Minikube

Minikube allows running a single-node Kubernetes cluster inside a VM on your laptop, for users looking to try out Kubernetes or to develop with it.

This section describes how to install mlbench on minikube.

Set the Helm Chart values

First, build the Docker images and push them to a private registry at localhost:5000 (this assumes a registry is already listening on that port on the host; see the note after the commands if you need to start one).

$ make publish-docker component=master docker_registry=localhost:5000
$ make publish-docker component=worker docker_registry=localhost:5000
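
If no registry is listening on localhost:5000 yet, a standard Docker registry container can be started on the host first; this is a generic registry setup, not something provided by mlbench:

$ docker run -d -p 5000:5000 --restart=always --name registry registry:2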

Then start the minikube cluster:

$ minikube start

Use tcp-proxy to forward the node's port 5000 to the host's port 5000, so that images can be pulled from the local registry inside the cluster.

$ minikube ssh
$ docker run --name registry-proxy -d -e LISTEN=':5000' -e TALK="$(/sbin/ip route|awk '/default/ { print $3 }'):5000" -p 5000:5000 tecnativa/tcp-proxy

Images from the private registry can now be pulled inside the cluster; you can verify this with docker pull localhost:5000/mlbench_master:latest.

Next, install or upgrade the helm chart with the desired configuration, using ${RELEASE_NAME} as the release name:

$ helm init --kube-context minikube --wait
$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install ${RELEASE_NAME} charts/mlbench

Note

Minikube runs a single-node Kubernetes cluster inside a VM, so replicaCount has to be set to 1 in myvalues.yaml.
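
In myvalues.yaml this is a one-line override; the snippet assumes the chart exposes replicaCount at the top level, so adjust the nesting if your chart version defines it elsewhere:

replicaCount: 1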

Once the installation is finished, obtain the URL:

$ export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services ${RELEASE_NAME}-mlbench-master)
$ export NODE_IP=$(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}")
$ echo http://$NODE_IP:$NODE_PORT

Now the mlbench dashboard should be available at http://${NODE_IP}:${NODE_PORT}.

Note

To access http://$NODE_IP:$NODE_PORT outside minikube, run the following command on the host:

$ ssh -i ${MINIKUBE_HOME}/.minikube/machines/minikube/id_rsa -N -f -L localhost:${NODE_PORT}:${NODE_IP}:${NODE_PORT} docker@$(minikube ip)

where ${MINIKUBE_HOME} defaults to $HOME. The mlbench dashboard can then be viewed at http://localhost:${NODE_PORT}.

Docker-in-Docker (DIND)

Docker-in-Docker allows simulating multiple nodes locally on a single machine. This is useful for development.

Hint

For development purposes, it makes sense to use a local Docker registry with DIND as well.

Describing how to set up a local registry would be too long for this guide, so here are some pointers:

  • You can find a setup guide here.
  • This page details setting up an image pull secret.
  • This page details adding an image pull secret to a Kubernetes service account; a minimal sketch of these two steps is also shown below, after the myvalues.yaml example.
  • You can use dind-proxy.sh in the mlbench repository to forward the registry port (5000) to the Kubernetes DIND cluster.

Download the kubeadm-dind-cluster script.

$ wget https://cdn.rawgit.com/kubernetes-sigs/kubeadm-dind-cluster/master/fixed/dind-cluster-v1.11.sh
$ chmod +x dind-cluster-v1.11.sh

For networking to work in DIND, we need to set a CNI Plugin. In our experience, weave works well with DIND.

$ export CNI_PLUGIN=weave

Now we can start the local cluster with

$ ./dind-cluster-v1.11.sh up

This might take a couple of minutes.
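
Once it is up, you can check that the simulated nodes are ready:

$ kubectl get nodes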

Hint

If you’re using a local docker registry, run dind-proxy.sh after the previous step.

Install helm (See Prerequisites) and set the Helm Chart values.

Hint

For a local registry, build and push the master and worker images:

$ make publish-docker component=master docker_registry=localhost:5000
$ make publish-docker component=worker docker_registry=localhost:5000

Also, make sure you have an imagePullSecret added to the Kubernetes service account, and set the repository and secret in your myvalues.yaml file (regcred in this example):

master:
  imagePullSecret: regcred

  image:
    repository: localhost:5000/mlbench_master
    tag: latest
    pullPolicy: Always


worker:
  imagePullSecret: regcred

  image:
    repository: localhost:5000/mlbench_worker
    tag: latest
    pullPolicy: Always
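
If the regcred secret does not exist yet, it can be created and attached to the default service account with standard kubectl commands. This is a minimal sketch; the credentials are placeholders, since an unauthenticated local registry ignores them:

$ kubectl create secret docker-registry regcred \
      --docker-server=localhost:5000 \
      --docker-username=dev --docker-password=dev --docker-email=dev@example.com
$ kubectl patch serviceaccount default \
      -p '{"imagePullSecrets": [{"name": "regcred"}]}'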

Install mlbench (the example below uses rel as the release name):

$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install rel charts/mlbench
  [...]
  NOTES:
  1. Get the application URL by running these commands:
     export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services rel-mlbench-master)
     export NODE_IP=$(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}")
     echo http://$NODE_IP:$NODE_PORT

Run the three commands printed in the NOTES section; this prints the URL at which the dashboard is accessible.