Component Overview

[Figure: Deployment architecture overview, showing the relation between Worker and Master]

mlbench consists of two components: the Master and the Worker Docker containers.

Master

The Master contains the Dashboard, the main interface of the project. The Dashboard allows you to start and run distributed ML experiments and visualizes their progress and results. It also allows you to manage the mlbench nodes in the Kubernetes cluster, and for most users it constitutes the sole way of interacting with the mlbench project.

It also contains a REST API that can be used instead of the Dashboard; the API also receives data from the Dashboard.

The Master also manages the StatefulSet of Worker nodes through the Kubernetes API.
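
As an illustration, scaling a StatefulSet through the Kubernetes API could look like the following Python sketch. The StatefulSet name, namespace, and replica count are assumptions for illustration and are not taken from the mlbench source:

from kubernetes import client, config

# The Master runs inside the cluster, so in-cluster configuration applies.
config.load_incluster_config()
apps = client.AppsV1Api()

# Scale the (assumed) Worker StatefulSet to the number of workers
# requested for a run.
apps.patch_namespaced_stateful_set_scale(
    name="mlbench-worker",   # assumed StatefulSet name
    namespace="default",     # assumed namespace
    body={"spec": {"replicas": 4}},
)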

Dashboard

MLBench comes with a dashboard to manage and monitor the cluster and jobs.

Main Page

[Screenshot: Dashboard Main Page]

The main view shows all MLBench worker nodes and their current status.

Runs Page

[Screenshot: Dashboard Runs Page]

The Runs page allows you to start a new experiment on the worker nodes. You can select how many workers to use, how many CPU cores each worker can utilize, and the maximum network bandwidth available to the workers.

Run Details Page

[Screenshot: Dashboard Run Details Page]

The Run Details page shows the progress and result of an experiment. You can track metrics like train loss and validation accuracy as well as see the stdout and stderr logs of all workers.

It also allows you to download all metrics of a run, as well as the resource usage of all workers participating in the run, as JSON files.

REST API

MLBench provides a basic REST API through which most functionality can also be used. It is accessible through the /api/ endpoints on the Dashboard URL.
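
Since it is plain HTTP, the API can be used from any HTTP client. A minimal Python sketch using the requests library, with the placeholder host from the examples below:

import requests

base_url = "http://example.com/api"  # substitute your Dashboard URL

# List all Worker pods and print their name and status.
pods = requests.get(base_url + "/pods/", headers={"Accept": "application/json"}).json()
for pod in pods:
    print(pod["name"], pod["phase"])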

Pods

GET /api/pods/

Lists all Worker Pods available in the cluster, including status information.

Example request:

GET /api/pods/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

[
  {
    "name":"worn-mouse-mlbench-worker-55bbdd4d8c-4mxh5",
    "labels":"{'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '1166880847', 'release': 'worn-mouse'}",
    "phase":"Running",
    "ip":"10.244.2.58"
  },
  {
    "name":"worn-mouse-mlbench-worker-55bbdd4d8c-bwwsp",
    "labels":"{'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '1166880847', 'release': 'worn-mouse'}",
    "phase":"Running",
    "ip":"10.244.3.57"
  }
]
Request Headers:

  • Accept – the response content type depends on the Accept header

Response Headers:

  • Content-Type – depends on the Accept header of the request

Status Codes:

  • 200 OK – no error

Metrics

GET /api/metrics/

Get metrics (CPU, memory, etc.) for all Worker Pods.

Example request:

GET /api/metrics/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

{
  "quiet-mink-mlbench-worker-0": {
      "container_cpu_usage_seconds_total": [
          {
              "date": "2018-08-03T09:21:38.594282Z",
              "value": "0.188236813"
          },
          {
              "date": "2018-08-03T09:21:50.244277Z",
              "value": "0.215950298"
          }
      ]
  },
  "quiet-mink-mlbench-worker-1": {
      "container_cpu_usage_seconds_total": [
          {
              "date": "2018-08-03T09:21:29.347960Z",
              "value": "0.149286015"
          },
          {
              "date": "2018-08-03T09:21:44.266181Z",
              "value": "0.15325329"
          }
      ],
      "container_cpu_user_seconds_total": [
          {
              "date": "2018-08-03T09:21:29.406238Z",
              "value": "0.1"
          },
          {
              "date": "2018-08-03T09:21:44.331823Z",
              "value": "0.1"
          }
      ]
  }
}
Request Headers:

  • Accept – the response content type depends on the Accept header

Response Headers:

  • Content-Type – depends on the Accept header of the request

Status Codes:

  • 200 OK – no error

GET /api/metrics/(str: pod_name_or_run_id)/

Get metrics (CPU, memory, etc.) for a specific Worker Pod or run.

Example request:

GET /api/metrics/quiet-mink-mlbench-worker-1/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

{
  "container_cpu_usage_seconds_total": [
      {
          "date": "2018-08-03T09:21:29.347960Z",
          "value": "0.149286015"
      },
      {
          "date": "2018-08-03T09:21:44.266181Z",
          "value": "0.15325329"
      }
  ],
  "container_cpu_user_seconds_total": [
      {
          "date": "2018-08-03T09:21:29.406238Z",
          "value": "0.1"
      },
      {
          "date": "2018-08-03T09:21:44.331823Z",
          "value": "0.1"
      }
  ]
}
Query Parameters:

  • since – only return metrics newer than this date (default: 1970-01-01T00:00:00.000000Z)
  • metric_type – one of pod or run, determining which kind of metric to return (default: pod)

Request Headers:

  • Accept – the response content type depends on the Accept header

Response Headers:

  • Content-Type – depends on the Accept header of the request

Status Codes:

  • 200 OK – no error
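
A minimal sketch of using these query parameters from Python with the requests library; the run id and timestamp are illustrative:

import requests

# Fetch only the metrics of run 2 that are newer than the given date.
resp = requests.get(
    "http://example.com/api/metrics/2/",
    params={"since": "2018-08-03T09:21:29.000000Z", "metric_type": "run"},
    headers={"Accept": "application/json"},
)
metrics = resp.json()
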
POST /api/metrics/

Saves a metric. “pod_name” and “run_id” are mutually exclusive; a metric belongs either to a pod or to a run. The fields of metrics and their types are defined in mlbench/api/models/kubemetrics.py.

Example request:

POST /api/metrics/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

{
  "pod_name": "quiet-mink-mlbench-worker-1",
  "name": "accuracy",
  "date": "2018-08-03T09:21:44.331823Z",
  "value": "0.7845",
  "cumulative": false,
  "metadata": "some additional data"
}

Example response:

HTTP/1.1 201 CREATED
Vary: Accept
Content-Type: text/javascript

{
  "pod_name": "quiet-mink-mlbench-worker-1",
  "name": "accuracy",
  "date": "2018-08-03T09:21:44.331823Z",
  "value": "0.7845",
  "cumulative": False,
  "metadata": "some additional data"
}
Request Headers:

  • Accept – the response content type depends on the Accept header

Response Headers:

  • Content-Type – depends on the Accept header of the request

Status Codes:

  • 201 Created – the metric was saved
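
Complementing the pod-based example above, a minimal Python sketch reporting a run-based metric; since “pod_name” and “run_id” are mutually exclusive, only “run_id” is sent (all values are illustrative):

import requests

requests.post(
    "http://example.com/api/metrics/",
    json={
        "run_id": 2,
        "name": "accuracy",
        "date": "2018-08-03T09:21:44.331823Z",
        "value": "0.7845",
        "cumulative": False,  # serialized to JSON false
        "metadata": "some additional data",
    },
)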

Runs

GET /api/runs/

Gets all active/failed/finished runs

Example request:

GET /api/runs/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

[
  {
    "id": 1,
    "name": "Name of the run",
    "created_at": "2018-08-03T09:21:29.347960Z",
    "state": "STARTED",
    "job_id": "5ec9f286-e12d-41bc-886e-0174ef2bddae",
    "job_metadata": {...}
  },
  {
    "id": 2,
    "name": "Another run",
    "created_at": "2018-08-02T08:11:22.123456Z",
    "state": "FINISHED",
    "job_id": "add4de0f-9705-4618-93a1-00bbc8d9498e",
    "job_metadata": {...}
  }
]
Request Headers:

  • Accept – the response content type depends on the Accept header

Response Headers:

  • Content-Type – depends on the Accept header of the request

Status Codes:

  • 200 OK – no error

GET /api/runs/(int: run_id)/

Gets a run by id

Example request:

GET /api/runs/1/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

{
  "id": 1,
  "name": "Name of the run",
  "created_at": "2018-08-03T09:21:29.347960Z",
  "state": "STARTED",
  "job_id": "5ec9f286-e12d-41bc-886e-0174ef2bddae",
  "job_metadata": {...}
}

Parameters:

  • run_id (int) – The id of the run

Request Headers:

  • Accept – the response content type depends on the Accept header

Response Headers:

  • Content-Type – depends on the Accept header of the request

Status Codes:

  • 200 OK – no error

POST /api/runs/

Starts a new Run

Example request:

POST /api/runs/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Request JSON Object:

  • name (string) – Name of the run
  • num_workers (int) – Number of worker nodes for the run
  • num_cpus (json) – Number of CPU cores utilized by each worker
  • max_bandwidth (json) – Maximum egress bandwidth in Mbps for each worker

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

{
  "id": 1,
  "name": "Name of the run",
  "created_at": "2018-08-03T09:21:29.347960Z",
  "state": "STARTED",
  "job_id": "5ec9f286-e12d-41bc-886e-0174ef2bddae",
  "job_metadata": {...}
}
Request Headers:

  • Accept – the response content type depends on the Accept header

Response Headers:

  • Content-Type – depends on the Accept header of the request

Status Codes:

  • 200 OK – no error
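
For completeness, a minimal Python sketch of starting a run through this endpoint. The exact JSON encoding of num_cpus and max_bandwidth is an assumption, since the fields above are only documented as JSON values:

import requests

# Start a run with two workers, two cores each, and a 100 Mbps cap.
run = requests.post(
    "http://example.com/api/runs/",
    json={
        "name": "Name of the run",
        "num_workers": 2,
        "num_cpus": 2,         # assumed encoding
        "max_bandwidth": 100,  # assumed encoding, in Mbps
    },
).json()
print(run["id"], run["state"])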

Worker

The Worker image contains all the boilerplate code needed for a distributed ML model, as well as the actual model code. It takes care of training the distributed model according to the settings provided by the Master: the Master informs the Worker nodes which experiment the user would like to run, and the Workers execute the relevant code on their own.

Worker nodes send status information to the metrics API of the Master to inform it of the progress and state of the current run.
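
A sketch of what such a status report could look like, reusing the POST /api/metrics/ endpoint described above; the pod name, metric values, and Master address are illustrative:

import requests

# A Worker reporting its current training loss to the Master.
requests.post(
    "http://example.com/api/metrics/",
    json={
        "pod_name": "quiet-mink-mlbench-worker-0",
        "name": "train_loss",
        "date": "2018-08-03T09:22:00.000000Z",
        "value": "1.2345",
        "cumulative": False,
        "metadata": "",
    },
)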