Component Overview
mlbench consists of two components, the Master and the Worker Docker containers.
Master
The master contains the Dashboard, the main interface for the project. The dashboard allows you to start and run a distributed ML experiment and visualizes the progress and result of the experiment. It allows management of the mlbench nodes in the Kubernetes cluster and for most users constitutes the sole way they interact with the mlbench project.
It also contains a REST API that can be used instead of the Dashboard and that also receives data from the Dashboard.
The Master also manages the StatefulSet of workers through the Kubernetes API.
Dashboard
MLBench comes with a dashboard to manage and monitor the cluster and jobs.
Runs Page
The Runs page allows you to start a new experiment on the worker nodes. You can select how many workers to use, how many CPU cores each worker can utilize, and limit the network bandwidth available to workers.
Run Details Page
The Run Details page shows the progress and result of an experiment. You can track metrics like train loss and validation accuracy, as well as see the stdout and stderr logs of all workers.
It also allows you to download all the metrics of a run, as well as the resource usage of all workers participating in the run, as JSON files.
REST API
MLBench provides a basic REST API through which most functionality can also be used. It’s accessible through the /api/ endpoints on the dashboard URL.
Pods
GET /api/pods/
All Worker Pods available in the cluster, including status information.
Example request:
    GET /api/pods/ HTTP/1.1
    Host: example.com
    Accept: application/json, text/javascript

Example response:
    HTTP/1.1 200 OK
    Vary: Accept
    Content-Type: text/javascript

    [
      {
        "name": "worn-mouse-mlbench-worker-55bbdd4d8c-4mxh5",
        "labels": "{'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '1166880847', 'release': 'worn-mouse'}",
        "phase": "Running",
        "ip": "10.244.2.58"
      },
      {
        "name": "worn-mouse-mlbench-worker-55bbdd4d8c-bwwsp",
        "labels": "{'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '1166880847', 'release': 'worn-mouse'}",
        "phase": "Running",
        "ip": "10.244.3.57"
      }
    ]
Request Headers: - Accept – the response content type depends on Accept header
Response Headers: - Content-Type – this depends on Accept header of request
Status Codes: - 200 OK – no error
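In the example response above, "labels" arrives as a string holding a Python-style dict literal rather than as nested JSON. The following is a minimal sketch of how a client might decode it, assuming that format holds for every pod; parse_pod, fetch_pods, and the dashboard_url argument are hypothetical names and not part of MLBench.

```python
import ast
import json
import urllib.request


def parse_pod(pod):
    """Return a copy of a pod entry with its "labels" string decoded into a dict."""
    decoded = dict(pod)
    # The example response uses single quotes inside the labels string, so
    # json.loads would reject it; ast.literal_eval parses the literal safely.
    decoded["labels"] = ast.literal_eval(pod["labels"])
    return decoded


def fetch_pods(dashboard_url):
    """GET /api/pods/ on the given dashboard URL (an assumption) and decode each pod."""
    request = urllib.request.Request(
        dashboard_url.rstrip("/") + "/api/pods/",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return [parse_pod(pod) for pod in json.load(response)]
```

Keeping the decoding step separate from the HTTP call makes it possible to check the parsing against a saved response without a running cluster.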
Metrics
GET /api/metrics/
Get metrics (CPU, memory, etc.) for all Worker Pods.
Example request:
    GET /api/metrics/ HTTP/1.1
    Host: example.com
    Accept: application/json, text/javascript

Example response:
    HTTP/1.1 200 OK
    Vary: Accept
    Content-Type: text/javascript

    {
      "quiet-mink-mlbench-worker-0": {
        "container_cpu_usage_seconds_total": [
          { "date": "2018-08-03T09:21:38.594282Z", "value": "0.188236813" },
          { "date": "2018-08-03T09:21:50.244277Z", "value": "0.215950298" }
        ]
      },
      "quiet-mink-mlbench-worker-1": {
        "container_cpu_usage_seconds_total": [
          { "date": "2018-08-03T09:21:29.347960Z", "value": "0.149286015" },
          { "date": "2018-08-03T09:21:44.266181Z", "value": "0.15325329" }
        ],
        "container_cpu_user_seconds_total": [
          { "date": "2018-08-03T09:21:29.406238Z", "value": "0.1" },
          { "date": "2018-08-03T09:21:44.331823Z", "value": "0.1" }
        ]
      }
    }
Request Headers: - Accept – the response content type depends on Accept header
Response Headers: - Content-Type – this depends on Accept header of request
Status Codes: - 200 OK – no error
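As an illustration of consuming this response, the sketch below turns the samples of container_cpu_usage_seconds_total into an average usage rate per pod. It assumes that the metric is a cumulative counter of CPU seconds (as its name suggests) and that samples are ordered by date; average_rate and cpu_rates are hypothetical helper names.

```python
from datetime import datetime

# Timestamp layout used in the example responses.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def average_rate(samples):
    """Average per-second increase between the first and last sample, or None."""
    if len(samples) < 2:
        return None
    first, last = samples[0], samples[-1]
    elapsed = (
        datetime.strptime(last["date"], DATE_FORMAT)
        - datetime.strptime(first["date"], DATE_FORMAT)
    ).total_seconds()
    if elapsed <= 0:
        return None
    # Values are transmitted as strings, so convert before subtracting.
    return (float(last["value"]) - float(first["value"])) / elapsed


def cpu_rates(metrics):
    """Map each pod name to its average CPU usage rate, skipping pods without data."""
    rates = {}
    for pod_name, pod_metrics in metrics.items():
        rate = average_rate(pod_metrics.get("container_cpu_usage_seconds_total", []))
        if rate is not None:
            rates[pod_name] = rate
    return rates
```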
GET /api/metrics/(str: pod_name_or_run_id)/
Get metrics (CPU, memory, etc.) for a single Worker Pod or run.
Example request:
    GET /api/metrics/quiet-mink-mlbench-worker-1/ HTTP/1.1
    Host: example.com
    Accept: application/json, text/javascript
Example response:
    HTTP/1.1 200 OK
    Vary: Accept
    Content-Type: text/javascript

    {
      "container_cpu_usage_seconds_total": [
        { "date": "2018-08-03T09:21:29.347960Z", "value": "0.149286015" },
        { "date": "2018-08-03T09:21:44.266181Z", "value": "0.15325329" }
      ],
      "container_cpu_user_seconds_total": [
        { "date": "2018-08-03T09:21:29.406238Z", "value": "0.1" },
        { "date": "2018-08-03T09:21:44.331823Z", "value": "0.1" }
      ]
    }
Query Parameters:
- since – only get metrics newer than this date (default: 1970-01-01T00:00:00.000000Z)
- metric_type – one of pod or run, to determine what kind of metric to get (default: pod)
Request Headers: - Accept – the response content type depends on Accept header
Response Headers: - Content-Type – this depends on Accept header of request
Status Codes: - 200 OK – no error
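The two query parameters can be combined with the path parameter as in the following sketch, which only builds the URL and leaves the request itself to the caller. The function name metrics_url and the dashboard_url argument are hypothetical; the parameter names and allowed values are taken from the list above.

```python
from urllib.parse import quote, urlencode


def metrics_url(dashboard_url, name_or_id, since=None, metric_type="pod"):
    """Build the URL for GET /api/metrics/(pod_name_or_run_id)/ with its query string."""
    if metric_type not in ("pod", "run"):
        raise ValueError("metric_type must be 'pod' or 'run'")
    params = {"metric_type": metric_type}
    if since is not None:
        # Same timestamp layout as the documented default value.
        params["since"] = since
    path = "/api/metrics/{}/".format(quote(str(name_or_id), safe=""))
    return dashboard_url.rstrip("/") + path + "?" + urlencode(params)
```

Passing the timestamp of the newest sample already received as since would let a client fetch only new data on each poll.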
POST /api/metrics
Save metrics. “pod_name” and “run_id” are mutually exclusive. The fields of metrics and their types are defined in mlbench/api/models/kubemetrics.py.
Example request:
    POST /api/metrics HTTP/1.1
    Host: example.com
    Accept: application/json, text/javascript

    {
      "pod_name": "quiet-mink-mlbench-worker-1",
      "name": "accuracy",
      "date": "2018-08-03T09:21:44.331823Z",
      "value": "0.7845",
      "cumulative": false,
      "metadata": "some additional data"
    }
Example response:
    HTTP/1.1 201 CREATED
    Vary: Accept
    Content-Type: text/javascript

    {
      "pod_name": "quiet-mink-mlbench-worker-1",
      "name": "accuracy",
      "date": "2018-08-03T09:21:44.331823Z",
      "value": "0.7845",
      "cumulative": false,
      "metadata": "some additional data"
    }
Request Headers: - Accept – the response content type depends on Accept header
Response Headers: - Content-Type – this depends on Accept header of request
Status Codes: - 201 Created – no error
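A client that reports metrics needs to respect the rule that "pod_name" and "run_id" are mutually exclusive. The sketch below builds a request body with the field names from the example and rejects payloads that set both; build_metric is a hypothetical helper, and the authoritative field definitions remain those in mlbench/api/models/kubemetrics.py.

```python
import json
from datetime import datetime, timezone


def build_metric(name, value, pod_name=None, run_id=None, cumulative=False, metadata=""):
    """Build a metric payload, refusing to set both pod_name and run_id."""
    if pod_name is not None and run_id is not None:
        raise ValueError("pod_name and run_id are mutually exclusive")
    payload = {
        "name": name,
        # UTC timestamp in the layout used by the example request.
        "date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        # The example transmits the value as a string.
        "value": str(value),
        "cumulative": cumulative,
        "metadata": metadata,
    }
    if pod_name is not None:
        payload["pod_name"] = pod_name
    if run_id is not None:
        payload["run_id"] = run_id
    return payload
```

Serializing the result with json.dumps yields lowercase JSON booleans, so "cumulative" is sent as false rather than Python's False.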
Runs
GET /api/runs/
Gets all active/failed/finished runs.
Example request:
    GET /api/runs/ HTTP/1.1
    Host: example.com
    Accept: application/json, text/javascript

Example response:
    HTTP/1.1 200 OK
    Vary: Accept
    Content-Type: text/javascript

    [
      {
        "id": 1,
        "name": "Name of the run",
        "created_at": "2018-08-03T09:21:29.347960Z",
        "state": "STARTED",
        "job_id": "5ec9f286-e12d-41bc-886e-0174ef2bddae",
        "job_metadata": {...}
      },
      {
        "id": 2,
        "name": "Another run",
        "created_at": "2018-08-02T08:11:22.123456Z",
        "state": "FINISHED",
        "job_id": "add4de0f-9705-4618-93a1-00bbc8d9498e",
        "job_metadata": {...}
      }
    ]
Request Headers: - Accept – the response content type depends on Accept header
Response Headers: - Content-Type – this depends on Accept header of request
Status Codes: - 200 OK – no error
GET /api/runs/(int: run_id)/
Gets a run by id.
Example request:
    GET /api/runs/1/ HTTP/1.1
    Host: example.com
    Accept: application/json, text/javascript

Example response:
    HTTP/1.1 200 OK
    Vary: Accept
    Content-Type: text/javascript

    {
      "id": 1,
      "name": "Name of the run",
      "created_at": "2018-08-03T09:21:29.347960Z",
      "state": "STARTED",
      "job_id": "5ec9f286-e12d-41bc-886e-0174ef2bddae",
      "job_metadata": {...}
    }

Parameters: - run_id – the id of the run
Request Headers: - Accept – the response content type depends on Accept header
Response Headers: - Content-Type – this depends on Accept header of request
Status Codes: - 200 OK – no error
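Because a run reports its "state", a script can poll this endpoint until the run is over. The sketch below shows one way to do that; the fetch_run callable (which would perform the GET request), the polling interval, and the timeout are hypothetical. Only "STARTED" and "FINISHED" appear in the examples here, so the "FAILED" state name is an assumption.

```python
import time


def wait_for_run(fetch_run, run_id, poll_seconds=5.0, timeout_seconds=3600.0,
                 terminal_states=("FINISHED", "FAILED")):
    """Poll fetch_run(run_id) until the run reaches a terminal state, then return it.

    "FAILED" is an assumed state name; adjust terminal_states to the real values.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        run = fetch_run(run_id)
        if run["state"] in terminal_states:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError("run {} did not finish in time".format(run_id))
        time.sleep(poll_seconds)
```

Injecting the fetch function keeps the loop independent of any particular HTTP client and easy to exercise with canned responses.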
POST /api/runs/
Starts a new Run.
Example request:
    POST /api/runs/ HTTP/1.1
    Host: example.com
    Accept: application/json, text/javascript

Request JSON Object:
- name (string) – Name of the run
- num_workers (int) – Number of worker nodes for the run
- num_cpus (json) – Number of cores utilized by each worker
- max_bandwidth (json) – Maximum egress bandwidth in mbps for each worker
Example response:
    HTTP/1.1 200 OK
    Vary: Accept
    Content-Type: text/javascript

    {
      "id": 1,
      "name": "Name of the run",
      "created_at": "2018-08-03T09:21:29.347960Z",
      "state": "STARTED",
      "job_id": "5ec9f286-e12d-41bc-886e-0174ef2bddae",
      "job_metadata": {...}
    }
Request Headers: - Accept – the response content type depends on Accept header
Response Headers: - Content-Type – this depends on Accept header of request
Status Codes:
- 200 OK – no error
- 409 Conflict – a run is already active
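Starting a run from a script could look like the sketch below, which sends the four documented fields as a JSON body and treats 409 Conflict as "a run is already active". The helper names, the dashboard_url argument, and the Content-Type request header are assumptions; the field names come from the Request JSON Object list above.

```python
import json
import urllib.error
import urllib.request


def build_run_request(dashboard_url, name, num_workers, num_cpus, max_bandwidth):
    """Prepare a POST /api/runs/ request carrying the documented JSON fields."""
    body = json.dumps({
        "name": name,
        "num_workers": num_workers,
        "num_cpus": num_cpus,
        "max_bandwidth": max_bandwidth,
    }).encode("utf-8")
    return urllib.request.Request(
        dashboard_url.rstrip("/") + "/api/runs/",
        data=body,
        # Sending the body as application/json is an assumption.
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def start_run(dashboard_url, name, num_workers, num_cpus, max_bandwidth):
    """Start a run and return its JSON description, or None if one is already active."""
    request = build_run_request(dashboard_url, name, num_workers, num_cpus, max_bandwidth)
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as error:
        if error.code == 409:  # a run is already active
            return None
        raise
```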
Worker
The worker image contains all the boilerplate code needed for a distributed ML model, as well as the actual model code. It takes care of training the distributed model according to the settings provided by the Master: the Master tells the Worker nodes which experiment the user would like to run, and the Workers then run the relevant code by themselves.
Worker nodes send status information to the metrics API of the Master to inform it of the progress and state of the current run.