PyTorch Operator
Kubeflow PyTorch-Job Training Operator
PyTorch is a Python package that provides two high-level features:
    Tensor computation (like NumPy) with strong GPU acceleration
    Deep neural networks built on a tape-based autograd system
You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch when needed. More information is available at https://github.com/kubeflow/pytorch-operator or on the PyTorch site at https://pytorch.org/

Quick Start

As usual, let's deploy the PyTorch operator with a single command:
k3ai apply pytorch-op

Test Your PyTorch-Job installation

We will use the MNIST example from the Kubeflow PyTorch-Job repo at https://github.com/kubeflow/pytorch-operator/tree/master/examples/mnist
As usual, we want to avoid complexity, so we reworked the sample a bit to make it much easier to run.

Step 1

In the upstream example, a container image needs to be built before running the sample. We merged the container commands directly into the YAML file, so it is now a one-click job.
For CPU only
kubectl apply -f - << EOF
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
              command: ['sh','-c','pip install tensorboardX==1.6.0 && mkdir -p /opt/mnist/src && cd /opt/mnist/src && curl -O https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/examples/mnist/mnist.py && chgrp -R 0 /opt/mnist && chmod -R g+rwX /opt/mnist && python /opt/mnist/src/mnist.py']
              args: ["--backend", "gloo"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
              command: ['sh','-c','pip install tensorboardX==1.6.0 && mkdir -p /opt/mnist/src && cd /opt/mnist/src && curl -O https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/examples/mnist/mnist.py && chgrp -R 0 /opt/mnist && chmod -R g+rwX /opt/mnist && python /opt/mnist/src/mnist.py']
              args: ["--backend", "gloo"]
EOF
If you have GPUs enabled, you can run it this way instead
kubectl apply -f - << EOF
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
              command: ['sh','-c','pip install tensorboardX==1.6.0 && mkdir -p /opt/mnist/src && cd /opt/mnist/src && curl -O https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/examples/mnist/mnist.py && chgrp -R 0 /opt/mnist && chmod -R g+rwX /opt/mnist && python /opt/mnist/src/mnist.py']
              args: ["--backend", "gloo"]
              # Change the value of nvidia.com/gpu based on your configuration
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
              command: ['sh','-c','pip install tensorboardX==1.6.0 && mkdir -p /opt/mnist/src && cd /opt/mnist/src && curl -O https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/examples/mnist/mnist.py && chgrp -R 0 /opt/mnist && chmod -R g+rwX /opt/mnist && python /opt/mnist/src/mnist.py']
              args: ["--backend", "gloo"]
              # Change the value of nvidia.com/gpu based on your configuration
              resources:
                limits:
                  nvidia.com/gpu: 1
EOF

Step 2

Check that the pods are deployed correctly with
kubectl get pod -l pytorch-job-name=pytorch-dist-mnist-gloo -n kubeflow
It should output something like this
NAME                               READY   STATUS    RESTARTS   AGE
pytorch-dist-mnist-gloo-master-0   1/1     Running   0          2m26s
pytorch-dist-mnist-gloo-worker-0   1/1     Running   0          2m26s
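If you want to script this check rather than read the table by eye, a minimal sketch follows. The function name all_pods_running is hypothetical (it is not part of k3ai or kubectl), and it assumes the default kubectl table layout where STATUS is the third column.

```shell
#!/bin/sh
# Hypothetical helper (not part of k3ai or kubectl).
# Reads `kubectl get pod` table output on stdin and succeeds only if
# at least one pod is listed and every pod has STATUS "Running".
# Assumes the default table layout: NAME READY STATUS RESTARTS AGE.
all_pods_running() {
  awk '
    NR > 1 {                      # skip the header row
      seen = 1
      if ($3 != "Running") bad = 1
    }
    END {
      if (seen && !bad) exit 0
      exit 1
    }
  '
}

# Example against a live cluster (commented out so the snippet is safe to source):
# kubectl get pod -l pytorch-job-name=pytorch-dist-mnist-gloo -n kubeflow \
#   | all_pods_running && echo "all pods are running"
```

Note that once training finishes the pods normally leave the Running state, so this check is only meaningful while the job is in progress.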

Step 3

Check the logs of your training job
kubectl logs -l pytorch-job-name=pytorch-dist-mnist-gloo -n kubeflow
You should see output similar to the following (the lines appear twice because this job runs 1 Master and 1 Worker)
Train Epoch: 1 [55680/60000 (93%)] loss=0.0341
Train Epoch: 1 [56320/60000 (94%)] loss=0.0357
Train Epoch: 1 [56960/60000 (95%)] loss=0.0774
Train Epoch: 1 [57600/60000 (96%)] loss=0.1186
Train Epoch: 1 [58240/60000 (97%)] loss=0.1927
Train Epoch: 1 [58880/60000 (98%)] loss=0.2050
Train Epoch: 1 [59520/60000 (99%)] loss=0.0642

accuracy=0.9660

Train Epoch: 1 [55680/60000 (93%)] loss=0.0341
Train Epoch: 1 [56320/60000 (94%)] loss=0.0357
Train Epoch: 1 [56960/60000 (95%)] loss=0.0774
Train Epoch: 1 [57600/60000 (96%)] loss=0.1186
Train Epoch: 1 [58240/60000 (97%)] loss=0.1927
Train Epoch: 1 [58880/60000 (98%)] loss=0.2050
Train Epoch: 1 [59520/60000 (99%)] loss=0.0642

accuracy=0.9660
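To pull just the accuracy figures out of the logs, something like the sketch below may help. The function name extract_accuracy is hypothetical, and it assumes the script prints accuracy on its own line as accuracy=<number>, as in the sample output shown here.

```shell
#!/bin/sh
# Hypothetical helper (not part of k3ai or kubectl).
# Reads training logs on stdin and prints one accuracy value per line.
# Assumes lines of the form "accuracy=<number>" as in the sample output.
extract_accuracy() {
  sed -n 's/^accuracy=\([0-9.][0-9.]*\).*$/\1/p'
}

# Example against a live cluster (commented out so the snippet is safe to source):
# kubectl logs -l pytorch-job-name=pytorch-dist-mnist-gloo -n kubeflow | extract_accuracy
```

With one Master and one Worker you would expect one value per pod, provided each pod's accuracy line falls within the log lines that kubectl returns.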