
Personal Webpage of Max Horn


I work as an Applied Scientist at AWS AI, where I explore the intersection of Machine Learning, Deep Learning, and Causality. In particular, I am interested in building models that understand notions of objects present in the real world.
Prior to my position at AWS, I was a PhD student in Machine Learning and Computational Biology at ETH Zürich, where I worked on deep learning methods for real-world medical time series, dimensionality reduction, and topological machine learning.
My interests include, but are not limited to: Machine Learning for Healthcare, Probabilistic Modelling, Time Series Modelling, and Interpretable Machine Learning.
Here I write about things I care about in the realm of science, programming, technology, and crypto. All opinions expressed are solely my own and do not represent the views or opinions of my employer.


Project: simple-gpu-scheduler - easy scheduling of jobs on multiple GPUs

Our research group has multiple servers, each equipped with multiple GPUs. Unfortunately, these are not connected in a cluster infrastructure; instead, GPUs are assigned to individuals or on a per-project basis. This makes it difficult to execute many jobs across multiple GPUs.

While it would be possible to connect the servers into a small cluster with a scheduling system (we are working on it!), it can take a long time until such a setup is in place. Especially in academia, where the maintenance and setup of servers is often delegated to the department's IT team, the path to a small-scale cluster is littered with bureaucracy. Questions like "Who is responsible for xyz?", "How are the software installations managed?", and "Which alterations are needed to get the network infrastructure right?" can take ages to be answered and appropriately implemented. In our particular case, we had the idea of refurbishing the cluster more than a year ago and are still nowhere close to having it up and running.

XKCD comic about networking problems

The Alternative - simple-gpu-scheduler

Driven by the need for a bridge between our current server setup and the beautiful world of our future personal cluster, I decided to write a small Python package to do the job. This is how simple-gpu-scheduler was born.

How it works

Software based on the CUDA library (such as most deep learning frameworks, among many others) can be constrained to only see certain GPUs via the CUDA_VISIBLE_DEVICES environment variable. The simple-gpu-scheduler accepts commands and executes them while setting this environment variable to a currently free GPU. As soon as a job finishes, its GPU is released and the next job is allocated to it. This keeps all GPUs utilized to the fullest extent possible 1.
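
To make this concrete, here is a minimal sketch of the idea (not the actual implementation, just an illustration of the mechanism): one worker thread per GPU pulls commands from a shared queue and runs each one with CUDA_VISIBLE_DEVICES pinned to its GPU.

import os
import queue
import subprocess
import sys
import threading

def gpu_worker(gpu_id, commands):
    # Each worker only ever exposes its own GPU to the jobs it runs.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
    while True:
        try:
            command = commands.get_nowait()
        except queue.Empty:
            return  # no commands left, this worker is done
        print(f"Processing command `{command}` on gpu {gpu_id}")
        subprocess.run(command, shell=True, env=env)

def main(gpu_ids):
    # One shell command per line on stdin, one worker thread per GPU.
    commands = queue.Queue()
    for line in sys.stdin:
        commands.put(line.strip())
    workers = [threading.Thread(target=gpu_worker, args=(g, commands))
               for g in gpu_ids]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == "__main__":
    main(gpu_ids=[0, 1, 2])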

Usage

I wanted to make simple-gpu-scheduler as simple and flexible as possible and thus tried to adhere to the KISS principle. Like many UNIX tools, it takes its input from stdin so that it can be combined with other tools. This allows reading commands from a file, or even from a FIFO (a named pipe), such that we can build a fully functioning queuing system (see the sketch below). For further reference, please consult the GitHub page of the project.
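
For instance, a persistent queue could look roughly like this (an illustrative sketch; the path is made up):

$ mkfifo /tmp/gpu_queue
$ simple_gpu_scheduler --gpus 0 1 2 < /tmp/gpu_queue &
$ echo "python train_model.py --lr 0.001 --output run_1" > /tmp/gpu_queue

Note that the scheduler sees end-of-file once the last writer closes the pipe, so to keep the queue alive across multiple submissions you would keep a writer attached (e.g. exec 3> /tmp/gpu_queue in the submitting shell).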

Simple example

Suppose you have a file gpu_commands.txt with commands that you would like to execute on GPUs 0, 1, and 2 in parallel:

$ cat gpu_commands.txt
python train_model.py --lr 0.001 --output run_1
python train_model.py --lr 0.0005 --output run_2
python train_model.py --lr 0.0001 --output run_3

Then you can do so by simply piping the commands into the simple_gpu_scheduler script:

$ simple_gpu_scheduler --gpus 0 1 2 < gpu_commands.txt
Processing command `python train_model.py --lr 0.001 --output run_1` on gpu 2
Processing command `python train_model.py --lr 0.0005 --output run_2` on gpu 1
Processing command `python train_model.py --lr 0.0001 --output run_3` on gpu 0

One of the most common use cases for running many jobs in parallel is hyperparameter search. For convenience, I added a small script, simple_hypersearch, which generates commands to evaluate a hyperparameter grid. Here is a small example of how to generate all possible configurations and execute them in random order:

simple_hypersearch "python3 train_dnn.py --lr {lr} --batch_size {bs}" -p lr 0.001 0.0005 0.0001 -p bs 32 64 128 | simple_gpu_scheduler --gpus 0 1 2
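
The first command emits the cross product of the grid, one command per line (nine in this example, possibly in shuffled order), along the lines of:

python3 train_dnn.py --lr 0.001 --batch_size 32
python3 train_dnn.py --lr 0.001 --batch_size 64
...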

Final words

I hope some of you find the software useful. Feel free to open issues and feature requests if you are missing anything. See you next time!

  1. GNU parallel can be used to do something similar (see the HN discussion). It is significantly more flexible, which IMHO comes at the cost of ease of use. 
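
     For the curious, a rough (untested) equivalent using parallel's job-slot replacement string {%} might be:

     $ parallel -j 3 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) {}' < gpu_commands.txt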
