Resource Management

How to effectively manage compute resources?


  • Running complex deep learning models poses a very practical resource management problem: how to give every team the tools they need to train their models without requiring them to operate their own infrastructure?

  • The most primitive approach is to use spreadsheets that allow people to reserve what resources they need to use.

  • The next approach is to utilize a SLURM Workload Manager, a free and open-source job scheduler for Linux and Unix-like kernels.

  • A very standard approach these days is to use Docker alongside Kubernetes.

    • Docker is a way to package up an entire dependency stack in a lighter-than-a-Virtual-Machine package.

    • Kubernetes is a way to run many Docker containers on top of a cluster.

  • The last option is to use open-source projects.

    • Using Kubeflow allows you to run model training jobs at scale on containers with the same scalability of container orchestration that comes with Kubernetes.

    • Polyaxon is a self-service multi-user system, taking care of scheduling and managing jobs in order to make the best use of available cluster resources.

Last updated