Training in AI4OS

This page serves as a guide to the different options for training a model on the AI4OS platform.

Training options

There are currently three main options to train a model in the AI4OS platform:

  • standard mode: you are given access to a persistent deployment that you can interact with via an IDE (e.g. VS Code).

  • batch mode: you deploy a temporary job that runs your training and is killed once the training is completed.

  • federated mode: you deploy a federated learning server that orchestrates the training. Several clients can then join forces to share the training load among themselves (a minimal client sketch is shown right after this list).
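For illustration, the sketch below shows what joining such a federated training could look like from the client side, assuming the server is based on the Flower framework (a common choice for federated learning). The server address, the toy model and the dummy local training step are placeholders; the real values depend on your deployment and module.

```python
import flwr as fl
import numpy as np


class ToyClient(fl.client.NumPyClient):
    """Minimal federated client: a single weight vector stands in for a real model."""

    def __init__(self):
        self.weights = np.zeros(10)

    def get_parameters(self, config):
        # Send the current local parameters to the server.
        return [self.weights]

    def fit(self, parameters, config):
        # Receive the global parameters, train locally on private data, return the update.
        self.weights = parameters[0] + 0.01  # placeholder for a real local training step
        return [self.weights], 1, {}         # (updated parameters, n. local examples, metrics)

    def evaluate(self, parameters, config):
        # Evaluate the global parameters on local data and report a loss.
        loss = float(np.abs(parameters[0]).mean())
        return loss, 1, {}


if __name__ == "__main__":
    # The address (and any authentication details) come from your federated server deployment.
    fl.client.start_numpy_client(
        server_address="<your-federated-server>:443",
        client=ToyClient(),
    )
```

Note that the local data never leave the client; only model parameters are exchanged with the server.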

All these options have their respective pros and cons.

Training options from the AI4OS Dashboard

Standard mode (persistent deployment)

  ✅ Pros

  • It is very convenient to edit and run your code directly from an IDE.

  ❌ Cons

  • The GPU is dedicated to you full time, even when you are not actually training, so resources are not used optimally.

Batch mode (temporary jobs)

  ✅ Pros

  • You only use resources for as long as your training needs to run.

    To promote the usage of batch mode among users, we have dedicated Tesla V100 GPU nodes devoted exclusively to batch mode.

  ❌ Cons

  • Less convenient, because you cannot debug from an IDE.

Federated mode

  ✅ Pros

  • You can scale your training across many deployments, mixing GPU and CPU deployments, both inside and outside the AI4OS platform.

  • Your training data remain local, so it is a perfect match for privacy-preserving use cases (e.g. healthcare).

  ❌ Cons

  • Not all modules support federated mode by default; you need to adapt their code first.

  • It might require a bit more work to set up.

Given these trade-offs, we recommend the following typical workflows:

  • Use standard mode for your preliminary training runs, when you might still need direct access to the code and data to debug things.

  • Use batch mode when your training script is stable and you are basically just tweaking hyperparameters (a sketch of such a script is shown after this list).

  • Use federated mode if you have sensitive data and/or need to distribute your training across many machines.
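To make the batch recommendation concrete, here is a minimal sketch of a non-interactive training entry point: all settings come from command-line arguments and results are written to disk, so the job can run unattended and be removed once it finishes. The script name, arguments and output paths below are illustrative, not something the platform prescribes.

```python
"""Illustrative non-interactive training entry point, suitable for batch runs."""
import argparse
import json
import pathlib


def run_training(epochs: int, lr: float) -> dict:
    # Placeholder for your real training loop.
    loss = 1.0
    for _ in range(epochs):
        loss *= (1.0 - lr)  # dummy "training" update
    return {"epochs": epochs, "lr": lr, "final_loss": loss}


def main():
    parser = argparse.ArgumentParser(description="Batch-friendly training script")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--output-dir", default="output")
    args = parser.parse_args()

    out_dir = pathlib.Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    metrics = run_training(args.epochs, args.lr)

    # Persist results so nothing is lost when the temporary job is removed.
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))


if __name__ == "__main__":
    main()
```

A batch job would then simply invoke something like python train.py --epochs 50 --lr 0.001 --output-dir /storage/my-run (the path here is hypothetical), with the output directory pointing to persistent storage so the results survive the job.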