Training in AI4OS

This page serves as a guide to the different options for training a model on the AI4OS platform.

Training options

There are currently three main options to train a model in the AI4OS platform:

  • standard mode: you are given access to a persistent deployment that you can interact with via an IDE (e.g. VS Code).

  • batch mode: you deploy a temporary job that runs your training and is killed once the training is completed.

  • federated mode: you deploy a federated learning server that orchestrates the training. Several clients can then join forces to share the training load among themselves (a minimal client sketch is shown right after this list).
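For illustration, the sketch below shows what joining such a federated training could look like from the client side, assuming the server is based on the Flower framework (a common choice for federated learning). The server address, the toy model and the dummy local training step are placeholders; the real values depend on your deployment and module.

```python
import flwr as fl
import numpy as np


class ToyClient(fl.client.NumPyClient):
    """Minimal federated client: a single weight vector stands in for a real model."""

    def __init__(self):
        self.weights = np.zeros(10)

    def get_parameters(self, config):
        # Send the current local parameters to the server.
        return [self.weights]

    def fit(self, parameters, config):
        # Receive the global parameters, train locally on private data, return the update.
        self.weights = parameters[0] + 0.01  # placeholder for a real local training step
        return [self.weights], 1, {}         # (updated parameters, n. local examples, metrics)

    def evaluate(self, parameters, config):
        # Evaluate the global parameters on local data and report a loss.
        loss = float(np.abs(parameters[0]).mean())
        return loss, 1, {}


if __name__ == "__main__":
    # The address (and any authentication details) come from your federated server deployment.
    fl.client.start_numpy_client(
        server_address="<your-federated-server>:443",
        client=ToyClient(),
    )
```

Note that the local data never leave the client; only model parameters are exchanged with the server.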

All these options have their respective pros and cons.

Training options from the AI4OS Dashboard

Standard mode (persistent deployment)

  ✅ Pros

  • It is very convenient to edit and run your code directly from an IDE.

  ❌ Cons

  • The GPU is dedicated to you full time, even when you are not actually training, so resources are not used optimally.

Batch mode (temporary jobs)

  ✅ Pros

  • You only use resources for as long as your training needs to run.

    To promote the usage of batch mode among users, we have dedicated Tesla V100 GPU nodes devoted exclusively to batch mode.

  ❌ Cons

  • Less convenient, because you cannot debug from an IDE.

Federated mode

  ✅ Pros

  • You can scale your training across many deployments, mixing GPU and CPU deployments, both inside and outside the AI4OS platform.

  • Your training data remain local, so it is a perfect match for privacy-preserving use cases (e.g. healthcare).

  ❌ Cons

  • Not all modules support federated mode by default; you need to adapt their code first.

  • It might require a bit more work to set up.

Given these trade-offs, we recommend the following typical workflows:

  • Use standard mode for your preliminary training runs, when you might still need direct access to the code and data to debug things.

  • Use batch mode when your training script is stable and you are basically just tweaking hyperparameters (a sketch of such a script is shown after this list).

  • Use federated mode if you have sensitive data and/or need to distribute your training across many machines.
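To make the batch recommendation concrete, here is a minimal sketch of a non-interactive training entry point: all settings come from command-line arguments and results are written to disk, so the job can run unattended and be removed once it finishes. The script name, arguments and output paths below are illustrative, not something the platform prescribes.

```python
"""Illustrative non-interactive training entry point, suitable for batch runs."""
import argparse
import json
import pathlib


def run_training(epochs: int, lr: float) -> dict:
    # Placeholder for your real training loop.
    loss = 1.0
    for _ in range(epochs):
        loss *= (1.0 - lr)  # dummy "training" update
    return {"epochs": epochs, "lr": lr, "final_loss": loss}


def main():
    parser = argparse.ArgumentParser(description="Batch-friendly training script")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--output-dir", default="output")
    args = parser.parse_args()

    out_dir = pathlib.Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    metrics = run_training(args.epochs, args.lr)

    # Persist results so nothing is lost when the temporary job is removed.
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))


if __name__ == "__main__":
    main()
```

A batch job would then simply invoke something like python train.py --epochs 50 --lr 0.001 --output-dir /storage/my-run (the path here is hypothetical), with the output directory pointing to persistent storage so the results survive the job.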