Train a model remotely

This is a step-by-step guide on how to train a generic model from the AI4OS Dashboard with your own dataset.

In this tutorial we will see how to retrain a generic image classifier on a custom dataset to create a phytoplankton classifier. If you want to follow along, you can download the toy phytoplankton dataset here.

If you are new to Machine Learning, you might want to check some useful Machine Learning resources we compiled to help you get started.

Requirements

  • You need an authentication to be able to access the Dashboard and Nextcloud storage.

  • For Step 7 we recommend having Docker installed (though it’s not strictly mandatory).

1. Choose a module from the Marketplace

The first step is to choose a module from the AI4OS Dashboard. Make sure to select a module with the trainable tag. For educational purposes we are going to use a generic image classification model. Some model-dependent details will change if you use another model, but this tutorial provides a general overview of the workflow to follow with any of the modules in the AI4OS Dashboard.

2. Upload your files to Nextcloud

For this example we are going to use the AI4OS Nextcloud for storing the dataset you want to retrain the model with.

Log in to Nextcloud with your credentials and you should see an overview of your files. Now it’s time to upload your dataset. When training a model, the data usually has to be in a specific format and folder structure. It’s usually helpful to read the README in the source code of the module (in this case located here) to learn the correct way to set it up.

In the case of the image classification module, we will create the following folders:

../../_images/nc-folders.png
  • A folder called models where the new training weights will be stored after the training is completed

  • A folder called data that contains two different folders:

    • The subfolder images containing the input images needed for the training

    • The subfolder dataset_files containing a couple of files:

      • train.txt indicating the relative path to the training images

      • classes.txt indicating which are the categories for the training

Again, the folder structure and its contents will of course depend on the module you use. This structure is just an example in order to complete the workflow for this tutorial.
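As an illustration, the two files in dataset_files could be generated with a short script like the one below. This is only a sketch under assumptions that do not come from this tutorial: it supposes that your images are sorted into one subfolder per class (images/<class_name>/...), that train.txt holds one "relative path, space, class index" pair per line, and that classes.txt holds one class name per line. The function name build_dataset_files is hypothetical. Check the README of your module for the format it actually expects.

```python
from pathlib import Path

# Assumed image extensions; adjust to your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_dataset_files(data_dir):
    """Write classes.txt and train.txt under <data_dir>/dataset_files.

    Assumes (hypothetically) that images are stored as
    <data_dir>/images/<class_name>/<image file>.
    """
    data_dir = Path(data_dir)
    images_dir = data_dir / "images"
    files_dir = data_dir / "dataset_files"
    files_dir.mkdir(parents=True, exist_ok=True)

    # One class per subfolder, sorted so that class indices are stable.
    classes = sorted(p.name for p in images_dir.iterdir() if p.is_dir())
    (files_dir / "classes.txt").write_text("\n".join(classes) + "\n")

    # One line per image: path relative to images/, then the class index.
    lines = []
    for label, name in enumerate(classes):
        for img in sorted((images_dir / name).iterdir()):
            if img.suffix.lower() in IMAGE_SUFFIXES:
                rel = img.relative_to(images_dir).as_posix()
                lines.append(f"{rel} {label}")
    (files_dir / "train.txt").write_text("\n".join(lines) + "\n")
    return classes, lines
```

Calling build_dataset_files("data") on your local copy would then create data/dataset_files/classes.txt and data/dataset_files/train.txt next to the images folder.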

Once you have prepared your data locally, you can drag your folder to the Nextcloud Web UI to upload it.

If your dataset is on a remote machine, you will have to install rclone on that machine, configure it and run rclone copy to move your data to Nextcloud.

Tip

Uploading to Nextcloud can be particularly slow if your dataset is composed of lots of small files. Consider zipping your folder before uploading.

$ zip -r <foldername>.zip <foldername>  # on your machine, before uploading
$ unzip <foldername>.zip  # in your deployment, after copying the zip file

3. Deploy with the Training Dashboard

Now go to the AI4OS Dashboard and log in with your credentials. Then go to (1) Modules (marketplace) ➜ (2) Train image classifier ➜ (3) Train module.

Now you will be presented with a configuration form. For the purposes of running a retraining, it should be filled in as follows:

  1. In the General configuration you should select:

  • Template = default (with storage options), unless stated otherwise in your module's README.

  • Command = JupyterLab because we want the flexibility of being able to interact with the code and the terminal, not just the API.

  • Hardware configuration = GPU because training is a very resource-consuming task.

  • Docker tag = gpu because the Docker tag has to match the hardware it will run on.

  2. Once this is set, you can proceed to fill in the Specific configuration:

  • jupyter password: you have to provide a password at least 9 characters long, so that nobody else can access your machine, which will be exposed on a public IP.

  • rclone_user, rclone_password: these are the credentials needed to mount your Nextcloud directory in your deployment. Go here in order to find how to create them.
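If you do not have a password manager at hand, a random JupyterLab password that satisfies the 9 character minimum can be generated with the Python standard library. This is just one possible approach; the helper name generate_password and the chosen length are illustrative assumptions, not something the Dashboard requires.

```python
import secrets

MIN_LENGTH = 9  # minimum length asked for by the configuration form


def generate_password(n_bytes=12):
    """Return a random URL-safe password (12 random bytes give 16 characters)."""
    password = secrets.token_urlsafe(n_bytes)
    if len(password) < MIN_LENGTH:
        raise ValueError("n_bytes is too small for the required length")
    return password


print(generate_password())
```

Store the generated value somewhere safe, as you will need it every time you open JupyterLab for this deployment.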

Now that you are done configuring, click Submit to create the deployment. See the Dashboard guide for more details.

4. Go to JupyterLab and mount your dataset

After submitting you will be redirected to the deployments list. In your new deployment go to Access and choose JupyterLab. You will be redirected to http://jupyterlab_endpoint

Now that you are in JupyterLab, open a Terminal window ((New launcher) ➜ Others ➜ Terminal).

First let’s check we are seeing our GPU correctly:

$ nvidia-smi

This should output the GPU model along with some extra info.

Then configure rclone. You can check that rclone is correctly configured with:

$ rclone about rshare:

which should output your used space in Nextcloud.

Tip

If you happen to need additional packages, you will have to update the package index first. Note that sudo is not needed as you are always root in your Docker containers:

$ apt update
$ apt install vim

Now we will copy our remote Nextcloud folders into our local container:

$ rclone copy rshare:/data/dataset_files /srv/image-classification-tf/data/dataset_files
$ rclone copy rshare:/data/images /srv/image-classification-tf/data/images

Paths with the rshare prefix are Nextcloud paths. As always, paths are specific to this example. Your module might need different paths. If you zipped your files before uploading to Nextcloud you will have to rclone copy the zip file, unzip it and copy the contents to the appropriate folders.

Copying your dataset might take some time, depending on the dataset size, file structure (lots of small files vs few big files), and so on. So grab a cup of coffee and prepare for the next steps.

Now that your dataset is in place, we will run DEEPaaS to interactively run the training. In your terminal window type:
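Before launching a long training, it can be worth checking that every image listed in train.txt actually arrived in the container. The sketch below is one way to do this from a JupyterLab notebook or Python console. It assumes that each line of train.txt starts with an image path relative to the images folder, followed by a space and a label; that format and the function name missing_images are assumptions, so adapt them to what your module expects.

```python
from pathlib import Path


def missing_images(train_file, images_dir):
    """Return the image paths listed in train_file that are absent from images_dir."""
    images_dir = Path(images_dir)
    missing = []
    for line in Path(train_file).read_text().splitlines():
        if not line.strip():
            continue  # skip empty lines
        # Assumed line format: "<relative path> <label>"; keep only the path.
        rel_path = line.rsplit(" ", 1)[0]
        if not (images_dir / rel_path).is_file():
            missing.append(rel_path)
    return missing
```

With the paths used in this example, the call would be missing_images("/srv/image-classification-tf/data/dataset_files/train.txt", "/srv/image-classification-tf/data/images"). An empty list means every listed image was found.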

$ nohup deep-start --deepaas &

The & runs the command in the background, and nohup keeps it running even if you close the terminal. nohup also writes the output to a log file nohup.out that you can always look at if you want to know what is going on under the hood.

5. Open the DEEPaaS API and train the model

Now go back to the deployments list view. In your deployment go to Access and choose DEEPaaS. You will be redirected to http://deepaas_endpoint/ui.

../../_images/deepaas.png

Look for the train POST method. Modify the training parameters you wish to change and execute.

If a monitoring tool is available for the module, you will be able to follow the training progress at http://monitor_endpoint (click Access button ➜ Monitoring, in the deployments page). For example, in the image classification module, you can monitor training progress with TensorBoard.

../../_images/tensorboard.png

6. Test and export the newly trained model

Once the training has finished, you can directly test the model with the predict POST method. For this you first have to kill the process running deepaas and launch it again:

$ kill -9 $(ps aux | grep '[d]eepaas-run' | awk '{print $2}')
$ kill -9 $(ps aux | grep '[t]ensorboard' | awk '{print $2}')  # optionally also kill monitoring process
$ nohup deep-start --deepaas &  # relaunch deepaas, same command as in step 4

This is because the user inputs for deepaas are generated when deepaas is launched, so the running instance is not aware of the newly trained model. Once deepaas is restarted, head to the predict POST method, select your new model weights and upload the image you want to classify.

If you are satisfied with your model, then it’s time to save it into your remote storage, so that you still have access to it if your machine is deleted. For this we have to create a tar file with the model folder (in this case, the folder name is the timestamp at which the training was launched) so that we can later download it in our Docker container.

So go back to JupyterLab, open a Terminal window and run:

$ cd /srv/image-classification-tf/models
$ tar cfJ <modelname.tar.xz> <foldername>
$ rclone copy /srv/image-classification-tf/models rshare:/models

Now you should be able to see your new model weights in Nextcloud.

For the next step, you need to make them publicly available through a URL so they can be downloaded in your Docker container. In Nextcloud, go to the tar file you just created: ➜ Share Link ➜ (Create a new share link)

Zenodo preservation

Optionally, in order to improve the reproducibility of your code, we encourage you to share your training dataset on Zenodo. Once you upload the dataset, make sure to link it with the relevant Zenodo community (AI4EOSC, iMagine).

If long-term preservation and versioning of model weights is important to you, you can also upload the model weights to Zenodo in addition to Nextcloud.

7. Create a Docker repo for your new module

Now, let’s say you want to share your new application with your colleagues. The process is much simpler than when developing a new module from scratch, as your code is the same as the original application, only your model weights are different.

To account for this simpler process, we have prepared a version of the AI4OS Modules Template specially tailored to this task:

  • Go to the Template creation webpage. You will need an authentication to access this webpage.

  • Then select the child-module branch of the template and answer the questions.

  • Click on Generate and you will be able to download a .zip file with one project directory:

    ~/DEEP-OC-<project-name>
    

    Extract it locally.

Once this is done, the following steps are:

(1) Modify metadata.json with the proper description of your new module. This is the information that will be displayed in the Marketplace. Among the fields you might need to edit are:

  • title (mandatory): short title,

  • summary (mandatory): one liner summary of your module,

  • description (optional): extended description of your module, like a README,

  • keywords (mandatory): tags to make your module more findable,

  • training_files_url (optional): the URL of your model weights and additional training information,

  • dataset_url (optional): the URL of the dataset,

  • cite_url (optional): the DOI URL of any related publication.

Most other fields are pre-filled via the AI4OS Modules Template and usually do not need to be modified. Check that the JSON formatting is still valid by running:

$ pip install git+https://github.com/deephdc/schema4apps
$ deep-app-schema-validator metadata.json

Due to some issues with JSON parsing, avoid using : in the values you fill in.
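For a quick local check before running the schema validator, a few lines of Python can confirm that the file still parses as JSON and that the fields marked as mandatory above are filled in. This is a minimal sketch that complements, and does not replace, deep-app-schema-validator. The function name check_metadata is hypothetical, and the list of mandatory fields is taken only from the bullets above, so the real schema may require more.

```python
import json

# Fields described as mandatory in the list above; the full schema may require more.
MANDATORY_FIELDS = ("title", "summary", "keywords")


def check_metadata(text):
    """Return a list of problems found in the metadata JSON text (empty if none)."""
    try:
        metadata = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    if not isinstance(metadata, dict):
        return ["top-level value must be a JSON object"]
    return [
        f"missing or empty field: {name}"
        for name in MANDATORY_FIELDS
        if not metadata.get(name)
    ]
```

You could run it as check_metadata(open("metadata.json").read()) from the project directory; an empty list means that none of these basic problems were found.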

(2) Then go to the Dockerfile. You will see that the base Docker image is the image of the original repo. Modify the appropriate lines to replace the original model weights with the new model weights. In our case, this could look something like this:

ENV SWIFT_CONTAINER https://share.services.ai4os.eu/index.php/s/r8y3WMK9jwEJ3Ei/download
ENV MODEL_TAR phytoplankton.tar.xz

RUN rm -rf image-classification-tf/models/*
RUN curl --insecure -o ./image-classification-tf/models/${MODEL_TAR} \
    ${SWIFT_CONTAINER}/${MODEL_TAR}
RUN cd image-classification-tf/models && \
    tar -xf ${MODEL_TAR} &&\
    rm ${MODEL_TAR}

Check your Dockerfile works correctly by building it locally and running it:

$ docker build --no-cache -t your_project .
$ docker run -ti -p 5000:5000 -p 6006:6006 -p 8888:8888 your_project

Your module should be visible at http://0.0.0.0:5000/ui

Once you are happy with the state of your module, go to GitHub to create the repo https://github.com/<github-user>/DEEP-OC-<project-name> and push the changes.

8. Share your new module in the Marketplace

Once your repo is set, it’s time to make a PR to add your model to the marketplace!

For this you have to fork the code of the DEEP catalog repo (deephdc/deep-oc) and add your Docker repo name at the end of the MODULES.yml.

- module: https://github.com/deephdc/UC-<github-user>-DEEP-OC-<project-name>

You can do this directly online on GitHub or via the command line:

$ git clone https://github.com/[my-github-fork]
$ cd [my-github-fork]
$ echo '- module: https://github.com/deephdc/UC-<github-user>-DEEP-OC-<project-name>' >> MODULES.yml
$ git commit -a -m "adding new module to the catalogue"
$ git push
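Note that appending with echo will add the entry again if you run the command twice, and it will glue the entry onto the previous line if the file does not end with a newline. If you prefer to script the edit more defensively, something along these lines could be used. It is only a sketch: the function name add_module_entry is hypothetical, and it treats MODULES.yml as plain text rather than parsing the YAML.

```python
def add_module_entry(modules_text, repo_url):
    """Return modules_text with a '- module: <repo_url>' entry appended once."""
    entry = f"- module: {repo_url}"
    existing = [line.strip() for line in modules_text.splitlines()]
    if entry in existing:
        return modules_text  # already listed, nothing to do
    if modules_text and not modules_text.endswith("\n"):
        modules_text += "\n"  # avoid gluing the entry onto the last line
    return modules_text + entry + "\n"
```

To use it, read MODULES.yml into a string, pass it through add_module_entry together with your module URL, write the result back to the file, and then commit and push as shown above.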

Once the changes are done, make a PR of your fork to the original repo and wait for approval. Check the GitHub Standard Fork & Pull Request Workflow in case of doubt.

When your module gets approved, you may need to commit and push a change to metadata.json in your https://github.com/<github-user>/DEEP-OC-<project-name> so that the Pipeline is run for the first time, and your module gets rendered in the marketplace.

9. [optional] Add your new module to the original Continuous Integration pipeline

Your module is already in the Marketplace. But what happens if the code in the original image-classification module changes? This should trigger a rebuild of your Docker container as it is based on that code.

This can be achieved by modifying the Jenkinsfile in the image-classification Docker repo. One would add an additional stage to the Jenkins pipeline like so:

stage("Re-build DEEP-OC Docker images for derived services") {
    when {
        anyOf {
           branch 'master'
           branch 'test'
           buildingTag()
        }
    }
    steps {

        // Wait for the base image to be correctly updated in DockerHub as it is going to be used as base for
        // building the derived images
        sleep(time:5, unit:"MINUTES")

        script {
            def derived_job_locations =
            ['Pipeline-as-code/DEEP-OC-org/DEEP-OC-plants-classification-tf',
             'Pipeline-as-code/DEEP-OC-org/DEEP-OC-conus-classification-tf',
             'Pipeline-as-code/DEEP-OC-org/DEEP-OC-seeds-classification-tf',
             'Pipeline-as-code/DEEP-OC-org/DEEP-OC-phytoplankton-classification-tf'
             ]

            for (job_loc in derived_job_locations) {
                job_to_build = "${job_loc}/${env.BRANCH_NAME}"
                def job_result = JenkinsBuildJob(job_to_build)
                job_result_url = job_result.absoluteUrl
            }
        }
    }
}

So if you want this step to be performed, you must submit a PR to the original module's Docker repo with changes similar to the ones above.

10. Next steps

Do you want to go further?

Tip

If you run into problems you can always check the Frequently Asked Questions (FAQ).