Frequently Asked Questions (FAQ)¶
This page gathers known issues of the platform, along with possible solutions. If your issue does not appear here, please contact support.
Hardware issues¶
🔥 The Dashboard shows my deployment but it immediately disappears¶
Sometimes, when you create a new deployment, it initially appears in the deployments table but then disappears immediately, or is marked as failed/error.
This usually happens in deployments that were launched using Nextcloud. It can happen that your Nextcloud credentials become invalid, thus leading to a failure when trying to launch your deployment with Nextcloud connected.
To fix this issue, please re-link your Nextcloud account and try deploying again.
If you are still experiencing this error after relinking, or if the affected deployment was not linked with Nextcloud at all, please contact support.
We are debugging why the Nextcloud expiration happens in the first place.
🔥 The Dashboard shows there are free GPUs but my deployment is still queued¶
This can happen sometimes when a GPU gets stuck in the system and is not correctly freed.
Please contact support if this happens to you!
🔥 I ran out of disk in my deployment¶
You are trying to download some data but the following error is raised:
RESOURCE_EXHAUSTED: Out of memory while trying to allocate ******** bytes
This means that you have consumed more disk than what you initially requested. You can see your current disk consumption using:
$ df -h | grep overlay
This will show you three values, respectively the Total | Used | Remaining disk.
To solve this, first make sure to delete the files in the Trash (/root/.local/share/Trash/files). Files deleted from the JupyterLab UI end up there, so the space is not correctly freed.
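For instance, a minimal sketch of emptying the Trash and checking the freed space (the Trash path is the one mentioned above; adjust it if yours differs):
# Empty the JupyterLab Trash so the space is actually freed
rm -rf /root/.local/share/Trash/files/*
df -h | grep overlay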
If you still do not have enough disk, you have two options:
create a new deployment, requesting more disk in the configuration,
access your Nextcloud dataset files via a virtual filesystem, in order to avoid overloading the disk (see the sketch below).
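As an illustration of the second option, a minimal sketch of mounting a Nextcloud folder as a virtual filesystem with rclone (the remote name nextcloud and the paths are assumptions; adapt them to your own configuration):
mkdir -p /mnt/datasets
# Mount the remote folder; files are fetched on demand instead of filling the local disk
rclone mount nextcloud:Datasets /mnt/datasets --daemon --vfs-cache-mode minimal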
🔥 My deployment does not correctly list my resources¶
The deployments in the platform are created as Docker containers. Therefore some resources might not be properly virtualized as they would be in a traditional Virtual Machine. This means that standard commands for checking resources might give you higher numbers than what is really available (i.e. they give you the resources of the full Virtual Machine where Docker is running, not the resources available to your individual Docker container).
Standard commands:
CPU:
lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
RAM memory:
free -h
Disk:
df -h
Real available resources can be found with the following commands:
CPU:
printenv | grep NOMAD_CPU
This will show both the reserved cores (NOMAD_CPU_CORES) and the maximum CPU limit in MHz (NOMAD_CPU_LIMIT).
RAM memory:
echo $NOMAD_MEMORY_LIMIT
or
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
Disk:
df -h | grep overlay
This will show you three values, respectively the Total | Used | Remaining disk.
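As a convenience, a small sketch that prints the real limits in one go (the unit of NOMAD_MEMORY_LIMIT is assumed to be MB; NOMAD_CPU_LIMIT is in MHz as described above):
echo "Reserved CPU cores: $NOMAD_CPU_CORES"
echo "CPU limit (MHz):    $NOMAD_CPU_LIMIT"
echo "RAM limit (MB):     $NOMAD_MEMORY_LIMIT"   # unit assumed to be MB
df -h | grep overlay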
It is your job to program your application to make use of these real resources (e.g. load smaller models, load less data, etc). Failing to do so could get your process killed for surpassing the available resources. For example, check how to limit CPU usage in Tensorflow or Pytorch.
More info
For example, trying to allocate 8GB in a 4GB RAM machine will lead to failure.
root@2dc9e20f923e:/srv# stress -m 1 --vm-bytes 8G
stress: info: [69] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [69] (415) <-- worker 70 got signal 9
stress: WARN: [69] (417) now reaping child worker processes
stress: FAIL: [69] (451) failed run completed in 6s
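As a hedged example, one common way to stay within the reserved CPU cores is to cap the number of threads via an environment variable before launching your code (OMP_NUM_THREADS is respected by PyTorch, NumPy and most OpenMP/BLAS backends; train.py is a hypothetical entry point):
export OMP_NUM_THREADS=$NOMAD_CPU_CORES   # cap CPU threads to the reserved cores
python train.py                           # hypothetical training script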
🔥 My GPU just disappeared from my deployment¶
You try to list the GPU and it doesn’t appear:
$ nvidia-smi
Failed to initialize NVML: Unknown Error
This is due to this known issue, which we are working on fixing. If this is happening to you, please contact support.
In the meantime, your best option is to backup your data, delete your deployment and create a new one.
Storage issues¶
🔥 I cannot access /storage¶
You try to access “/storage” and you get the message:
root@226c02330e9f:/srv# ls /storage
ls: reading directory '/storage': Input/output error
This probably means that you have entered the wrong credentials when configuring your deployment in the Dashboard.
You will need to delete the current deployment and make a new one. Follow our guidelines on how to get an RCLONE user and password to fill the deployment configuration form.
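If rclone is configured inside your deployment, a quick hedged check of whether the credentials work (the remote name nextcloud is an assumption; use the name shown by rclone listremotes):
rclone listremotes            # list the configured remotes
rclone lsd nextcloud:         # should list your top-level folders
rclone about nextcloud:       # should report your quota instead of a 401 error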
🔥 Accessing /storage runs abnormally slow¶
This happens from time to time due to connectivity issues. If this behavior persists for more than a few days, try creating a new deployment.
If latency is still slow in the new deployment, please contact support.
🔥 I cannot find my dataset under /storage/ai4-storage¶
Option 1: Refresh the index
This can happen if you are accessing the dataset from several deployments at the same time, and the ls command hasn’t properly refreshed its index.
To fix this, cd into the folder and run cd . so that ls refreshes its index (ref), as sketched below. Now you should be able to see your dataset.
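A minimal sketch of the refresh trick (the dataset path is hypothetical; use your own):
cd /storage/ai4-storage/my-dataset   # hypothetical dataset folder
cd .                                 # force the index to refresh
ls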
Option 2: Download error
It can also happen that your dataset failed to download for some reason. In the file ai4os.log you will find the reason for the failure (e.g. timeout).
You have several options:
Option 1: redeploy and see if the timeout error is no longer happening,
Option 2: try to download the dataset with the CLI using datahugger:
pip install datahugger
datahugger "<doi>" "<data_dir>"
Option 3: download your dataset manually and upload it to Nextcloud
🔥 rclone fails to connect¶
You tried to manually use RCLONE and you are returned the following error message:
2024/11/04 13:04:53 Failed to about: about call failed: No public access to this resource., Username or password was incorrect, No 'Authorization: Bearer' header found. Either the client didn't send one, or the server is mis-configured, Username or password was incorrect: Sabre\DAV\Exception\NotAuthenticated: 401 Unauthorized
This is probably because you are using an older RCLONE version (earlier than 1.63.3).
Update to a newer RCLONE version and find more information here.
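A hedged sketch of checking and upgrading rclone (the official install script assumes curl is available; drop sudo if you are already root):
rclone version                                   # check the installed version
curl https://rclone.org/install.sh | sudo bash   # official install script, upgrades rclone in place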
Other issues¶
🔥 Service X is not working¶
Check the Status page to see if there is any maintenance action going on. If you don’t see anything, wait a couple of hours to make sure it is not a temporary issue.
If the issue persists, please contact support.
ℹ️ I received a cluster downtime notification, what should I do?¶
If a downtime is expected, you should back up your work in order to avoid losing data. Sometimes, when the downtime affects only some nodes of the cluster, you might recover your original work after the downtime. But you should back it up anyway, just to be on the safe side.
How to backup modules?¶
There are two options. To be extra-safe, you can run both of them:
Create a snapshot from your deployment. After the downtime you should be able to redeploy it and resume your work where you left off. This is the most comprehensive option, as it saves both your data and the software/configuration you installed in your deployment.
Save your data somewhere.
If your deployment is connected with the AI4OS Storage, you can move your work under /storage (see the sketch after this list). It will automatically write the data into Nextcloud. In any case, it is always good practice to develop under the /storage path because, that way, your work is automatically synced with Nextcloud, thus preventing data loss in case of an unforeseen failure.
If you are using git, you can commit your work to Github.
If you are accessing your deployment via an IDE, you can use the available options to directly download your files.
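For the /storage option, a minimal sketch of copying your work over (the source path /srv/my-project is hypothetical; point it at your actual working directory):
mkdir -p /storage/backups
cp -r /srv/my-project /storage/backups/my-project-$(date +%Y%m%d)   # hypothetical project path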
How to backup tools?¶
Snapshot creation is not supported for tools. Therefore you will need to manually back up the data (different options are available for each tool).
In the case of CVAT deployments, deleting your CVAT deployment will automatically create a snapshot in the platform, from which you will be able to restore it later on.
🚀 I would like to suggest a new feature¶
We are always happy to improve our software based on user feedback.
Please open an issue in the Github repo of the component you are interested in:
If you think the documentation itself can be improved, don’t hesitate to open an issue or submit a Pull Request.
You can always check that your suggested feature is not on the Upcoming features list.