Frequently Asked Questions (FAQ)¶
This page gathers known issues of the platform, along with possible solutions. If your issue does not appear here, please contact support.
Hardware issues¶
🔥 The Dashboard shows my deployment but it immediately disappears¶
Sometimes, when you create a new deployment, it initially appears in the deployments table but then disappears immediately, or is marked as failed/error.
This usually happens in deployments that were launched using Nextcloud. Your Nextcloud credentials can become invalid, which causes the deployment to fail when it is launched with Nextcloud connected.
To fix this issue, please re-link your Nextcloud account and try deploying again.
If you are still experiencing this error after relinking, please contact support. If you are experiencing this issue in a deployment that was not linked with Nextcloud, please contact support.
We are debugging why the Nextcloud expiration happens in the first place.
🔥 The Dashboard shows there are free GPUs but my deployment is still queued¶
This can happen sometimes when a GPU gets stuck in the system and is not correctly freed.
Please contact support if this happens to you!
🔥 I ran out of disk in my deployment¶
You are trying to download some data but the following error is raised:
RESOURCE_EXHAUSTED: Out of memory while trying to allocate ******** bytes
This means that you have consumed more disk than what you initially requested. You can see your current disk consumption using:
$ df -h | grep overlay
This will show you three values: the Total | Used | Remaining disk, respectively.
To solve this, first make sure to delete the files in the Trash (/root/.local/share/Trash/files).
Files deleted from the JupyterLab UI end up there, so the space they occupy is not actually freed.
If you still do not have enough disk space, you have two options:
create a new deployment, requesting more disk in the configuration,
access your Nextcloud dataset files via a virtual filesystem, in order to avoid overloading the disk.
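Before starting a large download, it can help to check whether the filesystem has enough free space left. The following is a minimal Python sketch using only the standard library. The function name `has_room`, the default path `/` and the 10% safety margin are illustrative assumptions, not part of the platform.

```python
import shutil


def has_room(needed_bytes: int, path: str = "/", margin: float = 0.10) -> bool:
    """Return True if `path` can hold `needed_bytes` plus a safety margin.

    `margin` is the fraction of the total disk to keep free (assumed value).
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    reserve = int(usage.total * margin)
    return usage.free - reserve >= needed_bytes


if __name__ == "__main__":
    gib = 1024 ** 3
    usage = shutil.disk_usage("/")
    # Same three values as `df -h`: Total | Used | Remaining
    print(f"Total {usage.total / gib:.1f} GiB | "
          f"Used {usage.used / gib:.1f} GiB | "
          f"Remaining {usage.free / gib:.1f} GiB")
    print("Room for a 5 GiB download:", has_room(5 * gib))
```

Calling `has_room` with the expected dataset size before downloading lets the script stop early instead of failing halfway through.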
🔥 My deployment does not correctly list my resources¶
The deployments in the platform are created as Docker containers. Therefore, some resources might not be properly virtualized as they would be in a traditional Virtual Machine. This means that standard commands for checking resources might report higher numbers than what is really available (i.e. they report the resources of the full Virtual Machine where Docker is running, not the resources available to your individual Docker container).
Standard commands:
CPU:
lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
RAM memory:
free -h
Disk:
df -h
Real available resources can be found with the following commands:
CPU:
printenv | grep NOMAD_CPU
will show both reserved cores (NOMAD_CPU_CORES) and maximum CPU limit (in MHz) (NOMAD_CPU_LIMIT).
RAM memory:
echo $NOMAD_MEMORY_LIMIT
or
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
Disk:
df -h | grep overlay
will show you three values, respectively the Total | Used | Remaining disk.
It is your job to program your application to make use of these real resources (e.g. load smaller models, load less data, etc.). Failing to do so could cause your process to be killed for exceeding the available resources. For example, check how to limit CPU usage in TensorFlow or PyTorch.
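As an illustration, the real limits can also be read from inside a Python application. This is a minimal sketch, not an official helper: the function names are hypothetical, and it assumes that NOMAD_CPU_CORES uses cpuset list notation (such as `0-2,7`) and that NOMAD_MEMORY_LIMIT is expressed in MB. The cgroup v2 path is an additional assumption for newer hosts.

```python
import os
from pathlib import Path


def count_cpuset(cpuset: str) -> int:
    """Count cores in cpuset list notation, e.g. "0-2,7" describes 4 cores."""
    total = 0
    for part in cpuset.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            total += int(last) - int(first) + 1
        else:
            total += 1
    return total


def reserved_cores(env=os.environ) -> int:
    """Reserved cores from NOMAD_CPU_CORES, with a best-effort fallback."""
    cpuset = env.get("NOMAD_CPU_CORES", "").strip()
    if cpuset:
        return count_cpuset(cpuset)
    # Fallback: may still reflect the host rather than the container.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def memory_limit_bytes(env=os.environ):
    """Memory limit in bytes, or None if it cannot be determined."""
    value = env.get("NOMAD_MEMORY_LIMIT", "").strip()
    if value.isdigit():
        return int(value) * 1024 * 1024  # assumption: value is in MB
    for candidate in (
        "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # path from the text
        "/sys/fs/cgroup/memory.max",  # cgroup v2 location (assumption)
    ):
        try:
            raw = Path(candidate).read_text().strip()
        except OSError:
            continue
        if raw.isdigit():  # cgroup v2 reports "max" when unlimited
            return int(raw)
    return None


if __name__ == "__main__":
    print("Reserved cores:", reserved_cores())
    print("Memory limit (bytes):", memory_limit_bytes())
```

The core count can then be handed to the framework, for example through `torch.set_num_threads(reserved_cores())` in PyTorch, and the memory limit can guide how much data to load at once.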
More info
For example, trying to allocate 8 GB on a machine with 4 GB of RAM will lead to failure.
root@2dc9e20f923e:/srv# stress -m 1 --vm-bytes 8G
stress: info: [69] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [69] (415) <-- worker 70 got signal 9
stress: WARN: [69] (417) now reaping child worker processes
stress: FAIL: [69] (451) failed run completed in 6s
🔥 My GPU just disappeared from my deployment¶
You try to list the GPU and it doesn’t appear:
$ nvidia-smi
Failed to initialize NVML: Unknown Error
This is due to a known issue that we are working on fixing. If this is happening to you, please contact support.
In the meantime, your best option is to back up your data, delete your deployment and create a new one.
Storage issues¶
🔥 I cannot access /storage¶
You try to access “/storage” and you get the message:
root@226c02330e9f:/srv# ls /storage
ls: reading directory '/storage': Input/output error
This probably means that you have entered the wrong credentials when configuring your deployment in the Dashboard.
You will need to delete the current deployment and make a new one. Follow our guidelines on how to get an RCLONE user and password to fill the deployment configuration form.
🔥 Accessing /storage runs abnormally slow¶
This happens from time to time due to connectivity issues. If this behavior persists for more than a few days, try creating a new deployment.
If access is still slow in the new deployment, please contact support.
🔥 I cannot find my dataset under /storage/ai4-storage¶
This can happen if you are accessing the dataset from several deployments at the same time, and the ls command hasn’t properly refreshed its index.
To fix this, cd to the folder and run cd . so that the ls command refreshes its index (ref). You should now be able to see your dataset.
🔥 rclone fails to connect¶
You tried to use RCLONE manually and got the following error message:
2024/11/04 13:04:53 Failed to about: about call failed: No public access to this resource., Username or password was incorrect, No 'Authorization: Bearer' header found. Either the client didn't send one, or the server is mis-configured, Username or password was incorrect: Sabre\DAV\Exception\NotAuthenticated: 401 Unauthorized
This is probably because you are using an older RCLONE version (earlier than 1.63.3).
Update to a newer RCLONE version and find more information here.
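To check whether an installed RCLONE is older than the version mentioned above, the output of `rclone version` can be parsed. This is a minimal sketch: the helper names are hypothetical, and it assumes the first version string in the output looks like `rclone v1.62.2`.

```python
import re
import shutil
import subprocess

MIN_VERSION = (1, 63, 3)  # threshold quoted in the text above


def parse_rclone_version(text: str):
    """Extract (major, minor, patch) from text such as 'rclone v1.62.2'."""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def is_too_old(text: str, minimum=MIN_VERSION) -> bool:
    """Return True if the version found in `text` is earlier than `minimum`."""
    version = parse_rclone_version(text)
    if version is None:
        raise ValueError(f"could not find a version number in: {text!r}")
    return version < minimum


if __name__ == "__main__":
    if shutil.which("rclone") is None:
        print("rclone was not found in PATH")
    else:
        result = subprocess.run(
            ["rclone", "version"], capture_output=True, text=True, check=True
        )
        print("Update needed:", is_too_old(result.stdout))
```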
Other issues¶
🔥 Service X is not working¶
Check the Status page to see if there is any maintenance action going on. If you don’t see anything, wait a couple of hours to make sure it is not a temporary issue.
If the issue persists, please contact support.
🚀 I would like to suggest a new feature¶
We are always happy to improve our software based on user feedback.
Please open an issue in the GitHub repo of the component you are interested in.
If you think the documentation itself can be improved, don’t hesitate to open an issue or submit a Pull Request.
You can also check whether your suggested feature is already on the Upcoming features list.