This is what I get when I am connected and when I am logged out (after clearing cache/cookies)
I added pass_hashed and restarted the server, but I still get the same problem
Does what you suggested here ( https://github.com/allegroai/trains-agent/issues/18#issuecomment-634551232 ) also apply to containers used by the services queue?
(I use trains-agent 0.16.1 and trains 0.16.2)
The region is empty; I never entered it and it worked
So most likely trains was masking the original error; it might be worth investigating to help other users in the future
AgitatedDove14 The first time it installs and creates the cache for the env; the second time it fails with:
Applying uncommitted changes
ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
clearml_agent: ERROR: Command '['/home/user/.clearml/venvs-builds.1/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsmncaxx45.txt']' returned non-zero exit status 1.
These images are actually stored there and I can access them via the URL shared above (the one written in the pop-up message saying these files could not be deleted)
The host is accessible: I can ping it and even run curl "http://internal-aws-host-name:9200/_cat/shards" and get results from the local machine
PS: in the new env, I've set num_replicas: 0, so I'm only talking about primary shards…
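For reference, this is roughly how I set the replica count; a minimal sketch assuming the standard Elasticsearch settings API, with the same placeholder host name as above:
```python
# Sketch only: set number_of_replicas to 0 on all indices via the
# Elasticsearch settings API (host name is the placeholder from above).
import requests

resp = requests.put(
    "http://internal-aws-host-name:9200/_settings",
    json={"index": {"number_of_replicas": 0}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # expect {'acknowledged': True}
```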
so what worked for me was the following startup user-script:
```
#!/bin/bash
sleep 120
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential...
```
there is no error from this side; I think the AWS autoscaler just waits for the agent to connect, which will never happen since the agent won't start because the user-data script fails
Ok, I got the following error when uploading the table as an artifact: ValueError('Task object can only be updated if created or in_progress')
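For context, this is roughly what I am doing when the error appears (a minimal sketch; the project and artifact names are just illustrative):
```python
# Minimal sketch: upload a pandas DataFrame as an artifact during execution.
import pandas as pd
from clearml import Task

task = Task.init(project_name="examples", task_name="table-upload")
df = pd.DataFrame({"metric": ["loss", "accuracy"], "value": [0.12, 0.98]})
task.upload_artifact(name="results_table", artifact_object=df)
```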
nothing wrong from the ClearML side
/data/shared/miniconda3/bin/python /data/shared/miniconda3/bin/clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
AgitatedDove14 Is it possible to shut down the server while an experiment is running? I would like to resize the volume and then restart it (should take ~10 mins)
Will it freeze/crash/break/stop the ongoing experiments?
My use case is: on a spot instance that AWS has marked for termination in 2 minutes, I want to close the running task and prevent the clearml-agent from picking up a new task afterwards.
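Something along these lines is what I have in mind; just a sketch, assuming stopping the daemon with clearml-agent daemon --stop is acceptable on that machine:
```python
# Sketch of a spot-termination handler: close the current task, then stop
# the local agent daemon so it won't pull another task before shutdown.
import subprocess
from clearml import Task

task = Task.current_task()
if task is not None:
    task.mark_stopped()  # cleanly mark the running task as stopped

subprocess.run(["clearml-agent", "daemon", "--stop"], check=False)
```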
I also discovered https://h2oai.github.io/wave/ last week; it would be awesome to be able to deploy it in the same manner
I would probably leave it to the ClearML team to answer you; I am not using the UI app and for me it worked just fine with different regions. Maybe check the permissions of the key/secret?
btw, in the pytorch_distributed_example I see that you average_gradients, but the PyTorch docs ( https://pytorch.org/tutorials/beginner/dist_overview.html ) say: DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.
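To make sure we are talking about the same thing, this is the kind of manual averaging I mean (a sketch of the pattern; with plain DistributedDataParallel it should be redundant, since backward() already all-reduces the gradients):
```python
# Manual gradient averaging across workers (the pattern in question).
# With torch.nn.parallel.DistributedDataParallel this is redundant:
# DDP already all-reduces gradients during backward().
import torch.distributed as dist

def average_gradients(model):
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```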
Maybe there is a setting in Docker to move the space it uses to a different location? I can simply increase the storage of the first disk, no problem with that
Restarting the server (docker-compose down then docker-compose up) solved the problem! All experiments are back
I am trying to upload an artifact during the execution
That gave me:
Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda running python3
Building Task 94jfk2479851047c18f1fa60c1364b871 inside docker: ubuntu:18.04
Starting docker build
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
AgitatedDove14 This looks awesome! Unfortunately this would require a lot of changes in my current code, and for that project I found a workaround. But I will surely use it for the next pipelines I build!