Done! Also I tried to use git credential cache ( https://git-scm.com/docs/git-credential-cache ) as a workaround, hoping that the first time it clones the experiment repo, it caches the creds for the next times, but I then get a different error: fatal: unable to find a suitable socket path; use --socket
I want to make sure that an agent has finished uploading its artifacts before marking itself as complete, so that the controller does not try to access these artifacts while they are not yet available
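A minimal sketch of what I have in mind, assuming the clearml SDK's wait_on_upload argument and Task.flush() behave the way I think they do, and that this runs inside a live task (treat the exact calls as an assumption on my side):

    from clearml import Task
    import pandas as pd

    task = Task.current_task()
    df_scores = pd.DataFrame([{"task": "some-task-id", "score": 0.5}])  # placeholder data

    # Block until the artifact is actually stored, instead of uploading in the background
    task.upload_artifact("metric_summary", df_scores, wait_on_upload=True)

    # Make sure any remaining background uploads finish before the task reports itself complete
    task.flush(wait_for_uploads=True)
    task.mark_completed()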
nvm, the bug might be on my side. I will open an issue if I find an easily reproducible example
Just tried, still the same issue
This is how I start the agent that is running the two experiments in parallel:
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached
Sure, just sent you a screenshot in PM
you mean “docker” was not installed and it did not throw an error?
Yes, docker was not installed on the machine
Yes, you must make sure docker can mount a persistent folder for you to work in.
Ok, it would be nice to have a --user-folder-mounted flag that does the linking automatically
btw SuccessfulKoala55 the parameter is not documented in https://allegro.ai/clearml/docs/docs/references/clearml_ref.html#sdk-development-worker
When an experiment on trains-agent-1 is finished, I randomly see either no experiment or the long experiment, and when two experiments are running, I randomly see only one of the two experiments
SuccessfulKoala55 I found the issue thanks to you: I changed the domain a bit but didn’t update the apiserver.auth.cookies.domain setting - I updated it, restarted, and now it works 🙂 Thanks!
` Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.toc3_yks.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.1dsz4bz8:/root/.ssh', '-v', '/home/user/.clearml/apt-cache.2:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '...
I am doing:
try:
    score = get_score_for_task(subtask)
except:
    score = pd.NA
finally:
    df_scores = df_scores.append(dict(task=subtask.id, score=score), ignore_index=True)
    task.upload_artifact("metric_summary", df_scores)
AgitatedDove14 one last question: how can I enforce a specific wheel to be installed?
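One way I can imagine doing this (a sketch on my side - I'm assuming Task.add_requirements, called before Task.init, makes the agent pin that exact version when it rebuilds the environment; the package, version, and project/task names below are just placeholders):

    from clearml import Task

    # Assumption: must be called before Task.init for the agent to pick it up
    Task.add_requirements("torch", "1.8.1")  # hypothetical package and version

    task = Task.init(project_name="examples", task_name="pinned-wheel-test")  # hypothetical names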
I also ran sudo apt install nvidia-cuda-toolkit
Ok, I guess I’ll just delete the whole loss series. Thanks!
AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats 🚀 🙌 Was this bug fixed in this new version?
So I changed ebs_device_name = "/dev/sda1", and now I correctly get the 100GB EBS volume mounted on /. All good 👍
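For reference, this is roughly the resource configuration block I'm referring to in the AWS autoscaler config - the surrounding keys and values are from memory, so treat the exact layout and names as an assumption:

    resource_configurations:
      gpu_machine:                      # hypothetical resource name
        instance_type: g4dn.xlarge      # hypothetical instance type
        is_spot: false
        availability_zone: us-east-1a   # placeholder
        ami_id: ami-0123456789abcdef0   # placeholder AMI
        ebs_device_name: /dev/sda1      # must match the AMI's root device name for the size to apply
        ebs_volume_size: 100            # GB
        ebs_volume_type: gp3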
These images are actually stored there and I can access them via the URL shared above (the one written in the pop-up message saying that these files could not be deleted)
I made some progress TimelyPenguin76, now the task runs and I get this error from docker:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
yes, that's also what I thought
Hi SuccessfulKoala55, here it is: https://github.com/allegroai/clearml-server/issues/100
I hit F12 to check projects.get_all_ex but nothing is fired; I guess the web UI is just frozen in some weird state
Yes, that's what it looks like. Somehow when you clone the experiment repo, you correctly set the git creds in the URL, but when the dependencies are installed, the git creds are not taken into account
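For reference, the creds I mean are the ones the agent gets from its config and injects into the clone URL - a sketch of the relevant clearml.conf section (I believe these are the keys the agent reads, but treat them as an assumption):

    agent {
        # credentials the agent uses when cloning the experiment repository
        git_user: "my-git-user"        # placeholder
        git_pass: "my-personal-token"  # placeholder
    }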