but the post_packages does not reinstall version 1.7.1
I don’t have a registry to push my image to. I think I can get around it actually - will it work if I just build the image locally once, then start the agent? Docker would recognise that image locally and just use it, right? I won’t need to update that image often anyway
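Something along these lines is what I have in mind (image name and queue are just examples, and I haven’t verified that the agent skips the pull for a local-only tag):
```
# Build the image once, locally, on the agent machine - no registry involved
docker build -t my-experiment-image:latest .

# Point the agent at that local tag
clearml-agent daemon --queue default --docker my-experiment-image:latest
```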
Done! Also I tried to use the git credential cache ( https://git-scm.com/docs/git-credential-cache ) as a workaround, hoping that the first time it clones the experiment repo, it caches the creds for the next times, but I then get a different error: fatal: unable to find a suitable socket path; use --socket
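For reference, what I was trying looks roughly like this (untested sketch; the socket path is just an example and its directory must exist and be writable by the agent user):
```
mkdir -p /home/agent/.git-credential-cache
git config --global credential.helper 'cache --timeout=3600 --socket /home/agent/.git-credential-cache/socket'
```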
Sure, where can I find this file?
I can ssh into the agent and:
```
source /trains-agent-venv/bin/activate
(trains_agent_venv) pip show pyjwt
Version: 1.7.1
```
- Stopping the server
- Editing the docker-compose.yml file, adding the logging section to all services
- Restarting the server
Docker-compose freed 10 GB of logs
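For reference, the logging section I added looks something like this (the size limits are just the values I picked), e.g. under the apiserver service and likewise for the other services:
```
services:
  apiserver:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```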
Yes, that’s what I did initially, but eventually I decided that it adds too much complexity for nothing really. I’d rather drop omegaconf, and if one day clearml supports it out of the box, take advantage of it then
Thanks TimelyPenguin76 and AgitatedDove14! I would like to delete artifacts/models related to the old archived experiments, but they are stored on s3. Would that be possible?
Hi AgitatedDove14, sorry, somehow this message got lost 😄
clearml version is the latest at the time, 1.7.1
Yes, I always see the "model uploaded completed" message for such stuck tasks. I am using Python 3.8.10
Yes, but they are in plain text and I would like to avoid that
How exactly is the clearml-agent killing the task?
The task I cloned from is not the one I thought
AgitatedDove14 I see that the default is sample_frequency_per_sec=2., but in the UI I see that there isn’t such resolution (i.e. it logs every ~120 iterations, corresponding to ~30 secs). What is the difference with report_frequency_sec=30.?
Nice, thanks!
I found it, the filter actually has to be an iterable:
Task.get_tasks(project_name="my-project", task_name="my-task", task_filter=dict(type=["training"]))
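A minimal self-contained version of the call that worked for me (project and task names are placeholders):
```
from clearml import Task

# task_filter expects an iterable for field filters,
# e.g. type=["training"], not a bare string type="training"
tasks = Task.get_tasks(
    project_name="my-project",
    task_name="my-task",
    task_filter=dict(type=["training"]),
)
```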
Note: I can verify that post_packages is well picked up by the trains-agent, since in the experiment log I see:
```
agent.package_manager.type = pip
agent.package_manager.pip_version = \=\=20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.post_packages.0 = PyJWT\=\=1.7.1
```
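For completeness, the matching line in my clearml.conf looks roughly like this (from memory, so double-check the exact syntax against your own file):
```
agent.package_manager.post_packages: ["PyJWT==1.7.1"]
```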
The rest of the configuration is set with env variables
Oof, now I cannot start the second controller in the services queue on the same second machine, it fails with:
```
Processing /tmp/build/80754af9/cffi_1605538068321/work
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1605538068321/work'
clearml_agent: ERROR: Could not install task requirements!
Command '['/home/machine/.clearml/venvs-builds.1.3/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r'...
```
Alright SuccessfulKoala55, I was able to make it work by downgrading clearml-agent to 0.17.2
Can I simply set agent.python_binary = path/to/conda/python3.6?
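i.e. something like this in clearml.conf (the conda path is just an example based on my setup):
```
agent.python_binary: "/home/machine/miniconda3/envs/py36/bin/python3.6"
```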
Ok, deleting the installed packages list worked for the first task
in clearml.conf:
```
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
```
Since it fails on the first machine (clearml-server), I try to run it on another, on-prem machine (also used as an agent)
After I started clearml-session
I execute the clearml-agent this way:
```
/home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached
```