Since it fails on the first machine (clearml-server), I tried to run it on another, on-prem machine (also used as an agent)
Also, what do you mean by another machine? Are you running the ClearML services agent daemon on another machine?
in clearml.conf:
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
SuccessfulKoala55 I tried to set up clearml-agent on a different machine and now I get a different error message in the logs:
Warning: could not locate requested Python version 3.6, reverting to version 3.6
clearml_agent: ERROR: Python executable with version '3.6' defined in configuration file, key 'agent.default_python', not found in path, tried: ('python3.6', 'python3', 'python')
I think clearml-agent tries to execute /usr/bin/python3.6 to start the task, instead of using the python that was used to start clearml-agent
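For what it's worth, here is a quick sanity check I can run on that machine, assuming the agent resolves the interpreter by name from PATH as the error suggests (the conda path below is just my setup):
# names the agent tries, in order, according to the error
which python3.6 python3 python
# the interpreter the daemon itself was started with (my conda env)
/home/machine/miniconda3/envs/py36/bin/python3 --version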
What's the error on the other machine?
The file /tmp/.clearml_agent_out.j7wo7ltp.txt
does not exist
I execute the clearml-agent this way:
/home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached
I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird
so the clearml-agent daemon takes 3.6 as the default, and when running the service, for some reason 3.6 is not in the path
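A workaround I would try (just a sketch, assuming python3.6 only needs to be resolvable from PATH when the service starts): prepend the conda env's bin directory before launching the daemon, e.g.:
# expose the conda env's python3.6 by name (paths from my setup)
export PATH=/home/machine/miniconda3/envs/py36/bin:$PATH
clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached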
whatever will allow the agent daemon to create a venv 🙂
if not set, this value is taken from the system python
Python executable with version '3.6' defined in configuration file
Can I simply set agent.python_binary = path/to/conda/python3.6 ?
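Something like this in clearml.conf is what I have in mind (the path is my conda env, and I am not sure this is the right combination of keys):
agent.python_binary = /home/machine/miniconda3/envs/py36/bin/python3.6
agent.default_python = 3.6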
JitteryCoyote63 can you try to look at the logs in /tmp/.clearml_agent_out.j7wo7ltp.txt ?
and in the logs:
agent.worker_name = worker1
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.torch_nightly = false
agent.venvs_dir = /home/machine/.clearml/venvs-builds.1.2
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/machine/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/machine/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /home/machine/.clearml/pip-cache
agent.docker_apt_cache = /home/machine/.clearml/apt-cache.1.2
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.default_python = 3.6
agent.cuda_version = 0
agent.cudnn_version = 0
interestingly, it works on one machine, but not on another one
User aborted: stopping task
usually means the Task status changed or "stopping" was placed in the status_message field while the task was running
Oof, now I cannot start the second controller in the services queue on the same second machine, it fails with:
Processing /tmp/build/80754af9/cffi_1605538068321/work
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1605538068321/work'
clearml_agent: ERROR: Could not install task requirements!
Command '['/home/machine/.clearml/venvs-builds.1.3/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsi4hq9s6z.txt']' returned non-zero exit status 1.
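My guess (not confirmed) is that the installed packages were frozen from a conda env, so they reference local build directories that do not exist on the agent machine; a quick check from the original environment:
# look for local conda build paths in the frozen requirements
pip freeze | grep '/tmp/build'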
Alright SuccessfulKoala55 I was able to make it work by downgrading clearml-agent to 0.17.2
Ok, now I get ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))
Ok, deleting the installed packages list worked for the first task