Answered
Hi, I Face A Strange Behavior From The Clearml-Agent: It’S Running In Services Mode, Not In Docker Mode, Cpu Only. I Want To Execute Two Tasks On This Service Agent. One Works, The Other Always Fails After Being Enqueued And Picked By The Agent With The E

Hi, I'm facing strange behavior from the clearml-agent: it's running in services mode, not in docker mode, CPU only. I want to execute two tasks on this services agent. One works; the other always fails after being enqueued and picked up by the agent, with the error:
task xyz pulled from xyz by worker clearml-server:services
Running task 'xyz'
User aborted: stopping task (3)

I can see the task on the queue page of the dashboard for a few minutes, and then it disappears. The failing task doesn't start my Python code at all (I've put print statements at the beginning). Any idea why this could happen? (clearml-agent version 1.0.0, clearml version 1.0.4)

  
  
Posted 2 years ago

Answers 30


Since it fails on the first machine (clearml-server), I tried to run it on another, on-prem machine (also used as an agent)

  
  
Posted 2 years ago

I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird

  
  
Posted 2 years ago

and in the logs:
agent.worker_name = worker1
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = \=\=20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.torch_nightly = false
agent.venvs_dir = /home/machine/.clearml/venvs-builds.1.2
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/machine/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/machine/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /home/machine/.clearml/pip-cache
agent.docker_apt_cache = /home/machine/.clearml/apt-cache.1.2
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.default_python = 3.6
agent.cuda_version = 0
agent.cudnn_version = 0
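
For anyone reading a dump like the one above, here is a minimal sketch of pulling the Python-related settings out of it. It assumes every entry starts with `agent.` and has the form `key = value`; the function name `parse_agent_dump` is made up for illustration and is not part of clearml-agent.

```python
import re


def parse_agent_dump(text):
    """Split a dump of 'agent.<key> = <value>' entries into a dict.

    Works whether the entries are one per line or run together on one line,
    as long as each entry begins with 'agent.' (an assumption of this sketch).
    """
    settings = {}
    # Split on whitespace that is immediately followed by the next 'agent.' key.
    for chunk in re.split(r"\s+(?=agent\.)", text.strip()):
        key, sep, value = chunk.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    dump = (
        "agent.worker_name = worker1 agent.python_binary = "
        "agent.package_manager.type = pip agent.default_python = 3.6"
    )
    parsed = parse_agent_dump(dump)
    # An empty python_binary next to default_python = 3.6 is the combination
    # discussed in this thread.
    print(repr(parsed["agent.python_binary"]), parsed["agent.default_python"])
```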

  
  
Posted 2 years ago

yes exactly

  
  
Posted 2 years ago

SuccessfulKoala55 I tried setting up the clearml-agent on a different machine, and now I get a different error message in the logs:
Warning: could not locate requested Python version 3.6, reverting to version 3.6
clearml_agent: ERROR: Python executable with version '3.6' defined in configuration file, key 'agent.default_python', not found in path, tried: ('python3.6', 'python3', 'python')
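
A quick way to see what the daemon's environment can resolve is to repeat the lookup by hand. The sketch below only checks whether the three names from the error message are on PATH; it does not verify the version of whatever it finds, and it is not the agent's actual lookup code. The function name `find_python` and its `which` parameter are made up for illustration.

```python
import shutil
import sys


def find_python(version, which=shutil.which):
    """Return (name, path) for the first candidate found on PATH, else None."""
    # Candidate names copied from the error message: python<version>, python3, python.
    candidates = ("python" + version, "python3", "python")
    for name in candidates:
        path = which(name)
        if path:
            return name, path
    return None


if __name__ == "__main__":
    # Run this with the same interpreter and environment used to start the daemon.
    print("interpreter running this script:", sys.executable)
    print("PATH lookup for 3.6:", find_python("3.6"))
```

If the lookup returns None while `sys.executable` points into a conda environment, the environment's `bin` directory is probably not on the PATH the daemon sees.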

  
  
Posted 2 years ago

if not set, this value is taken from the system python

  
  
Posted 2 years ago

Nice, seems to work! 🎉

  
  
Posted 2 years ago

Oof now I cannot start the second controller in the services queue on the same second machine, it fails with
Processing /tmp/build/80754af9/cffi_1605538068321/work
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1605538068321/work'
clearml_agent: ERROR: Could not install task requirements!
Command '['/home/machine/.clearml/venvs-builds.1.3/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsi4hq9s6z.txt']' returned non-zero exit status 1.
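
An error like this usually means the recorded requirements contain a reference to a local build directory that only existed on the machine where the package was built. The thread does not show the requirements list, so the entry format below (`cffi @ file:///tmp/build/...`, which is what `pip freeze` can emit inside a conda environment) is an assumption. Under that assumption, a minimal sketch for cleaning such a list before re-running the task could look like this; `strip_local_paths` is a hypothetical helper, not a ClearML function.

```python
import re

# Matches 'name @ file:///some/local/path' entries (assumed format, see above).
LOCAL_REF = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*@\s*file://\S+\s*$")


def strip_local_paths(requirements):
    """Replace 'pkg @ file://...' lines with the bare package name.

    Note that this drops the version pin for those packages, so pip will
    install whatever version it resolves.
    """
    cleaned = []
    for line in requirements.splitlines():
        match = LOCAL_REF.match(line)
        cleaned.append(match.group(1) if match else line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    reqs = (
        "numpy==1.19.2\n"
        "cffi @ file:///tmp/build/80754af9/cffi_1605538068321/work\n"
        "clearml==1.0.4"
    )
    print(strip_local_paths(reqs))
```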

  
  
Posted 2 years ago

Ok, now I get ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))

  
  
Posted 2 years ago

whatever will allow the agent daemon to create a venv 🙂

  
  
Posted 2 years ago

Ok, deleting the installed packages list worked for the first task

  
  
Posted 2 years ago

The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist

  
  
Posted 2 years ago

Can I simply set agent.python_binary = path/to/conda/python3.6 ?

  
  
Posted 2 years ago

Python executable with version '3.6' defined in configuration file

  
  
Posted 2 years ago

I execute the clearml-agent this way:
/home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached

  
  
Posted 2 years ago

"User aborted: stopping task" usually means that the task status changed, or that "stopping" was placed in the status_message field, while the task was running
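
To make that rule concrete, here is a small sketch of the check as described in this answer. It only restates the condition in code and is not taken from the clearml or clearml-agent sources; the function name, its parameters, and the substring match on "stopping" are all assumptions.

```python
def looks_user_aborted(status_at_start, current_status, status_message):
    """Return True if a running task would be treated as aborted.

    Mirrors the rule stated above: either the status is no longer what it
    was when the task started running, or the status message asks to stop.
    """
    status_changed = current_status != status_at_start
    # Substring match is an assumption; the exact comparison is not given in the thread.
    stopping_requested = "stopping" in (status_message or "").lower()
    return status_changed or stopping_requested


if __name__ == "__main__":
    print(looks_user_aborted("in_progress", "in_progress", ""))          # still running
    print(looks_user_aborted("in_progress", "stopped", ""))              # status changed
    print(looks_user_aborted("in_progress", "in_progress", "stopping"))  # stop requested
```

In practice this suggests checking whether something else (another agent, a controller, or the UI) is touching the task while it starts.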

  
  
Posted 2 years ago

OK, that makes more sense, I guess

  
  
Posted 2 years ago

agent.default_python = 3.6

  
  
Posted 2 years ago

I think clearml-agent tries to execute /usr/bin/python3.6 to start the task, instead of using the Python that was used to start clearml-agent

  
  
Posted 2 years ago

Also, what do you mean by another machine? Are you running the ClearML services agent daemon on another machine?

  
  
Posted 2 years ago

interestingly, it works on one machine, but not on another one

  
  
Posted 2 years ago

same as the first one described

  
  
Posted 2 years ago

JitteryCoyote63 can you try to look at the logs in /tmp/.clearml_agent_out.j7wo7ltp.txt ?

  
  
Posted 2 years ago

What is configured?

  
  
Posted 2 years ago

Alright SuccessfulKoala55 I was able to make it work by downgrading clearml-agent to 0.17.2

  
  
Posted 2 years ago

but it is set here, right?

  
  
Posted 2 years ago

so the clearml-agent daemon takes 3.6 as the default, and when running the service, for some reason 3.6 is not in the path

  
  
Posted 2 years ago

What's the error on the other machine?

  
  
Posted 2 years ago

exactly

  
  
Posted 2 years ago

in clearml.conf:
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"

  
  
Posted 2 years ago