btw, AgitatedDove14 I launch the agent daemon outside docker (with `--docker`), that’s the way it is supposed to work, right?
$ clearml-agent daemon --detached --queue manual_jobs automated_jobs --docker --gpus 0
And then the worker itself will run the `docker run` command for me and start another non-daemon agent inside.
I guess the failure happens when it tries to switch to docker, because the same experiment works with agents not started with `--docker`.
AgitatedDove14 no I mean I can do:
```shell
$ docker run -t --gpus "device=1" -dit -e APP_ENV=kprod -e CLEARML_WORKER_ID=ada:gpu1 -e CLEARML_DOCKER_IMAGE=922531023312.dkr.ecr.us-west-2.amazonaws.com/jym-coach:202108080511.7e8d6d1 -v /home/smjahad/.gitconfig:/root/.gitconfig -v /tmp/.clearml_agent.kjx6r9oo.cfg:/root/clearml.conf -v /tmp/clearml_agent.ssh.l8cguj81:/root/.ssh -v /home/smjahad/.clearml/apt-cache.1:/var/cache/apt/archives -v /home/smjahad/.clearml/pip-cache:/root/.cache/pip -v /home/smjah...
```
I tried with and without. I’m having the issue where, if I run the task from the queue, it completes as soon as it switches into docker; but if I run the same `docker run` command manually, it works.
I’m wondering, would an older version of the agent work well with a newer server version and vice-versa?
AgitatedDove14 this works:
pip install git+ssh://firstname.lastname@example.org/user/repo.git
AgitatedDove14 should I try running the above command with privileged user?
I think it’s great to let users build their own UI-connected apps, I’d use that for sure!
```shell
$ git remote -v
fork    email@example.com:salimmj/somerepo.git (fetch)
fork    firstname.lastname@example.org:salimmj/somerepo.git (push)
origin  email@example.com:mainuser/somerepo.git (fetch)
origin  firstname.lastname@example.org:mainuser/somerepo.git (push)
```
I want to keep the above setup; the remote branch that will track my local one will be on fork, so it needs to pull from there. Currently it recognizes origin, so it doesn’t work because the agent then can’t find the commit.
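One workaround some people use in this situation (a sketch, not a ClearML feature — the URLs below reuse the anonymized ones from the `git remote -v` output above, and the scratch-repo setup is just for demonstration) is to repoint origin at the fork, so the remote the discovery recognizes is also the one that actually has the commit:

```shell
# demo in a scratch repo; in an actual clone you would only run the set-url line
cd "$(mktemp -d)" && git init -q
git remote add origin email@example.com:mainuser/somerepo.git

# repoint origin at the fork so the pushed commit is resolvable
# from the remote the git discovery picks up
git remote set-url origin email@example.com:salimmj/somerepo.git
git remote -v   # fetch and push for origin now both point at the fork
```

The downside is that you lose the fork/origin split locally, which is why an explicit repo-override option would still be nicer.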
I know this is not the default behavior so I’d be happy with having the option to override the repo when I call
AgitatedDove14 wouldn’t the above command
task.execute_remotely(queue_name=None, clone=False, exit_process=False) fail because
clone==False and exit_process==False is not supported. Task enqueuing itself must exit the process afterwards.
I thought it worked earlier 😮
It recognizes the main repo, but I want it to push and pull from another one (my own forked repo). AgitatedDove14
I already have that set to true and want that behavior. The issue is on the “committed” change set. When I push code to github I push to my fork and pull from the main/master repo (all changes go through PRs from fork to main).
Now when I use
execute_remotely, whatever code does the git discovery considers the repo I pull from to be the one to use. But these changes haven’t necessarily been merged into main; the correct behavior would be to use the forked repo.
AgitatedDove14 when I try this I get
clearml.backend_interface.session.SendError: Action failed <400/110: tasks.enqueue/v1.0 (Invalid task status (Invalid status change): current_status=in_progress, new_status=queued)> (queue=e78d2fdf2d5140b6b5c6678338c532bb, task=95082c9174a04044b25253d724362ec1)
This is exactly what I was looking for. I thought once you call
execute_remotely the task is sent and it’s too late to change anything.
Fixed it by adding this code block. Makes sense.
```python
if clone:
    task = Task.clone(self)
else:
    task = self
    # check if the server supports enqueueing aborted/stopped Tasks
    if Session.check_min_api_server_version('2.13'):
        self.mark_stopped(force=True)
    else:
        self.reset()
```
For your second question: those are generated using custom tooling; it relies on the build system being set up, which is guaranteed by the docker image used. So I don’t think this is a case of supporting a specific env setup or build tool, but just of allowing a custom script for the env-setup / code-build step.
That won’t work 😕
The docker shell script runs too early in the process.
I want to inject a bash command after the repo has been cloned (and maybe even after the venv has been installed).
TimelyPenguin76 After creating the venv (so I don’t have to do it myself). Once an env is there, I need to run a script while the env is activated from the root of the repo.
So when the repo is cloned and the venv is created and activated, I want to execute this from the repo:
SuccessfulKoala55 I tried to make a docker image by combining one of our dockerfiles with this https://github.com/allegroai/clearml-agent/blob/master/docker/agent/Dockerfile . I modified the
entrypoint to also be a combination of both.
Right now I’m not seeing that error, but the process seems to exit (as completed) right after the `docker run`. I’m wondering if my Dockerfile is not properly set up and it’s exiting before the daemon is started.
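In case it helps debugging: a container stops as soon as its entrypoint (PID 1) returns, so if the combined entrypoint backgrounds the daemon or falls through, the container exits as "completed". A sketch of what the last lines of a combined entrypoint might look like (the setup line and queue names are placeholders, not from the actual Dockerfiles):

```shell
#!/bin/bash
# hypothetical image-specific setup inherited from the original entrypoint
/opt/app/original-entrypoint.sh

# exec replaces the shell so the agent daemon becomes PID 1 and keeps the
# container alive; note: no --detached here, the daemon must stay in the
# foreground
exec clearml-agent daemon --queue manual_jobs automated_jobs --docker
```

If the combined entrypoint instead runs the daemon with `--detached` (or backgrounds it with `&`), the script returns and the container exits immediately.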
Is it possible to set that at task enqueueing SuccessfulKoala55 ?
ugh, sudo actually makes it fail explicitly because
```
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
- Make sure you pushed the requested commit:
  (email@example.com:salimmj/clearml-demo.git', branch='main', commit_id='f76f3affd28d5558928d7ffd9a6797890ffdd708', tag='', docker_cmd='nvidia/cuda:11.4.0-runtime-ubuntu20.04', entry_point='mnist.py', working_dir='.')
- Check if remote-wo...
```
I don’t think it’s that, because it works if I run the same command manually.
```shell
$ python --version
Python 3.6.8
$ python repo/toy_workflow.py --logtostderr --logtoclearml --clearml_queue=ada_manual_jobs
2021-08-07 04:04:16,844 - clearml - WARNING - Switching to remote execution, output log page https://...
```
On the webpage logs I see this:
2021-08-07 04:04:12 ClearML Task: created new task id=f1092bcbe30249639122a49a9b3f9145
ClearML results page:
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
In fact, if there is a good python API to list/duplicate/edit/run experiments by ID, it seems straightforward to do that from Airflow (or any other job scheduler). I’m just wondering if there is some built-in scheduler.
I don’t mean a serving endpoint, just the equivalent of “cloning an experiment” and running it on a different (larger) dataset.
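For the record, the SDK does expose exactly that. An untested sketch (assumes a reachable ClearML server; the task id, parameter name, and queue name are placeholders):

```python
from clearml import Task

# fetch the template experiment by id, clone it, tweak it, and enqueue it --
# the programmatic equivalent of "clone + edit + enqueue" in the UI
template = Task.get_task(task_id='<template-task-id>')               # placeholder id
cloned = Task.clone(source_task=template, name='larger-dataset run')
cloned.set_parameter('Args/dataset', 's3://bucket/larger-dataset')   # placeholder param
Task.enqueue(cloned, queue_name='automated_jobs')
```

From Airflow this would just be a PythonOperator calling something like the above; newer SDK versions also ship a `clearml.automation.TaskScheduler`, worth checking against the docs for your version.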
Issue seems fixed now, thanks! Is the fact that clearml-agent needs to be installed from the system python mentioned anywhere in the docs? If not, I suggest it gets added.
Thank you so much for helping.
OH! I was installing it on an env