
btw, AgitatedDove14 I launch the agent daemon outside docker (with --docker), that's the way it is supposed to work, right?
$ clearml-agent daemon --detached --queue manual_jobs automated_jobs --docker --gpus 0
And then the worker itself will run the docker run command for me and start another non-daemon agent inside.
I guess the failure happens when it tries to switch to docker, because the same experiment works with agents not started with the --docker flag.
AgitatedDove14 no I mean I can do:
` docker run -t --gpus "device=1" -dit -e APP_ENV=kprod -e CLEARML_WORKER_ID=ada:gpu1 -e CLEARML_DOCKER_IMAGE=922531023312.dkr.ecr.us-west-2.amazonaws.com/jym-coach:202108080511.7e8d6d1 -v /home/smjahad/.gitconfig:/root/.gitconfig -v /tmp/.clearml_agent.kjx6r9oo.cfg:/root/clearml.conf -v /tmp/clearml_agent.ssh.l8cguj81:/root/.ssh -v /home/smjahad/.clearml/apt-cache.1:/var/cache/apt/archives -v /home/smjahad/.clearml/pip-cache:/root/.cache/pip -v /home/smjah...
I tried with and without. I'm having the issue where if I run the task from the queue it will complete as soon as it goes into docker, but if I run the same docker run myself it works.
I'm wondering, would an older version of the agent work well with a newer server version, and vice versa?
AgitatedDove14 this works: pip install git+ssh://git@github.com/user/repo.git
AgitatedDove14 should I try running the above command with privileged user?
I think it's great to let users build their own UI-connected apps, I'd use that for sure!
$ git remote -v
fork    git@github.com:salimmj/somerepo.git (fetch)
fork    git@github.com:salimmj/somerepo.git (push)
origin  git@github.com:mainuser/somerepo.git (fetch)
origin  git@github.com:mainuser/somerepo.git (push)
I want to keep the above setup; the remote branch that will track my local one will be on fork, so it needs to pull from there. Currently it recognizes origin, so it doesn't work because the agent then can't find the commit.
I know this is not the default behavior, so I'd be happy with having the option to override the repo when I call execute_remotely.
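To illustrate the override I'm asking for, here's a minimal sketch (pure Python; the function and its names are my own invention, not part of clearml) of the selection rule I'd want the git discovery to follow: prefer the fork remote over origin when both exist, since an unmerged commit only lives on the fork:

```python
def pick_agent_remote(remotes, preferred="fork"):
    """Pick the git remote URL an agent should clone from.

    `remotes` maps remote name -> URL, as listed by `git remote -v`.
    Preferring the fork mirrors the desired override: unmerged commits
    only exist on the fork, so cloning origin can't find them.
    """
    if preferred in remotes:
        return remotes[preferred]
    return remotes["origin"]

remotes = {
    "fork": "git@github.com:salimmj/somerepo.git",
    "origin": "git@github.com:mainuser/somerepo.git",
}
print(pick_agent_remote(remotes))  # prints the fork URL, where the commit actually exists
```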
AgitatedDove14 wouldn't the above command task.execute_remotely(queue_name=None, clone=False, exit_process=False) fail with "clone==False and exit_process==False is not supported. Task enqueuing itself must exit the process afterwards."?
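For reference, the rejected combination can be sketched as a simple guard (my own illustration of the rule, not the actual clearml source):

```python
def check_enqueue_args(clone: bool, exit_process: bool) -> None:
    # clone=False + exit_process=False is rejected: enqueuing the
    # currently running task means this process must exit afterwards,
    # otherwise two copies of the same task would be executing at once.
    if not clone and not exit_process:
        raise ValueError(
            "clone==False and exit_process==False is not supported. "
            "Task enqueuing itself must exit the process afterwards."
        )

check_enqueue_args(clone=True, exit_process=False)   # fine: a clone is enqueued
try:
    check_enqueue_args(clone=False, exit_process=False)
except ValueError as err:
    print(err)
```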
I thought it worked earlier 😮
It recognizes the main repo, but I want it to push and pull from another one (my own forked repo). AgitatedDove14
I already have that set to true and want that behavior. The issue is with the "committed" change set. When I push code to GitHub I push to my fork and pull from the main/master repo (all changes go through PRs from fork to main).
Now when I use execute_remotely, whatever code does the git discovery takes whatever repo I pull from as the repo to use. But these changes haven't necessarily been merged into main. The correct behavior would be to use the forked repo.
AgitatedDove14 when I try this I get:
clearml.backend_interface.session.SendError: Action failed <400/110: tasks.enqueue/v1.0 (Invalid task status (Invalid status change): current_status=in_progress, new_status=queued)> (queue=e78d2fdf2d5140b6b5c6678338c532bb, task=95082c9174a04044b25253d724362ec1)
This is exactly what I was looking for. I thought once you call execute_remotely the task is sent and it's too late to change anything.
Fixed it by adding this code block. Makes sense.

```
if clone:
    task = Task.clone(self)
else:
    task = self
    # check if the server supports enqueueing aborted/stopped Tasks
    if Session.check_min_api_server_version('2.13'):
        self.mark_stopped(force=True)
    else:
        self.reset()
```
For your second question, those are generated using custom tooling; it relies on the build system being set up, which is guaranteed by the docker image used. So I don't think this is a case of supporting a specific env setup or build tool, but just allowing a custom script for the env setup / code-building step.
WDYT?
That won't work 🙂
The docker shell script runs too early in the process.
I want to inject a bash command after the repo has been cloned (and maybe even after the venv has been installed).
TimelyPenguin76 After creating the venv (so I don't have to do it myself). Once an env is there, I need to run a script from the root of the repo while the env is activated.
So when the repo is cloned and the venv is created and activated, I want to execute this from the repo root: tools/setup_dependencies.sh
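What I have in mind could be sketched like this (a hypothetical hook, not an existing agent feature; `repo_root` and `venv_bin` are placeholders for wherever the agent put the checkout and the venv):

```python
import os
import subprocess

def run_repo_setup(repo_root, venv_bin):
    """Run the repo's own setup script after clone + venv creation.

    Prepending the venv's bin directory to PATH approximates running
    with the env 'activated'; the script is run from the repo root.
    """
    env = dict(os.environ)
    env["PATH"] = venv_bin + os.pathsep + env.get("PATH", "")
    subprocess.run(
        [os.path.join(repo_root, "tools", "setup_dependencies.sh")],
        cwd=repo_root,
        env=env,
        check=True,
    )
```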
SuccessfulKoala55 I tried to make a docker image by combining one of our dockerfiles with this https://github.com/allegroai/clearml-agent/blob/master/docker/agent/Dockerfile . I modified the entrypoint
to also be a combination of both.
Right now I'm not seeing that error, but the process seems to exit (as completed) right after the docker run. I'm wondering if my Dockerfile is not properly set up and it's exiting before the daemon is started.
Is it possible to set that at task enqueueing SuccessfulKoala55 ?
ugh, sudo actually makes it fail explicitly because
` error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
- Make sure you pushed the requested commit:
(repository='git@github.com:salimmj/clearml-demo.git', branch='main', commit_id='f76f3affd28d5558928d7ffd9a6797890ffdd708', tag='', docker_cmd='nvidia/cuda:11.4.0-runtime-ubuntu20.04', entry_point='mnist.py', working_dir='.') - Check if remote-wo...
It's not that, I think, because it works if I run the same command manually.
$ python --version
Python 3.6.8
$ python repo/toy_workflow.py --logtostderr --logtoclearml --clearml_queue=ada_manual_jobs
2021-08-07 04:04:16,844 - clearml - WARNING - Switching to remote execution, output log page https://...
On the webpage logs I see this:
2021-08-07 04:04:12 ClearML Task: created new task id=f1092bcbe30249639122a49a9b3f9145 ClearML results page:
`
2021-08-07 04:04:14
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2021-08...
In fact, if there is a good Python API to list/duplicate/edit/run experiments by ID, it seems straightforward to do that from Airflow (or any other job scheduler). I'm just wondering if there is some built-in scheduler.
I don't mean a serving endpoint, just the equivalent of "cloning an experiment" and running it on a different (larger) dataset.
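Clone-and-rerun from a script would look something like this sketch, assuming the clearml SDK; the task id, queue name, parameter name, and the function itself are placeholders of mine (the parameter section/name depends on how the original script connected its configuration):

```python
def rerun_on_bigger_dataset(template_task_id, dataset_uri, queue_name):
    """Clone an existing experiment, point it at another dataset, enqueue it."""
    # deferred import so the sketch can be read without a clearml server
    from clearml import Task

    template = Task.get_task(task_id=template_task_id)
    new_task = Task.clone(
        source_task=template,
        name=template.name + " (large dataset)",
    )
    # hypothetical hyperparameter; use whatever key your script registered
    new_task.set_parameter("Args/dataset", dataset_uri)
    Task.enqueue(new_task, queue_name=queue_name)
    return new_task
```

An external scheduler like Airflow would just call this on a cron, which is basically the built-in-scheduler question above.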
Issue seems fixed now, thanks! Is the fact that clearml-agent needs to be installed from the system Python mentioned anywhere in the docs? If not, I suggest it gets added.
Thank you so much for helping.
OH! I was installing it in an env