
It is indeed autopopulated by init
Fixed it by adding this code block. Makes sense.
    if clone:
        task = Task.clone(self)
    else:
        task = self
        # check if the server supports enqueueing aborted/stopped Tasks
        if Session.check_min_api_server_version('2.13'):
            self.mark_stopped(force=True)
        else:
            self.reset()
The private_package can be installed by doing pip install git+ssh://git@github.com/user/private_package.git but the agent is trying to do pip install private_package, which won't work.
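For illustration, here is a minimal sketch of how the requirement could be written so that pip is pointed at the repository instead of the package index. The helper name vcs_requirement is hypothetical (it is not part of clearml or pip); the output format is the standard "name @ URL" direct reference that pip accepts. Whether the agent picks this line up depends on how the task's requirements are recorded, which is an assumption here.

```python
def vcs_requirement(package, repo_url, ref=""):
    """Build a pip direct-reference requirement line for a git repository.

    Hypothetical helper for illustration only, not a clearml or pip API.
    """
    # pip expects the VCS scheme prefix, e.g. git+ssh://...
    url = repo_url if repo_url.startswith("git+") else "git+" + repo_url
    # An optional branch, tag or commit is appended after "@".
    if ref:
        url = f"{url}@{ref}"
    return f"{package} @ {url}"


line = vcs_requirement(
    "private_package",
    "ssh://git@github.com/user/private_package.git",
)
print(line)
```

A line in this form can go into a requirements.txt, so the install step resolves the package from the git URL over ssh rather than by its bare name.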
I am already forcing ssh auth
I think it's great to let users build their own UI-connected apps, I'd use that for sure!
You're saying there's a built-in scheduler? SuccessfulKoala55
If so where can I find it?
AgitatedDove14 when I try this I get: clearml.backend_interface.session.SendError: Action failed <400/110: tasks.enqueue/v1.0 (Invalid task status (Invalid status change): current_status=in_progress, new_status=queued)> (queue=e78d2fdf2d5140b6b5c6678338c532bb, task=95082c9174a04044b25253d724362ec1)
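To make the error easier to read: the server is refusing to move a task that is still in_progress straight to queued. A small sketch that pulls the two statuses out of the message text is shown here. parse_status_change is a hypothetical helper, not part of clearml, and the error string is copied from the message above with the queue and task IDs left out.

```python
import re

# Error text copied from the message above (queue and task IDs omitted).
ERROR_TEXT = (
    "Action failed <400/110: tasks.enqueue/v1.0 (Invalid task status "
    "(Invalid status change): current_status=in_progress, new_status=queued)>"
)


def parse_status_change(message):
    """Return (current_status, new_status) from the error text, or None.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"current_status=(\w+), new_status=(\w+)", message)
    return (match.group(1), match.group(2)) if match else None


print(parse_status_change(ERROR_TEXT))
```

This matches the snippet quoted earlier, where the task is stopped (or reset on older servers) before it is enqueued again.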
ugh, sudo actually makes it fail explicitly because
` error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
- Make sure you pushed the requested commit:
(repository='git@github.com:salimmj/clearml-demo.git', branch='main', commit_id='f76f3affd28d5558928d7ffd9a6797890ffdd708', tag='', docker_cmd='nvidia/cuda:11.4.0-runtime-ubuntu20.04', entry_point='mnist.py', working_dir='.') - Check if remote-wo...
I'm wondering, would an older version of the agent work well with a newer server version and vice-versa?
I tried with and without. I'm having an issue where, if I run the task from the queue, it completes as soon as it goes into docker, but if I run the same docker run myself, it works.
I don't mean a serving endpoint, just the equivalent of "cloning an experiment" and running it on a different (larger) dataset.
That won't work
The docker shell script runs too early in the process.
I want to inject a bash command after the repo has been cloned (and maybe even after the venv has been installed).
I'm not sure, but it seems like you get different kinds of flexibility depending on whether you enqueue the task yourself or rely on execute_remotely. Ideally, I could choose to get the benefit of the auto-scanning provided by execute_remotely as well as more flexibility.
EagerOtter28 I'm running into a similar situation as you.
I think you could use --standalone-mode and do the cloning yourself in the docker bash script that you can configure in the agent config.
Being able to create and remove queues as well as list their contents.
AgitatedDove14 it was executed with Python 3 and I'm running in venv mode.
It doesn't install it automatically, I think I need to specify it somewhere, see the above error. Or am I misunderstanding?
AgitatedDove14 should I try running the above command as a privileged user?
I don't think it's that, because it works if I run the same command manually.
SuccessfulKoala55 I tried to make a docker image by combining one of our dockerfiles with this https://github.com/allegroai/clearml-agent/blob/master/docker/agent/Dockerfile . I modified the entrypoint to also be a combination of both.
Right now I'm not seeing that error, but the process seems to exit (as completed) after the docker run. I'm wondering if my Dockerfile is not properly set up and it's exiting before the daemon is started.
btw, AgitatedDove14 I launch the agent daemon outside docker (with --docker), that's the way it is supposed to work, right?
$ clearml-agent daemon --detached --queue manual_jobs automated_jobs --docker --gpus 0
And then the worker itself will run the docker run command for me and start another non-daemon agent inside.
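As a rough sketch of the launch described above, the same daemon command can be assembled from Python, which makes it easy to toggle --docker when comparing the two modes. daemon_command is a hypothetical helper; it only uses the flags that appear in the command above and does not start anything by itself.

```python
import shlex


def daemon_command(queues, gpus=None, docker=False, detached=True):
    """Assemble the clearml-agent daemon argv list.

    Hypothetical helper; uses only the flags shown in the command above.
    """
    cmd = ["clearml-agent", "daemon"]
    if detached:
        cmd.append("--detached")
    # One --queue flag followed by every queue name the worker listens on.
    cmd += ["--queue", *queues]
    if docker:
        cmd.append("--docker")
    if gpus is not None:
        cmd += ["--gpus", str(gpus)]
    return cmd


cmd = daemon_command(["manual_jobs", "automated_jobs"], gpus=0, docker=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on the host, outside docker, assuming clearml-agent is installed there.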
I guess the failure happens when it tries to switch to docker, because the same experiment works with agents not started with the --docker flag.