Generally speaking, the agent will convert the repo url to the auth scheme it is configured with: ssh->http if using user/pass, and http->ssh if using ssh.
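For example, a minimal sketch of the credentials side of that in the agent's clearml.conf (the values are placeholders):
```
agent {
    # with user/pass (or user + app-password) configured,
    # ssh:// repo links are rewritten to the equivalent https:// links
    git_user: "my-git-user"
    git_pass: "my-app-password"
}
```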
Let me know if it solved the issue 🙂
JitteryCoyote63 I remember something with "!" in the name or maybe "/" in the name that might cause this behavior. May I suggest checking with clearml-server 1.3 ?
Are you running the agent in docker mode? or venv mode ?
Can you manually ssh on port 10022 to the remote agent's machine?
`ssh -p 10022 root@agent_ip_here`
GreasyPenguin14 what's the clearml version you are using, OS & Python?
Notice this happens on the "connect_configuration" that seems to be called after the Task was closed, could that be the case ?
Would this be equivalent to an automated job submission from clearml to the cluster?
yes exactly
I am looking for a setup which allows me to essentially create the workers and start the tasks from a slurm script
hmm I see, basically the slurm Admins are afraid you will create a script that clogs the SLURM cluster, hence no automated job submission, so you want to use slurm as a "time on cluster" and then when your time is allocated, use clearml for the job submission, is that cor...
Hey, is it possible for me to upload a pdf as an artefact?
Sure, just point to the file and it will upload it for you 🙂
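For example, a minimal sketch of doing that from the SDK (the file name and project/task names are placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="pdf artifact")
# point upload_artifact at the local file; the PDF is uploaded and attached to the task
task.upload_artifact(name="report", artifact_object="report.pdf")
```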
@<1569496075083976704:profile|SweetShells3> remove these from your pbtxt:
```
name: "conformer_encoder"
platform: "onnxruntime_onnx"
default_model_filename: "model.bin"
```
Second, what do you have in your preprocess_encoder.py ?
And where are you getting the error? (Is it from the Triton container, or from the REST request?)
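For context, a bare-bones sketch of the shape a clearml-serving preprocess module usually takes (class and method names follow the clearml-serving examples; the body handling below is a placeholder and would need to match the conformer encoder's actual inputs/outputs):
```
from typing import Any


class Preprocess(object):
    def __init__(self):
        # no state needed for this sketch
        pass

    def preprocess(self, body: dict, state: dict, collect_custom_statistics_fn=None) -> Any:
        # turn the REST request body into the model input (placeholder key "audio")
        return body["audio"]

    def postprocess(self, data: Any, state: dict, collect_custom_statistics_fn=None) -> dict:
        # turn the raw model output into the REST response body
        return {"encoded": data.tolist() if hasattr(data, "tolist") else data}
```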
Hi EagerOtter28
The agent knows how to do the http->ssh conversion on the fly; in your clearml.conf (on the agent's machine) set force_git_ssh_protocol: true
https://github.com/allegroai/clearml-agent/blob/42606d9247afbbd510dc93eeee966ddf34bb0312/docs/clearml.conf#L25
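For reference, a minimal sketch of that setting in the agent's clearml.conf (only the relevant key is shown):
```
agent {
    # rewrite https:// repository links to ssh:// so the agent authenticates with its SSH key
    force_git_ssh_protocol: true
}
```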
and do you have import tensorflow in your code?
Out of curiosity, what ended up being the issue?
So in a simple "all-or-nothing"
Actually this is the only solution unless preemption is supported, i.e. abort running Task to free-up an agent...
There is no "magic" solution for complex multi-node scheduling, even SLURM will essentially do the same ...
Hi @<1547028074090991616:profile|ShaggySwan64>
. If I have a local repo cloned with ssh, the agent will attempt to replace the repo url with https,
Yes, if you provide git user/pass (or user / app-pass) the agent would automatically replace an ssh:// repo link with the equivalent https:// and use the user/pass for authentication
but it seems that it doesn't remove the 2222 port in my case. That leads to
Hmm, what's the clearml-agent version? if this is not the latest 2.0.0r...
But first I want to make sure the verify argument is actually used, hence False
That would be great! Might have to use `2>/dev/null` in some of my bash scripts
Feel free to test and PR :)
One other question regarding connecting. We have setup sshd inside the docker image we are using.
Actually the remote session opens port 10022 on the host machine (so it does not collide with the default ssh port)
It actually runs an additional sshd inside the docker, setting its port.
And the clearml-session will ssh directly into the container sshd...
Okay, verified, it won't work with the demo server. Give me a minute 🙂
GrittyKangaroo27 any chance you can open a GitHub issue so this is not forgotten ?
(btw: I think 1.1.6 is going to be released later today, then we will have a few RCs with improvements on the pipeline, I will make sure we add that as well)
Hi @<1547028074090991616:profile|ShaggySwan64>
I'm guessing just copying the data folder with rsync is not the most robust way to do that since there can be writes into mongodb etc.
Yep
Does anyone have experience with something like that?
basically you should just back up the 3 DBs (mongo, redis, elastic), each one based on their own backup workflow. Then just rsync the file server & configuration.
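To make that concrete, here is a rough sketch of what such a backup could look like for a docker-compose deployment (the container names, host paths, and backup destination are all assumptions; the Elasticsearch snapshot also requires a snapshot repository to be registered beforehand):
```
# MongoDB: use its native dump tool (assumed container name: clearml-mongo)
docker exec clearml-mongo mongodump --archive > /backups/clearml/mongo.dump

# Redis: force a save, then copy the resulting dump file (assumed container name: clearml-redis)
docker exec clearml-redis redis-cli SAVE
docker cp clearml-redis:/data/dump.rdb /backups/clearml/redis-dump.rdb

# Elasticsearch: take a snapshot via its snapshot API (snapshot repo must already exist)
curl -X PUT "localhost:9200/_snapshot/clearml_backup/snapshot_1?wait_for_completion=true"

# file server data + server configuration: plain rsync is fine here
rsync -a /opt/clearml/data/fileserver/ /backups/clearml/fileserver/
rsync -a /opt/clearml/config/          /backups/clearml/config/
```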
How about this one:
None
Copy-paste the trains.conf from any machine; it just needs the definition of the trains-server address.
Specifically if you run in offline mode, there is no need for the trains.conf and you can just copy the one on GitHub
BTW, this one seems to work ....
```
from time import sleep
from clearml import Task

Task.set_offline(True)
task = Task.init(project_name="debug", task_name="offline test")
print("starting")
for i in range(300):
    print(f"{i}")
    sleep(1)
print("done")
```
@<1524922424720625664:profile|TartLeopard58> @<1545216070686609408:profile|EnthusiasticCow4>
Notice that when you are spinning multiple agents on the same GPU, the Tasks should request the "correct" fractional GPU container, i.e. if they pick a "regular" container there will be no memory limit enforced.
So something like:
```
CLEARML_WORKER_NAME=host-gpu0a clearml-agent daemon --gpus 0 --docker clearml/fractional-gpu:u22-cu12.3-2gb
CLEARML_WORKER_NAME=host-gpu0b clearml-agent daemon --gpus 0 --docker clearml/fractional-gpu:u22-cu12.3-2gb
```
...
Hi GrotesqueDog77
What do you mean by share resources? Do you mean compute or storage?
Okay, let me see...
Hi @<1729309120315527168:profile|ShallowLion60>
ClearML in our case is installed on k8s using the helm chart (version: 7.11.0)
It should be done "automatically", I think there is a configuration var in the helm chart to configure that.
What urls are you seeing now, and what should be there?
was thinking that would delete the old weights from the file server once they get updated,
If you are uploading it to the same Task, make sure the model name and the filename are the same and it will overwrite it (think filesystem filenames)
but they are still there, consuming space. Is this the expected behavior? How can I get rid of those old files?
you can also programmatically remove (delete) models None
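For example, a minimal sketch of deleting a model (and its weights file) from code, assuming a clearml SDK version that provides Model.remove; the model id is a placeholder:
```
from clearml import Model

# look up the model by id (placeholder) and delete it together with its weights file
model = Model(model_id="<model-id>")
Model.remove(model, delete_weights_file=True)
```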