GreasyPenguin14 what's the clearml version you are using, OS & Python ?
Notice this happens on the "connect_configuration" that seems to be called after the Task was closed, could that be the case ?
Would this be equivalent to an automated job submission from clearml to the cluster?
yes exactly
I am looking for a setup which allows me to essentially create the workers and start the tasks from a slurm script
hmm I see, basically the slurm Admins are afraid you will create a script that clogs the SLURM cluster, hence no automated job submission. So you want to use slurm as "time on cluster" and then, when your time is allocated, use clearml for the job submission, is that cor...
Hey, is it possible for me to upload a pdf as an artefact?
Sure, just point to the file and it will upload it for you 🙂
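A minimal sketch (the project / task / file names here are just placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="pdf artifact")
# passing a file path uploads the file itself as an artifact
task.upload_artifact(name="report", artifact_object="report.pdf")
```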
@<1569496075083976704:profile|SweetShells3> remove these from your pbtxt:
```
name: "conformer_encoder"
platform: "onnxruntime_onnx"
default_model_filename: "model.bin"
```
Second, what do you have in your preprocess_encoder.py ?
And where are you getting the Error? (is it from the triton container? or from the Rest request?)
Hi EagerOtter28
The agent knows how to do the http->ssh conversion on the fly, in your clearml.conf (on the agent's machine) set `force_git_ssh_protocol: true`
https://github.com/allegroai/clearml-agent/blob/42606d9247afbbd510dc93eeee966ddf34bb0312/docs/clearml.conf#L25
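For reference, a sketch of how that setting looks in the agent's clearml.conf:
```
agent {
    # convert http(s) git urls to ssh:// on the fly when cloning
    force_git_ssh_protocol: true
}
```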
and do you have import tensorflow in your code?
Out of curiosity, what ended up being the issue?
So in a simple "all-or-nothing"
Actually this is the only solution unless preemption is supported, i.e. abort running Task to free-up an agent...
There is no "magic" solution for complex multi-node scheduling, even SLURM will essentially do the same ...
Hi @<1547028074090991616:profile|ShaggySwan64>
. If I have a local repo cloned with ssh, the agent will attempt to replace the repo url with https,
Yes, if you provide git user/pass (or user / app-pass) the agent will automatically replace an ssh:// repo link with the equivalent https:// and use the user/pass for authentication
but it seems that it doesn't remove the 2222 port in my case. That leads to
Hmm... what's the clearml-agent version? if this is not the latest 2.0.0r...
But first I want to make sure the verify argument is actually used, hence False
That would be great! Might have to use `2>/dev/null` in some of my bash scripts
Feel free to test and PR :)
One other question regarding connecting. We have set up sshd inside the docker image we are using.
Actually the remote session opens port 10022 on the host machine (so it does not collide with the default ssh port)
It actually runs an additional sshd inside the docker, setting its port.
And the clearml-session will ssh directly into the container sshd...
Okay verified, it won't work with the demo server. give me a minute 🙂
GrittyKangaroo27 any chance you can open a GitHub issue so this is not forgotten ?
(btw: I think 1.1.6 is going to be released later today, then we will have a few RCs with improvements on the pipeline, I will make sure we add that as well)
Hi @<1547028074090991616:profile|ShaggySwan64>
I'm guessing just copying the data folder with rsync is not the most robust way to do that since there can be writes into mongodb etc.
Yep
Does anyone have experience with something like that?
basically you should just backup the 3 DBs (mongo, redis, elastic) each one based on their own backup workflows. Then just rsync the files server & configuration.
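If you want the simplest all-or-nothing alternative, here is a rough sketch assuming the default docker-compose deployment with everything under /opt/clearml (paths and file names are examples, adjust to your setup): stop the server so the DB files are consistent on disk, archive data + config, then bring it back up.
```
# stop the server so the mongo / elastic / redis files are consistent on disk
docker-compose -f /opt/clearml/docker-compose.yml down

# archive the databases + fileserver data, and the configuration
sudo tar czf clearml_data_backup.tgz -C /opt/clearml/data .
sudo tar czf clearml_config_backup.tgz -C /opt/clearml/config .

# bring the server back up
docker-compose -f /opt/clearml/docker-compose.yml up -d
```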
How about this one:
None
copy paste the trains.conf from any machine, it just needs the definition of the trains-server address.
Specifically if you run in offline mode, there is no need for the trains.conf and you can just copy the one on GitHub
BTW, this one seems to work ....
```
from time import sleep
from clearml import Task

Task.set_offline(True)
task = Task.init(project_name="debug", task_name="offline test")
print("starting")
for i in range(300):
    print(f"{i}")
    sleep(1)
print("done")
```
@<1524922424720625664:profile|TartLeopard58> @<1545216070686609408:profile|EnthusiasticCow4>
Notice that when you are spinning multiple agents on the same GPU, the Tasks should request the "correct" fractional GPU container, i.e. if they pick a "regular" container there is no mem limit.
So something like:
```
CLEARML_WORKER_NAME=host-gpu0a clearml-agent daemon --gpus 0 clearml/fractional-gpu:u22-cu12.3-2gb
CLEARML_WORKER_NAME=host-gpu0b clearml-agent daemon --gpus 0 clearml/fractional-gpu:u22-cu12.3-2gb
```
...
Hi GrotesqueDog77
What do you mean by share resources? Do you mean compute or storage?
Okay, let me see...
Hi @<1729309120315527168:profile|ShallowLion60>
Clearml in our case is installed on k8s using the helm chart (version: 7.11.0)
It should be done "automatically", I think there is a configuration var in the helm chart to configure that.
What urls are you seeing now, and what should be there?
was thinking that would delete the old weights from the file server once they get updated,
If you are uploading it to the same Task, make sure the model name and the filename is the same and it will override it (think filesystem filenames)
but they are still there, consuming space. Is this the expected behavior? How can I get rid of those old files?
you can also programmatically remove (delete) models None
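A rough sketch of doing that, assuming a reasonably recent clearml SDK (project / model names below are placeholders):
```
from clearml import Model

# find the stale model entries (names are just examples)
for m in Model.query_models(project_name="my_project", model_name="old_weights"):
    # remove the model entry and the weights file it points to
    Model.remove(m, delete_weights_file=True)
```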
with conda ?!
Done HandsomeCrow5 +1 added 🙂
btw: if you feel you can share what your reports look like (a screenshot is great), this will greatly help in supporting this feature, thanks
…every user in the server has the same credentials, and they don't need to know them.. makes sense?
Make sense, single credentials for everyone, without the need to distribute
Is that correct?
Not really 🙂
Everyone can do everything, the idea is sharability and accessibility.
I do know that in the paid tier they have full access control, roles, SSO etc, but unfortunately it's way too complicated for the open-source.
Basically what I'm saying is trust your fellow colleagues 🙂
but now since `Task.current_task()` doesn't work on the pipeline object we have a serious problem
How is that possible ?
Is there a small toy code that can reproduce it ?