Okay, verified: it won't work with the demo server. Give me a minute 🙂
GrittyKangaroo27 any chance you can open a GitHub issue so this is not forgotten ?
(btw: I think 1.1.6 is going to be released later today, then we will have a few RCs with improvements on the pipeline; I will make sure we add that as well)
Hi @<1547028074090991616:profile|ShaggySwan64>
I'm guessing just copying the data folder with rsync is not the most robust way to do that since there can be writes into mongodb etc.
Yep
Does anyone have experience with something like that?
Basically you should just back up the three DBs (MongoDB, Redis, Elasticsearch), each one based on its own backup workflow. Then just rsync the fileserver data & configuration.
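As a rough ops sketch of the above (all paths, hosts, and the snapshot repository name are assumptions about your deployment; it obviously needs the live services to actually run):

```shell
#!/bin/sh
# Sketch of a ClearML server backup: dump each DB with its own native tool,
# then rsync the fileserver data and configuration.
BACKUP_DIR=/backup/clearml/$(date +%F)
mkdir -p "$BACKUP_DIR"

# MongoDB: consistent dump via mongodump
mongodump --host localhost --port 27017 --out "$BACKUP_DIR/mongo"

# Redis: trigger a snapshot, then copy the RDB file
redis-cli SAVE
cp /var/lib/redis/dump.rdb "$BACKUP_DIR/redis-dump.rdb"

# Elasticsearch: use the snapshot API (the repo "backup_repo" must be
# registered beforehand)
curl -X PUT "localhost:9200/_snapshot/backup_repo/snap_$(date +%s)?wait_for_completion=true"

# Fileserver data + configuration
rsync -a /opt/clearml/data/fileserver/ "$BACKUP_DIR/fileserver/"
rsync -a /opt/clearml/config/ "$BACKUP_DIR/config/"
```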
How about this one:
None
Copy-paste the trains.conf from any machine; it just needs the definition of the trains-server address.
Specifically if you run in offline mode, there is no need for the trains.conf and you can just copy the one on GitHub
BTW, this one seems to work ....
```python
from time import sleep
from clearml import Task

Task.set_offline(True)
task = Task.init(project_name="debug", task_name="offline test")
print("starting")
for i in range(300):
    print(f"{i}")
    sleep(1)
print("done")
```
@<1524922424720625664:profile|TartLeopard58> @<1545216070686609408:profile|EnthusiasticCow4>
Notice that when you are spinning up multiple agents on the same GPU, the Tasks should request the "correct" fractional GPU container, i.e. if they pick a "regular" container there will be no memory limit.
So something like
```
CLEARML_WORKER_NAME=host-gpu0a clearml-agent daemon --gpus 0 clearml/fractional-gpu:u22-cu12.3-2gb
CLEARML_WORKER_NAME=host-gpu0b clearml-agent daemon --gpus 0 clearml/fractional-gpu:u22-cu12.3-2gb
...
```
Hi GrotesqueDog77
What do you mean by share resources? Do you mean compute or storage?
Okay, let me see...
Hi @<1729309120315527168:profile|ShallowLion60>
ClearML in our case is installed on k8s using the Helm chart (version 7.11.0)
It should be done "automatically", I think there is a configuration var in the helm chart to configure that.
What URLs are you seeing now, and what should be there?
was thinking that would delete the old weights from the file server once they get updated,
If you are uploading it to the same Task, make sure the model name and the filename are the same, and it will override it (think filesystem filenames)
but they are still there, consuming space. Is this the expected behavior? How can I get rid of those old files?
you can also programmatically remove (delete) models None
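The "think filesystem filenames" point above can be illustrated with a plain-Python toy (no ClearML involved, all names made up): reusing one fixed filename overwrites the previous file, while unique names accumulate and keep consuming space.

```python
import os
import tempfile

# Toy illustration: saving under one fixed name overwrites the old file,
# while unique (e.g. step-suffixed) names pile up on disk.
d = tempfile.mkdtemp()
for step in (1, 2, 3):
    with open(os.path.join(d, "model.pt"), "w") as f:   # same name every time
        f.write(f"weights@{step}")
    with open(os.path.join(d, f"model_{step}.pt"), "w") as f:  # unique names
        f.write(f"weights@{step}")

print(sorted(os.listdir(d)))
# ['model.pt', 'model_1.pt', 'model_2.pt', 'model_3.pt']
```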
with conda ?!
Done HandsomeCrow5 +1 added 🙂
btw: if you feel you can share what your reports look like (a screenshot is great), this will greatly help in supporting this feature, thanks
…every user in the server has the same credentials, and they don’t need to know them... makes sense?
Make sense, single credentials for everyone, without the need to distribute
Is that correct?
Not really 😞
Everyone can do everything, the idea is sharability and accessibility.
I do know that in the paid tier they have full access control, roles, SSO etc., but unfortunately it's way too complicated for the open-source.
Basically what I'm saying is trust your fellow colleagues 🙂
but now since
Task.current_task()
doesn't work on the pipeline object we have a serious problem
How is that possible ?
Is there a small toy code that can reproduce it ?
If you create an initial code base maybe we can merge it?
works seamlessly throughout and in our current on premise servers...
I'm assuming via something close to what I suggested above with .netrc ?
Sigint (ctrl c) only
Because flushing the state (i.e. sending requests) might take time, we only do that when users interactively hit Ctrl-C. Make sense?
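A generic sketch of that pattern (not ClearML internals; the names are made up): register a SIGINT handler that takes the time to flush pending state, on the assumption that an interactive Ctrl-C means the user can wait a moment.

```python
import os
import signal

state = {"flushed": False}

def flush_pending_requests():
    # Stand-in for the potentially slow "send everything to the server" step.
    state["flushed"] = True

def on_sigint(signum, frame):
    # Only on an interactive Ctrl-C (SIGINT) do we take the time to flush;
    # other termination paths would skip this and exit quickly.
    flush_pending_requests()

signal.signal(signal.SIGINT, on_sigint)

# Simulate the user hitting Ctrl-C:
os.kill(os.getpid(), signal.SIGINT)
print(state["flushed"])
```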
Hi @<1546303293918023680:profile|MiniatureRobin9> could it be the pipeline logic is created via the clearml-task CLI? If this is the case, I think this is an edge case we should fix. Basically it creates a Task instead of a pipeline, which in essence only affects the UI. To solve it, just run the pipeline locally; notice that by default when you start it, it will actually stop the local run and relaunch itself on an agent.
Also, could you open a GitHub issue so we add a flag for it?
Hi @<1689446563463565312:profile|SmallTurkey79>
This call is to set an existing (already created) Task's requirements. Since it was just created, it waits for the automatic package detection before overriding it.
What you want is Task.force_requirements_env_freeze (notice it is class level, so it needs to be called before Task.init):
```python
Task.force_requirements_env_freeze(requirements_file="requirements.txt")
task = Task.init(...)
```
PlainSquid19 I will also look into it as well.
maybe for some reason model.keras_model.save_weights is not caught ...
SweetGiraffe8 no need to import it, any report to TB is automatically logged by ClearML 🙂
Hi @<1523708920831414272:profile|SuperficialDolphin93>
The error seems like NVML fails to initialize inside the container; you can test it with nvidia-smi and check if that works
Regarding the CUDA version, ClearML Serving inherits from the Triton container. Could you try to build a new one with the latest Triton container (I think 25)? The docker compose is in the clearml-serving git repo. wdyt?
Where would I put these credentials? I don't want to expose them in the logs as environmental variable or hard code them.
Hi GleamingGrasshopper63
So basically you need a vault, to store those credentials...
Unfortunately the open-source version does not contain vault support, but the paid tiers scale/enterprise do.
There you can have an environment variable defined in the vault, that each time the agent runs your code, it will pull it from the vault and set it on your process. wdyt ?
os.system
Yes, that's the culprit. It actually runs a new process, and clearml assumes there are no other scripts in the repository that are used, so it does not analyze them
A few options:
1. Manually add the missing requirement with Task.add_requirements('package_name') ; make sure you call it before the Task.init
2. Import the second script from the first script. This will tell clearml to analyze it as well.
3. Force clearml to analyze the whole repository: https://g...