(So it re-reads the configuration file)
Hi ClumsyElephant70
So do you need both requirements.txt files combined?
How will the agent be able to reproduce both repos on the remote machine?
Hi DilapidatedCow43
I'm assuming the returned object cannot be pickled (which is ClearML's way of serializing it)
You can upload it as a model with:
```python
from clearml import Task

# upload the locally saved model file and register it as this task's output model
uploaded_model_url = Task.current_task().update_output_model(model_path="/path/to/local/model")
...
return uploaded_model_url
```
wdyt?
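On the retrieving side, a minimal sketch of pulling such a model back with InputModel (the model ID is a placeholder, taken from the UI or the producing task):
```python
from clearml import InputModel

# "<model-id>" is hypothetical; copy the real ID from the UI or the producing task
model = InputModel(model_id="<model-id>")
local_path = model.get_local_copy()  # downloads the model file to a local cache
```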
using only a subset of the features
ShallowGoldfish8 if you have some parameter that controls it (i.e. select different features) then you can launch it with two sets of parameters.
Am I missing something?
for example:
```python
my_features_select = {"type": "set_a"}
Task.current_task().connect(my_features_select)

if my_features_select["type"] == "set_a":
    ...  # do something
else:
    ...  # do something else
```
wdyt?
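And to launch the second variant, a sketch of cloning the task and overriding the connected parameter (project/task/queue names are hypothetical; keys of a connected dict land under the General section):
```python
from clearml import Task

# clone the original task and override the connected "type" parameter
template = Task.get_task(project_name="examples", task_name="feature-selection")
cloned = Task.clone(source_task=template, name="feature-selection set_b")
cloned.set_parameters({"General/type": "set_b"})
Task.enqueue(cloned, queue_name="default")
```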
That's not possible, right?
That's actually what "start_locally" does, but the missing part is starting it on another machine without the agent (it's totally doable, and if it's important I can explain how, but this is probably not what you are after)
I really need to have a dummy experiment pre-made and have the agent clone the code, set up the env and run everything?
The agent caches everything, and can actually also just skip installing the env entirely, which would mean ...
TrickyRaccoon92
I guess elegant is the challenge 🙂
What exactly is the use case ?
That is quite neat! You can also put a soft link from the main repo to the submodule for better visibility
Hmm ElegantKangaroo44, low memory might explain the behavior.
BTW: 1 == stop request, 3 == Task aborted/failed.
Which makes sense if it crashed on low memory...
Hi FiercePenguin76
Maybe it makes sense to use `schedule_function`
I think you are correct. This means the easiest would be to schedule a function, and have that function do the Task cloning/enqueuing. wdyt?
As a side note, maybe we should have the ability to pass a custom function that returns a Task ID. The main difference is that the Task ID that was created would be better logged / visible (as opposed to the schedule_function, where the fact there was a Task that was created / ...
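For reference, a minimal sketch of that schedule_function approach using ClearML's TaskScheduler (the template task ID, queue names, and the daily cadence are assumptions for illustration):
```python
from clearml import Task
from clearml.automation import TaskScheduler

def clone_and_enqueue():
    # clone a pre-made template task and push the copy into an execution queue
    cloned = Task.clone(source_task="<template-task-id>")  # placeholder ID
    Task.enqueue(cloned, queue_name="default")

scheduler = TaskScheduler()
# intended: run daily at 07:30 (check add_task's docstring for the exact cron semantics)
scheduler.add_task(schedule_function=clone_and_enqueue, hour=7, minute=30)
# run the scheduler itself as a long-lived task on the services queue
scheduler.start_remotely(queue="services")
```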
Specifically for this one, this is the auto-generated docstring from the actual code, so a PR to the code itself would fix it:
https://github.com/allegroai/clearml/blob/e53a76b713910adaf87578c69e86f8154d4ab4c1/clearml/logger.py#L152
Oh I see, that kind of makes sense
I think this is the section you should use:
None
But instead of the clearml-services container you should use the regular container (or just have it installed as part of the entry-point on any Ubuntu-based container)
Notice the important parts here are:
[None](https://github.com/allegroai/clearml-server/blob/6a1fc04d1e8b112fb334c8743d...
Do you have a link on how to set up a task scheduler to run in services mode in k8s?
basically spin the agent pod and add an argument to the agent itself (this is the `--services-mode` flag)
https://clear.ml/docs/latest/docs/clearml_agent#services-mode
The problem is that clearml installs `cudatoolkit=11.0` but `cudatoolkit=11.1` is needed.
You suggested this fix earlier, but I am not sure why it didn't work then.
Hmm, could you test with clearml-agent 0.17.2? Making sure this actually solves the problem.
Hi MammothGoat53
Do you mean working with the REST API directly?
https://clear.ml/docs/latest/docs/references/api/events
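If so, a minimal sketch using the SDK's APIClient wrapper around that REST API (the choice of the events.get_task_log endpoint and the task ID are assumptions for illustration):
```python
from clearml.backend_api.session.client import APIClient

client = APIClient()  # picks up credentials from clearml.conf / environment
# fetch console log events reported for a task ("<task-id>" is a placeholder)
res = client.events.get_task_log(task="<task-id>")
for event in res.events:
    print(event)
```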
Okay, could you test with `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/.singularity.d/libs/` ?
I wonder if the try/except approach would work for the XGBoost load; could we just try a few classes one after the other?
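Something like this minimal sketch, assuming the file can come from any of a few likely classes (the exact class list is an assumption):
```python
import xgboost as xgb

def try_load_xgboost(path):
    # try a few likely XGBoost classes one after the other until one loads
    for cls in (xgb.XGBClassifier, xgb.XGBRegressor, xgb.Booster):
        try:
            model = cls()
            model.load_model(path)
            return model
        except Exception:
            continue
    raise ValueError("could not load an XGBoost model from {}".format(path))
```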
Thread is discussed here: None
Hi @<1645597514990096384:profile|GrievingFish90>
You mean the agent itself runs inside a docker, and then the agent spins sibling dockers for the Tasks?
The release was supposed to be out this week but got delayed by a py2 support issue; anyhow, the release will be almost exactly like the latest we now have on the GitHub repo (and I'm assuming it will be out just after the weekend)
the issue was related to task.connect being called multiple times I guess.
This is odd?! How would that affect the crash?
Do notice that when you connect objects, each time you call connect you are basically deserializing the configuration from the backend into the code; maybe this somehow affected the object?
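To make that concrete, a sketch of what repeated connect calls can do to the same dict (project/task names and values are hypothetical; the write-back applies when the backend holds different values, e.g. when running via an agent):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="connect-twice")

config = {"lr": 0.1}
task.connect(config)  # 1st call: registers the dict; under an agent, values
                      # stored in the backend are written back into the dict
config["lr"] = 0.05   # local change made after the first connect
task.connect(config)  # 2nd call: deserializes backend state into the dict
                      # again, which can silently undo the local change above
```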
Hmm seems like everything is working. Can you check in the UI if you see the serving session ID in the DevOps project? Maybe there are two, and you configured one and the docker-compose is running another?
Hi IrritableGiraffe81
Yes it deploys all ClearML (including web).
ClearML-serving unfortunately is a bit more complicated to spin, as it needs actual compute nodes.
That said we are working on making it a lot easier 🙂
In my understanding requests still go through `clearml-server`, whose configuration I left
DefiantHippopotamus88 actually this is not correct.
clearml-server only acts as a control plane; no actual requests are routed to it. It is used to sync model state, stats, etc., and is not part of the request processing flow itself.
curl: (56) Recv failure: Connection reset by peer
This actually indicates port 9090 is not being listened on...
What's the final docker-compose you are usi...
it looks like nvidia is going to come up with a UI for TAO too
Interesting, any reference we could look at ?
Hmm interesting, I guess once you are able to connect it with ClearML you can just clone / modify / enqueue and let users train models directly from the UI on any hardware. Is that the plan?
Hi EagerOtter28
The agent knows how to do the http->ssh conversion on the fly; in your clearml.conf (on the agent's machine) set `force_git_ssh_protocol: true`
https://github.com/allegroai/clearml-agent/blob/42606d9247afbbd510dc93eeee966ddf34bb0312/docs/clearml.conf#L25