Hmm, from here: None
Could it be that you do not have access privileges for the resource, or that you did not provide credentials?
Did that autoscaler work before?
Thanks for the logs, AdorableDeer85
Notice that the log you attached shows that the preprocessing is executed and the GPU backend is returning an error.
Could you provide the docker compose log? Specifically, the interesting part is the Triton container; I want to verify it loads the model properly.
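For reference, one way to pull just that container's log might be (the service name here is an assumption based on a default clearml-serving docker-compose file, adjust it to your setup):
docker compose logs --tail=200 clearml-serving-triton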
AttributeError: 'NoneType' object has no attribute 'base_url'
Can you print the model object?
(I think the error is a bit cryptic, but generally it might be that the model is missing an actual URL link?)
print(model.id, model.name, model.url)
TroubledHedgehog16 generally speaking you can expect about 10 API calls per minute if you have many reports, and about 3 per minute with low reporting. We just optimized the SDK so that in cases where there are lots of consecutive reports they are batched better; I would recommend the latest RC
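If it helps, a hedged way to pull the latest pre-release with pip (assuming you are not pinning a specific version):
pip install --upgrade --pre clearml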
Hi RoundMosquito25
This is a bit old but probably a good start:
https://clear.ml/blog/stacking-up-against-the-competition/
tl;dr
ClearML advantages (at least a few I can think of)
Scales way better
Enables out-of-the-box experiment orchestration (i.e. remote execution etc.)
Data management
Nicer UI
Full RestAPI
Full MLOps platform
Model serving
Query-able model repository
Probably more 🙂
this is very odd, can you post the log?
it fails during the add_step stage for the very first step, because task_overrides contains invalid keys
I see, yes I guess it makes sense to mark the pipeline as Failed 🙂
Could you add a GitHub issue on this behavior, so we do not miss it?
SmarmySeaurchin8
When running in "dev" mode (i.e. writing the code) only packages imported directly are registered under "installed packages", then when the agent is executing the experiment, it will update back the entire environment (including derivative packages etc.)
That said, you can set detect_with_pip_freeze to true (in trains.conf) and it will basically store the entire pip freeze.
https://github.com/allegroai/trains/blob/f8ba0495fb3af1f99732fdffbbccd2fa992934a4/docs/trains.c...
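A minimal sketch of that setting (assuming the usual sdk.development section layout; verify against the linked trains.conf reference):
sdk {
    development {
        # store the full pip freeze output instead of only directly imported packages
        detect_with_pip_freeze: true
    }
}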
GiganticTurtle0 is it in the same repository?
If it is, it should have detected the fact that it needs to analyze the entire repository (not just the standalone script) and then discovered tensorflow.
Long story short, the Task requirements are async, so if one puts it after creating the object (at least in theory), it might be too late.
Make sense ?
Meanwhile check CreateFromFunction(object).create_task_from_function(...)
It might be better suited than execute_remotely for your specific workflow 🙂
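A rough, non-authoritative sketch of what that could look like (the import path and argument names are assumptions, verify them against the clearml version you are running):

# sketch only; import path and argument names are assumptions, check your clearml version
from clearml.backend_interface.task.populate import CreateFromFunction

def my_step(x=1):
    return x * 2

# creates a draft Task that will run my_step when picked up by an agent
task = CreateFromFunction.create_task_from_function(
    a_function=my_step,
    project_name="examples",            # hypothetical project name
    task_name="my_step from function",  # hypothetical task name
)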
FlutteringWorm14 Can you verify that even with the clearml.conf it has no effect?
RoundMosquito25 are you using clearml-agent daemon --stop or are you killing them?
Killing them basically means you lose them in the UI when they time out; the backend does not see them for 10 min so it assumes they died. When you call clearml-agent daemon --stop they will unregister themselves and disappear immediately
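For reference, the graceful shutdown is just the following (if you run multiple daemons, you may need to pass the same identifying flags the daemon was started with):
clearml-agent daemon --stop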
Hi SteadyFox10 the way it works is that Trains limits the debug image history by reusing the same file names, so the UI will only present the iterations for which the debug images are relevant. With your sample code it looks like it exposes a bug: the generated link should contain the iteration number, but it does not, so it overwrites the debug images every iteration. Here is the image link: https://demofiles.trains.allegro.ai/Test/test_images.6ed32a2b5a094f2da47e6967bba1ebd0/metrics/Test/te...
Guys, any chance you can verify the RC solves the issue?
pip install clearml==1.0.2rc0
One suggestion is to make sure all agents have the same configuration. Another is to add pip into the "installed packages" section.
(Notice that in the next release we will specifically include it there, to avoid this kind of scenario)
TrickySheep9 is this a conda package or a wheel you are installing manually?
The other way around: "8011:8008"
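i.e. in the docker-compose ports mapping the order is host:container, so a sketch of the relevant snippet (the surrounding service definition is whatever you already have):
ports:
  - "8011:8008"   # host port 8011 -> container port 8008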
Is gpu_0_utilization also in % then?
Correct 🙂
I was trying to find what the min and max values are for the above metrics.
Oh that makes sense, notice that you can get the values over time, so you can track the usage over the experiment lifetime (you can of course see it in the Scalar tab of the experiment)
How much free RAM / disk do you have there now? How's the CPU utilization? How many Tasks are working with this machine at the same time?
What do you mean by "tag" / "sub-tags"?
Omg that's a lot of submodules!
It has nothing to do with what the task sees; if you are inside a git repo you will have to clone it on the remote machine. Let me check in the code, maybe there is a workaround for you
it fails but with COMPLETED status
Which Task is marked "completed", the pipeline Task or the Step?