If you want each "main" process as a single experiment, just don't call Task.init in the scheduler
Can you fix locally, just to verify ?
Yes please 🙂
BTW: I originally thought the double quotes (in your PR) were also a bug, this is why I was asking, wdyt?
Shouldn't this be a real value and not a template
you mean value being pulled to the pod that failed ?
(once you verify the PR fix, I'll make sure it is merged)
MagnificentSeaurchin79
Do notice that the pipeline controller assumes you have an agent running
Oh task_id is the Task ID of step 2.
Basically the idea is, you run your code once (let's call it debugging / programming), that run creates a Task in the system, and the Task stores the environment definition and the arguments used. Then you can clone that Task and launch it on another machine using the Agent (which basically sets up the environment based on the Task definition and runs your code with the new arguments). The Pipeline is basically doing that for you (i.e. cloning a task chan...
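To make the clone-and-launch flow above concrete, here is a minimal sketch using the ClearML SDK. The task ID, queue name and the `lr` argument are hypothetical placeholders, and the `Args/` prefix assumes the arguments were captured from argparse; the `clearml` import is done lazily so the helper can be read (and tried) without a server.

```python
def as_args_section(overrides, section="Args"):
    """Prefix plain argument names with a section name, e.g. lr -> Args/lr.

    Assumption: the original run logged its arguments under the "Args" section.
    """
    return {f"{section}/{name}": value for name, value in overrides.items()}


def clone_and_enqueue(template_task_id, queue_name, overrides):
    """Clone an existing Task, change some arguments, and send it to an agent queue."""
    from clearml import Task  # lazy import: needs clearml installed and a configured server

    template = Task.get_task(task_id=template_task_id)
    cloned = Task.clone(source_task=template, name=f"{template.name} (clone)")
    cloned.set_parameters(as_args_section(overrides))
    # An agent listening on this queue will rebuild the environment and run the code
    Task.enqueue(cloned, queue_name=queue_name)
    return cloned.id
```

Something like `clone_and_enqueue("<task id>", "default", {"lr": 0.01})` would then do by hand roughly what a pipeline step does for you.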
Hi SmugOx94
Hmm are you creating the environment manually, or is it done by Task.init ?
(Basically Task.init will store the entire environment of conda, and if the agent is working with conda package manager it will use it to restore it)
https://github.com/allegroai/clearml-agent/blob/77d6ff6630e97ec9a322e6d265cd874d0ab00c87/docs/clearml.conf#L50
(This code sample should work on your setup with your installed packages without a problem)
Not really 😞
Everyone can do everything, the idea is sharability and accessibility.
I do know that in the paid tier they have full access control, roles, SSO etc., but unfortunately it's way too complicated for the open-source.
Basically what I'm saying is trust your fellow colleagues 🙂
Hi @<1610083503607648256:profile|DiminutiveToad80>
This sounds like the wrong container ? I think we need some more context here
Send the full Task log, you can DM it if it is easier
This points to the wrong cu117 / driver - could that be?
Hi SmallDeer34
Did you call Task.init ?
can you see these metrics on TB ?
I'm just trying to see what is the default server that is set, and is it responsive
I'm assuming you mean your own server, not the demo server, is that correct ?
and then second part is to check if it is up and alive
Yes, you can curl to the ping endpoint:
https://clear.ml/docs/latest/docs/references/api/debug#post-debugping
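If curl is not handy, the same check can be sketched in plain Python. This is only a rough sketch: the `http://localhost:8008` address in the usage note is an assumption (replace it with your own API server), and the function simply reports whatever HTTP status comes back from `debug.ping`.

```python
import urllib.error
import urllib.request


def ping_url(api_server):
    """Build the debug.ping URL for a given API server address."""
    return api_server.rstrip("/") + "/debug.ping"


def server_status(api_server, timeout=5.0):
    """POST to debug.ping; return the HTTP status code, or None if unreachable."""
    request = urllib.request.Request(
        ping_url(api_server),
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, just not with a success code
        return error.code
    except OSError:
        # Connection refused, DNS failure, timeout, etc.
        return None
```

For example `server_status("http://localhost:8008")` returning `None` would mean nothing is answering at that address at all.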
but when I removed output_uri from Task.init, the pickled model has path
When you run the job on the k8s pod?
I think that what you need is to create an OutputModel , then call update weights file when you have the better model, this will also allow you to tag the model object. Would that help? Or would it make sense to use Task.models and count on the auto logging?
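A rough sketch of the OutputModel idea, assuming a "higher score is better" metric. The project/task names, the `best.pt` filename, the `"best"` tag and the scores are all made up for illustration; only `OutputModel(...)` and `update_weights(...)` are actual SDK calls.

```python
def update_if_better(output_model, weights_path, score, best_score):
    """Upload new weights only when the score improved; return the best score so far."""
    if best_score is not None and score <= best_score:
        return best_score
    output_model.update_weights(weights_filename=weights_path)
    return score


def main():
    # Needs clearml installed and a configured server; names below are placeholders
    from clearml import OutputModel, Task

    task = Task.init(project_name="examples", task_name="manual model upload")
    model = OutputModel(task=task, framework="PyTorch", tags=["best"])

    best = None
    for weights_path, score in [("best.pt", 0.71), ("best.pt", 0.69), ("best.pt", 0.80)]:
        # In a real loop the weights file would be saved by your training code first
        best = update_if_better(model, weights_path, score, best)
```

The model object is created once and only its weights are replaced, so the tag stays attached to the same model entry.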
VexedCat68 are you manually creating the OutputModel object?
Thank you @<1689446563463565312:profile|SmallTurkey79> !!!
Check on which queue the HPO puts the Tasks, and if the agent is listening to these queues
Out of curiosity, if Task flush worked, when did you get the error, at the end of the process ?
Hi ScantChimpanzee51
btw: this seems like an S3 internal error
https://github.com/boto/s3transfer/issues/197
but DS in order for models to be uploaded, you still have to set: output_uri=True in the
No, if you set the default_output_uri, there is no need to pass output_uri=True in the Task.init() 🙂
It is basically setting it for you, make sense ?
when I run it on my laptop...
Then yes, you need to set the default_output_uri in your laptop's clearml.conf (just like you set it on the k8s glue)
Make sense ?
thought the agent created a new conda env and installed all packages
It does, but I was asking what is written on the original Task (the one created when you executed the code on your laptop, not when the agent was executing it). When the agent executes the Task, it writes back all the packages of the entire venv it created; when the Task is run manually, it lists only the packages you import directly (i.e. from package or import package; it actually analyses the code)
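To give a feel for what "analyses the code" can mean, here is a toy illustration using only the standard library. It is not ClearML's actual implementation, just a sketch of collecting the top-level packages a script imports directly.

```python
import ast


def directly_imported_packages(source):
    """Return the sorted top-level package names imported by a piece of source code."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import os.path" -> "os"
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import local"
            if node.level == 0 and node.module:
                packages.add(node.module.split(".")[0])
    return sorted(packages)
```

A script that only does `import numpy as np` would list numpy here, even if the surrounding venv has hundreds of other packages installed, which is the difference between the manual run and the agent run described above.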
My point...
SparklingHedgehong28 this is actually quite cool! Still not sure why not just use the built in autoscaler https://github.com/allegroai/clearml/tree/master/examples/services/aws-autoscaler , but it is a really cool usage of ASG 🤩