
when I add repo="." to the definition of all my component decorators it works (but not the pipeline decorator), but it doesn’t work without that part… The problem I’m having now is that my components hang when executed in the cluster… I have 2 agents deployed (default and services queues).
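For reference, a minimal sketch of what I mean; the step body, queue and project names below are placeholders, not my actual pipeline:
from clearml.automation.controller import PipelineDecorator

# repo="." makes the component use the repository the pipeline is launched from;
# without it the components could not resolve the repo in my setup
@PipelineDecorator.component(repo=".", execution_queue="default", return_values=["out_path"])
def preprocess(data_path):
    # placeholder step body
    return data_path

# I only needed repo="." on the components, not on the pipeline decorator
@PipelineDecorator.pipeline(name="example-pipeline", project="examples", version="0.0.1")
def run_pipeline(data_path="/tmp/data"):
    return preprocess(data_path)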
I think so, but I haven’t investigated what the problem is exactly; I’ll report it, though.
the components start hanging indefinitely right after printing Starting Task Execution
no worries @<1523701205467926528:profile|AgitatedDove14>
Huh, I see. Thanks for your answers. How difficult would it be to implement some way to automatically infer repository information for components, or to have a flag repo_inherit (or similar) when defining a component (which would inherit repository information from the controller)? My workflow is based around executing code that lives in the same repository, so it’s cumbersome having to specify repository information all over the place and to change the commit hash as I add new code.
@<1523701205467926528:profile|AgitatedDove14> I managed to fix the issue FYI. I replaced from clearml import PipelineDecorator
with from clearml.automation.controller import PipelineDecorator
and it suddenly works. What a weird issue.
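In code form, the one-line change that fixed it for me (the rest of the pipeline definition stayed the same):
# this import caused the components to hang right after "Starting Task Execution":
# from clearml import PipelineDecorator

# importing it from the controller module instead made the pipeline run normally:
from clearml.automation.controller import PipelineDecorator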
I know I can configure the “pod template”, but I’m looking for a solution where users can set their own variables without modifying Kubernetes secrets.
Awesome @<1729671499981262848:profile|CooperativeKitten94> , will definitely add that. It would also be very helpful if there was a way to delay deleting "completed/failed" pods. This is useful when something fails unexpectedly and ClearML logs are not enough to debug the issue. Does that make sense to you? I could contribute to your codebase if you're interested.
Here’s how I do it using the clearml.conf config for my agent:
sdk {
  aws {
    s3 {
      ...
    }
  }
  development {
    default_output_uri: ""
  }
}
@<1523701205467926528:profile|AgitatedDove14> for me it hasn’t worked when I specified agentk8sglue.queue: "queue1,queue2" in the Helm chart options, which should be possible according to the documentation. What also hasn’t worked is the flag for creating a queue if it doesn’t exist (agentk8sglue.createQueueIfNotExists). Both failed parsing at runtime, so those are 2 bugs I’d say.
This hasn’t worked for me either; I use multiple queues instead. Another reason I use multiple queues is that I need to specify different resource requirements for pods launched by each queue (CPU-only vs GPU).
I don’t use datasets so I don’t know, sorry, maybe @<1523701087100473344:profile|SuccessfulKoala55> can help
@<1576381444509405184:profile|ManiacalLizard2> but the task controller has access to that information. Before deleting the pod, it could retrieve the exit code and status message that all pods provide, and log it under "Info" section in ClearML.
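Roughly what I have in mind, as a sketch only (this is not the agent’s actual code; the kubernetes/clearml calls, namespace and property names here are my assumptions about how it could be wired up):
from kubernetes import client, config
from clearml import Task

config.load_incluster_config()  # assuming this runs inside the cluster
v1 = client.CoreV1Api()

def log_pod_exit(task_id, pod_name, namespace="clearml"):
    # read the pod status before the pod gets deleted
    pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
    for cs in pod.status.container_statuses or []:
        term = cs.state.terminated
        if term is None:
            continue
        # surface the exit code / reason (e.g. 137 / OOMKilled) on the ClearML task
        task = Task.get_task(task_id=task_id)
        task.set_user_properties(
            pod_exit_code=str(term.exit_code),
            pod_exit_reason=term.reason or "",
            pod_exit_message=term.message or "",
        )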
The way I understand it:
- if you’re executing tasks locally (e.g. on your laptop) then you need this setting because the clearml package needs to know where to upload artifacts (artifacts aren’t proxied through the clearml-server, they are rather uploaded directly to the storage of your choice); see the sketch below
- if you’re executing code using the ClearML agent, then you can configure the agent the way I wrote earlier, and it will use your MinIO instance for uploading artifacts for all of the tasks it executes
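For the local-execution case, the same thing can also be set per task in code; a minimal sketch (the MinIO endpoint and bucket below are placeholders):
from clearml import Task

# output_uri plays the same role as sdk.development.default_output_uri in clearml.conf:
# it tells the clearml package where to upload artifacts/models directly
task = Task.init(
    project_name="examples",
    task_name="local-run",
    output_uri="s3://my-minio-host:9000/my-bucket",  # placeholder MinIO endpoint/bucket
)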
Deployment is using k8s (docker.io/allegroai/clearml:2.0.0-613)
@<1523701070390366208:profile|CostlyOstrich36> they don't, as the pod is killed as soon as the process inside exceeds the memory limit.
Logging the pod exit code and status message before deleting the pod would be very useful. The data scientists would see that an OOM happened and wouldn't need to bother other teams to find out what happened.
I'm not talking about node failure, rather pod failure, which is out-of-memory in 99% of the cases.
Any ideas @<1523701087100473344:profile|SuccessfulKoala55> ?
Yes @<1523701070390366208:profile|CostlyOstrich36>
Hey @<1523701070390366208:profile|CostlyOstrich36> , could you provide any suggestions here, please?
Thanks @<1806497735218565120:profile|BrightJellyfish46>
Yes, that seems like an option too. I found this as well (in case someone looks for it in the future):
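# get the controller of the currently running pipeline and list the steps that are still running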
p = PipelineDecorator.get_current_pipeline()
p.get_running_nodes()