@<1523701205467926528:profile|AgitatedDove14> for me it hasn’t worked when I specified agentk8sglue.queue: "queue1,queue2" in the Helm chart options, which should be possible according to the documentation. What also hasn’t worked is the flag for creating a queue if it doesn’t exist ( agentk8sglue.createQueueIfNotExists ). Both failed parsing at runtime, so I’d say those are 2 bugs.
I'm not talking about node failure, but rather pod failure, which is an out-of-memory kill in 99% of cases.
@<1576381444509405184:profile|ManiacalLizard2> but the task controller has access to that information. Before deleting the pod, it could retrieve the exit code and status message that all pods provide, and log them under the "Info" section in ClearML.
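Something along these lines would already help (just a rough sketch using the kubernetes Python client and the public ClearML SDK; the function name and the task-id lookup are my own assumptions, not the actual glue code):

from kubernetes import client, config
from clearml import Task

config.load_incluster_config()  # the glue agent already runs inside the cluster
v1 = client.CoreV1Api()

def log_pod_exit(pod_name, namespace, task_id):
    # read the pod's final status before the pod gets deleted
    pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
    for cs in pod.status.container_statuses or []:
        term = cs.state.terminated
        if term is None:
            continue
        # surface the exit code / reason (e.g. OOMKilled) on the ClearML task
        Task.get_task(task_id=task_id).get_logger().report_text(
            f"Pod {pod_name} container {cs.name} terminated: "
            f"exit_code={term.exit_code}, reason={term.reason}, message={term.message}"
        )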
Hey @<1523701070390366208:profile|CostlyOstrich36> , could you provide any suggestions here, please?
Any ideas @<1523701087100473344:profile|SuccessfulKoala55> ?
Here’s how I do it in the clearml.conf of my agent:
sdk {
  aws {
    s3 {
      ...
    }
  }
  development {
    default_output_uri: "..."
  }
}
Thanks @<1806497735218565120:profile|BrightJellyfish46>
Awesome @<1729671499981262848:profile|CooperativeKitten94> , will definitely add that. It would also be very helpful if there was a way to delay deleting "completed/failed" pods. This is useful when something fails unexpectedly and ClearML logs are not enough to debug the issue. Does that make sense to you? I could contribute to your codebase if you're interested.
The way I understand it:
- if you’re executing tasks locally (e.g. on your laptop), then you need this setting because the clearml package needs to know where to upload artifacts (artifacts aren’t proxied through the clearml-server; they are uploaded directly to the storage of your choice) - see the sketch right after this list
- if you’re executing code using the ClearML agent, then you can configure the agent the way I wrote earlier, and it will use your MinIO instance for uploading artifacts for all of the tasks it executes
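For the local case you can also set it per task instead of (or on top of) clearml.conf, e.g. something like this (the project name and bucket URI are just placeholders):

from clearml import Task

# output_uri overrides sdk.development.default_output_uri for this task only
task = Task.init(
    project_name="examples",          # placeholder
    task_name="local run",
    output_uri="s3://my-minio:9000/my-bucket",  # placeholder MinIO endpoint/bucket
)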
I don’t use datasets so I don’t know, sorry, maybe @<1523701087100473344:profile|SuccessfulKoala55> can help
Hey! That sounds reassuring, thanks for the response. BTW, I didn’t mean to criticize your engineers or anything, I can see they work very hard. Kudos to them.
Yes, that seems like an option too. I also found this (in case someone looks for it in the future):
from clearml import PipelineDecorator
# grab the running pipeline and list the nodes currently executing
p = PipelineDecorator.get_current_pipeline()
p.get_running_nodes()
@<1523701070390366208:profile|CostlyOstrich36> they don't, as the pod is killed as soon as the process inside oversteps the memory limit
Deployment is using k8s ( docker.io/allegroai/clearml:2.0.0-613 )
Yes @<1523701070390366208:profile|CostlyOstrich36>
I know I can configure the “pod template”, but I’m looking for a solution where users can set their own variables without modifying Kubernetes secrets.
This hasn’t worked for me either, so I use multiple queues instead. Another reason I use multiple queues is that I need to specify different resource requirements for the pods launched by each queue (CPU-only vs GPU).
Logging the pod's exit code and status message before deleting the pod would be very useful. Data scientists would see that an OOM happened and wouldn't have to bother other teams to find out what went wrong.
@<1523701087100473344:profile|SuccessfulKoala55> my colleague submitted a pipeline whose component referenced a queue that doesn't actually exist; that's the issue. The "default" queue that handles the controller task just kept outputting error messages saying that this component can't be scheduled because of the missing queue. We just want a way to fail early if a queue doesn't exist, instead of the pipeline running indefinitely without ever failing.
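For now I'm considering a pre-check on our side before submitting, something like this (assuming the APIClient interface; the queue names are just examples):

from clearml.backend_api.session.client import APIClient

def assert_queues_exist(queue_names):
    # fail fast if any of the queues we plan to use is missing on the server
    existing = {q.name for q in APIClient().queues.get_all()}
    missing = [name for name in queue_names if name not in existing]
    if missing:
        raise ValueError(f"Unknown ClearML queue(s): {missing}")

assert_queues_exist(["default", "gpu-queue"])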
Hello @<1523703097560403968:profile|CumbersomeCormorant74> , I found your name on the company website, you're the VP of Engineering if I'm not mistaken? I wanted to directly ask you, since I'm having trouble reaching engineers on GitHub. What is your policy & process for OSS contributions? My team is a heavy user, and we occasionally find things to improve, but the experience for contributions hasn't been great so far. Thanks for making ClearML open-source!
Huh, I see. Thanks for your answers. How difficult would it be to implement some way to automatically infer repository information for components, or a flag repo_inherit (or similar) when defining a component (which would inherit the repository information from the controller)? My workflow is based around executing code that lives in the same repository, so it's cumbersome having to specify repository information all over the place and to change the commit hash as I add new code.
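To illustrate, this is roughly what I have to repeat on every component today with the existing decorator arguments (the repo URL, branch and commit below are placeholders):

from clearml import PipelineDecorator

@PipelineDecorator.component(
    return_values=["result"],
    repo="https://github.com/my-org/my-repo.git",  # placeholder repo URL
    repo_branch="main",
    repo_commit="abc1234",  # placeholder; has to be bumped whenever I push new code
)
def step_one(x):
    # component body runs in the cluster with the repo above checked out
    return x + 1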
I think so, but I haven't investigated what exactly the problem is; I'll report it though.
no worries @<1523701205467926528:profile|AgitatedDove14>
when I add repo="." to the definition of all my component decorators it works (but not on the pipeline decorator), and it doesn't work without that part… the problem I'm having now is that my components hang when executed in the cluster… I have 2 agents deployed (default and services queues)