I'm not talking about node failure but rather pod failure, which is out-of-memory in 99% of the cases.
Thanks @<1806497735218565120:profile|BrightJellyfish46>
@<1523701070390366208:profile|CostlyOstrich36> they don't, as the pod is killed as soon as the process inside exceeds the memory limit
Here’s how I do it using clearml.conf config for my agent:
sdk {
  aws {
    s3 {
      ...
    }
  }
  development {
    default_output_uri: ""
  }
}
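In case it helps, the elided s3 part is just the usual credentials section; for a MinIO endpoint it looks roughly like this (host/key/secret below are placeholders, not my real values):
sdk {
  aws {
    s3 {
      credentials: [
        {
          host: "minio.example.com:9000"  # placeholder MinIO endpoint
          key: "ACCESS_KEY"               # placeholder
          secret: "SECRET_KEY"            # placeholder
          multipart: false
          secure: false                   # set to true if MinIO is behind TLS
        }
      ]
    }
  }
}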
@<1523701087100473344:profile|SuccessfulKoala55> my colleague submitted a pipeline whose component referenced a queue that doesn't exist. The "default" queue that handles the controller task just kept printing error messages saying this component can't be scheduled because the queue is missing. We just want a way to fail early if a queue doesn't exist, instead of the pipeline running indefinitely without ever actually failing.
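In the meantime, a minimal workaround sketch we could add at the top of our launcher script, assuming the APIClient queues endpoint accepts a name filter (the queue name here is just a placeholder):
from clearml.backend_api.session.client import APIClient

# fail fast if the target queue doesn't exist before launching the pipeline
client = APIClient()
if not client.queues.get_all(name="component-queue"):  # placeholder queue name
    raise RuntimeError("Queue 'component-queue' does not exist - aborting launch")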
no worries @<1523701205467926528:profile|AgitatedDove14>
Hello @<1523703097560403968:profile|CumbersomeCormorant74> , I found your name on the company website, you're the VP of Engineering if I'm not mistaken? I wanted to directly ask you, since I'm having trouble reaching engineers on GitHub. What is your policy & process for OSS contributions? My team is a heavy user, and we occasionally find things to improve, but the experience for contributions hasn't been great so far. Thanks for making ClearML open-source!
Hey @<1523701070390366208:profile|CostlyOstrich36> , could you provide any suggestions here, please?
the components start hanging indefinitely right after printing Starting Task Execution
@<1576381444509405184:profile|ManiacalLizard2> but the task controller has access to that information. Before deleting the pod, it could retrieve the exit code and status message that every pod provides and log them under the "Info" section in ClearML.
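Roughly what I have in mind, sketched with the official kubernetes Python client (pod/namespace names are placeholders, and I'm assuming the container status is still populated at that point):
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() when running outside the cluster
v1 = client.CoreV1Api()
pod = v1.read_namespaced_pod(name="clearml-id-abc123", namespace="clearml")  # placeholders
terminated = pod.status.container_statuses[0].state.terminated
print(terminated.exit_code, terminated.reason)  # e.g. 137, "OOMKilled"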
Hey! That sounds reassuring, thanks for the response. BTW, I didn’t mean to criticize your engineers or anything, I can see they work very hard. Kudos to them.
Huh, I see. Thanks for your answers. How difficult would it be to implement some way of automatically inferring repository information for components, or a flag like repo_inherit (or similar) when defining a component (which would inherit the repository information from the controller)? My workflow is based around executing code that lives in the same repository, so it's cumbersome having to specify repository information all over the place and to bump the commit hash every time I add new code.
@<1523701205467926528:profile|AgitatedDove14> for me it hasn't worked when I specified agentk8sglue.queue: "queue1,queue2" in the Helm chart options, which should be possible according to the documentation. What also hasn't worked is the flag for creating a queue if it doesn't exist ( agentk8sglue.createQueueIfNotExists ). Both failed parsing at runtime, so I'd say those are 2 bugs.
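For reference, this is roughly the values snippet I tried (queue names are placeholders):
agentk8sglue:
  queue: "queue1,queue2"         # the multi-queue syntax from the docs -- failed to parse for me
  createQueueIfNotExists: true   # also failed to parse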
Deployment is using k8s ( docker.io/allegroai/clearml:2.0.0-613 )
Yes, that seems like an option as well. I also found this (in case someone looks for it in the future):
from clearml.automation.controller import PipelineDecorator

p = PipelineDecorator.get_current_pipeline()  # the currently running pipeline controller
p.get_running_nodes()  # nodes that are still running
Logging the pod's exit code and status message before deleting the pod would be very useful. The data scientists would see that an OOM happened and wouldn't need to bother other teams to find out what went wrong.
The way I understand it:
- if you’re executing tasks locally (e.g. on your laptop), then you need this setting because the clearml package needs to know where to upload artifacts (artifacts aren’t proxied through the clearml-server; they are uploaded directly to the storage of your choice; see also the snippet after this list)
- if you’re executing code using the ClearML agent, then you can configure the agent the way I wrote earlier, and it will use your MinIO instance for uploading artifacts for all of the tasks it executes
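That snippet: when running locally, you can also set the destination per task instead of via clearml.conf, something along these lines (project/task names and the URI are placeholders):
from clearml import Task

task = Task.init(
    project_name="my-project",                          # placeholder
    task_name="local-run",                              # placeholder
    output_uri="s3://minio.example.com:9000/clearml",   # placeholder MinIO bucket URI
)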
Any ideas @<1523701087100473344:profile|SuccessfulKoala55> ?
I don’t use datasets so I don’t know, sorry, maybe @<1523701087100473344:profile|SuccessfulKoala55> can help
when I add repo="." to the definition of all my component decorators it works (but not the pipeline decorator), but it doesn’t work without that part… the problem I’m having now is that my components hang when executed in the cluster… I have 2 agents deployed (default and services queues)
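Concretely, this is the shape that works for me right now (function body and queue name are just illustrative):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(repo=".", execution_queue="default")  # repo="." is the part that makes it work
def preprocess(path: str):
    ...  # component body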
Yes @<1523701070390366208:profile|CostlyOstrich36>
Awesome @<1729671499981262848:profile|CooperativeKitten94> , will definitely add that. It would also be very helpful if there were a way to delay deleting "completed/failed" pods. This is useful when something fails unexpectedly and the ClearML logs are not enough to debug the issue. Does that make sense to you? I could contribute to your codebase if you're interested.
I think so, but I haven’t investigated what exactly the problem is; I’ll report it, though.
This hasn’t worked for me either, so I use multiple queues instead. Another reason I use multiple queues is that I need to specify different resource requirements for the pods launched by each queue (CPU-only vs. GPU).
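In my setup that ends up being a separate agent deployment per queue, each with its own base pod template, something along these lines (a sketch of the GPU values file; I don't remember the exact keys off the top of my head, so double-check against your chart version's values.yaml):
agentk8sglue:
  queue: "gpu-queue"              # placeholder queue name
  basePodTemplate:
    resources:
      limits:
        nvidia.com/gpu: 1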
@<1523701205467926528:profile|AgitatedDove14> FYI, I managed to fix the issue. I replaced from clearml import PipelineDecorator with from clearml.automation.controller import PipelineDecorator and it suddenly works. What a weird issue.