Yea, I just realized that you would also need to specify different subnets, etc… not sure how easy it is 😞 But it would be very valuable, on-demand GPU instances are so hard to spin up nowadays in AWS 😄
Ok, and if that's not the case, it will fall back to 3.8, right? Would it be possible to support such a use case (have the clearml-agent set up a different Python version when a task needs it)?
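For reference, one knob that may be relevant here (assuming recent clearml-agent versions still expose it) is the `agent.python_binary` setting in `clearml.conf`, which pins the agent to a specific interpreter; the path below is just an example:

```
# clearml.conf (sketch; key taken from clearml-agent's default config)
agent {
    # point the agent at a specific interpreter instead of its default
    python_binary: "/usr/bin/python3.9"
}
```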
Sure, it’s because of a very annoying bug that I shared in this thread https://clearml.slack.com/archives/CTK20V944/p1648647503942759 , which I haven’t been able to solve so far.
I’m not sure you can downgrade that easily ...
Yea, that’s what I thought. That’s a bit of a pain for me now, I hope I can find a way to fix the bug somehow
I am looking for a way to gracefully stop the task (clean up artifacts, shutdown backend service) on the agent
Ok, but that means this cleanup code should live somewhere other than inside the task itself, right? Otherwise it won't be executed, since the task will be killed
The task requires this service, so the task starts it on the machine - Then I want to make sure the service is closed by the task upon completion/failure/abortion
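One generic way to do this from inside the task (not ClearML-specific; a sketch assuming the service is started via `subprocess` and that the agent delivers SIGTERM on abort — `launch_service` and its `cmd` argument are hypothetical names):

```python
import atexit
import signal
import subprocess
import sys

def launch_service(cmd):
    """Start the backend service as a subprocess and register handlers
    so it is stopped on completion, failure, or abort."""
    proc = subprocess.Popen(cmd)

    def _shutdown():
        if proc.poll() is None:       # still running?
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()           # force-kill if it ignores SIGTERM

    def _on_sigterm(signum, frame):
        _shutdown()
        sys.exit(143)

    atexit.register(_shutdown)                  # normal exit / exceptions
    signal.signal(signal.SIGTERM, _on_sigterm)  # agent aborting the task
    return proc
```

The `atexit` hook covers completion and unhandled exceptions, while the SIGTERM handler covers the agent killing the task; whether the agent actually uses SIGTERM is an assumption worth verifying for your setup.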
Yea, so I assume that training my models in docker will be slightly slower, so I'd like to avoid it. For everything else, using docker is convenient
mmh it looks like what I was looking for, I will give it a try 🙂
There is a pinned github thread on https://github.com/allegroai/clearml/issues/81 , seems to be the right place?
Sure, I opened an issue https://github.com/allegroai/clearml/issues/288 unfortunately I don't have time to open a PR 🙏
Hi SuccessfulKoala55 , Yes it’s for the same host/bucket - I’ll try with a different browser
CostlyOstrich36 yes, when I scroll up, a new events.get_task_log is fired and the response doesn’t contain any log (but it should)
No, they have different names - I will try to update both agents to the latest versions
I think it comes from the web UI of the version 1.2.0 of clearml-server, because I didn’t change anything else
Hi CostlyOstrich36 , one more observation: it looks like when I don’t open the experiment in the webUI before it is finished, then I get all the logs correctly. It is when I open the experiment in the webUI while it is running that I don’t see all the logs.
So it looks like there is a caching effect: the logs are retrieved only once, when I open the experiment for the first time, and not (or rarely) afterwards. Is that possible?
CostlyOstrich36 , this also happens with clearml-agent 1.1.1 on an AWS instance…
I am sorry that the information I can give isn’t very precise, but it’s the best I can do - Is this bug happening only to me?
CostlyOstrich36 , actually this only happens for a single agent. The weird thing is that I have a machine with two GPUs, and I spawn two agents, one per GPU. Both have the same version. For one, I can see all the logs, but not for the other
Notice the last line should not have --docker
Did you mean --detached ?
I also think we need to make sure we monitor all agents (this is important as this is the trigger to spin down the instance)
That's what I thought, yea, no problem, it was rather a question. If I encounter the need for that, I will adapt and open a PR 🙂
Also maybe we are not on the same page - by clean up, I mean kill a detached subprocess on the machine executing the agent
it also happens without hitting F5 after some time (~hours)
How about the overhead of running the training on docker on a VM?
I just moved one experiment to another project; after moving it, I am taken to the new project, where the layout is then reset
CostlyOstrich36 Were you able to reproduce it? That’s rather annoying 😅
Hi AgitatedDove14 , that’s super exciting news! 🤩 🚀
Regarding the two outstanding points:
In my case, I’d maintain a client python package that takes care of the pre/post processing of each request, so that I only send the raw data to the inference service and I post process the raw output of the model returned by the inference service. But I understand why it might be desirable for the users to have these steps happening on the server. What is challenging in this context? Defining how t...
So I created a symlink in /opt/train/data -> /data
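The equivalent of that symlink can be sketched in Python (using a temp dir here so the snippet runs anywhere; the real paths in the discussion were /opt/train/data -> /data):

```python
import os
import tempfile

# Stand-in paths: "data" plays the role of /data, and "opt_train_data"
# plays the role of /opt/train/data.
root = tempfile.mkdtemp()
target = os.path.join(root, "data")
link_path = os.path.join(root, "opt_train_data")

os.makedirs(target)
os.symlink(target, link_path)   # link_path -> target

# Resolving the link leads back to the real data directory
assert os.path.realpath(link_path) == os.path.realpath(target)
```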