SuccessfulKoala55 Could you please point me to where I could quickly patch that in the code?
So the migration from one server to another + adding new accounts with password worked, thanks for your help!
Hey @<1523701205467926528:profile|AgitatedDove14>, actually I just realised that I was confused by the fact that when the task is reset, the sorting makes it disappear, which made it seem like it was deleted. I think it's a UX issue: when I click on reset,
- The popup shows "Deleting 100%"
- The task disappears from the list of tasks because of the sorting
This led me to think that there was a bug and the task was deleted
Hi SuccessfulKoala55, there it is > https://github.com/allegroai/clearml-server/issues/100
In my GitHub Action, I should just have a dummy clearml server and run the task there, connecting to this dummy clearml server
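Something like this could work in the workflow step, a minimal sketch assuming the dummy server runs on localhost with the default ports (the CI_CLEARML_* secret names are made up):
```python
import os

# Point the SDK at the dummy server started earlier in the job
# (hostnames/ports are assumptions, adjust to your setup)
os.environ["CLEARML_API_HOST"] = "http://localhost:8008"
os.environ["CLEARML_WEB_HOST"] = "http://localhost:8080"
os.environ["CLEARML_FILES_HOST"] = "http://localhost:8081"
os.environ["CLEARML_API_ACCESS_KEY"] = os.environ.get("CI_CLEARML_KEY", "")
os.environ["CLEARML_API_SECRET_KEY"] = os.environ.get("CI_CLEARML_SECRET", "")

from clearml import Task

# The task now reports to the dummy server instead of a production one
task = Task.init(project_name="ci-tests", task_name="dummy-run")
```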
Alright, experiment finished properly (all models uploaded). I will restart it to check again, but seems like the bug was introduced after that
Thanks SuccessfulKoala55 !
Maybe you could add to your docker-compose file an option for limiting the size of the logs: since there is no limit by default, their size will grow forever, which doesn't sound ideal https://docs.docker.com/compose/compose-file/#logging
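For reference, the kind of per-service option I mean, a sketch based on those docs (the values are arbitrary):
```yaml
services:
  apiserver:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"   # rotate each log file at 10 MB
        max-file: "3"     # keep at most 3 rotated files
```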
and then call task.connect_configuration probably
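For context, a minimal sketch of the pattern I have in mind (project, task, and config names are made up):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="config-demo")

# Register a config dict with the task; when running via an agent,
# the values edited in the UI are returned instead of the local ones
config = {"lr": 0.001, "batch_size": 32}
config = task.connect_configuration(configuration=config, name="training_config")
```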
You mean you "aborted the task" from the UI?
Yes exactly
I'm assuming from the leftover processes?
Most likely yes, but I don't see how clearml would have an impact here; I am more inclined to think it would be a pytorch dataloader issue, although I don't see why
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
yes in venv mode, I'll try with the latest version as well
@<1523701205467926528:profile|AgitatedDove14> I see other rcs on PyPI but no corresponding tags in the clearml-agent repo? Are these releases legit?
What is the latest rc of clearml-agent? 1.5.2rc0?
Downloading the artifacts is done only when actually calling get()/get_local_copy()
Yes, I rather meant: reproduce this behavior even for getting metadata on the artifacts
That said, you might have accessed the artifacts before any of them were registered
I called task.wait_for_status() to make sure the task is done
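To make sure we're talking about the same thing, roughly what my code does, a sketch (the task ID and artifact name are made up):
```python
from clearml import Task

task = Task.get_task(task_id="<task-id>")  # placeholder ID
task.wait_for_status()  # block until the task reaches a final state
task.reload()           # refresh so newly registered artifacts show up

# Accessing task.artifacts only touches metadata; the actual file is
# downloaded only on get() / get_local_copy()
artifact = task.artifacts["predictions"]
local_path = artifact.get_local_copy()
```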
SuccessfulKoala55 I am using ES 7.6.2
Hi @<1523701205467926528:profile|AgitatedDove14>, I want to circle back on this issue. It is still relevant, and I could collect the following on an ec2 instance running a clearml-agent with a stuck task:
- There seems to be a problem with multiprocessing: although I stopped the task, there are still many processes forked from the main training process. I guess these are zombies. Please check the htop tree.
- There is a memory leak somewhere, please see the screenshot of datadog mem...
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached --gpus 1 > ~/trains-agent.startup.log 2>&1
AgitatedDove14 one last question: how can I enforce a specific wheel to be installed?
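One way I believe this can be done is with Task.add_requirements before Task.init, so the agent installs exactly that version when reproducing the run; a sketch (the package and version are just examples):
```python
from clearml import Task

# Pin a specific version so the agent installs exactly this wheel
# (package/version here are just examples)
Task.add_requirements("torch", "==1.13.1")

task = Task.init(project_name="examples", task_name="pinned-wheel")
```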
Hi SuccessfulKoala55, thanks for the idea! The function registered with atexit.register() isn't called though; maybe the way the agent kills the task is not supported by atexit
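That would fit how atexit behaves: handlers are skipped when the process dies from an uncaught signal. A possible workaround, sketched below, is to convert SIGTERM into a clean interpreter exit (assuming the agent sends SIGTERM before a hard kill):
```python
import atexit
import signal
import sys

def cleanup():
    print("cleaning up")

atexit.register(cleanup)

def on_sigterm(signum, frame):
    # Turn SIGTERM into a normal exit so atexit handlers run;
    # nothing can help against SIGKILL though
    sys.exit(0)

signal.signal(signal.SIGTERM, on_sigterm)
```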
Yes, it would be very valuable to be able to tweak that param. Currently it's quite annoying because it's set to 30 mins, so when a worker is killed by the autoscaler, I have to wait 30 mins before the autoscaler spins up a new machine: it thinks there are already enough agents available, while in reality the agent is down
SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but that only deleted ~20M documents out of 320M in the events-training_debug_image-xyz index. I would now like to understand which experiments contain most of the documents, so I can delete them. I would like to aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?
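Something like this might do it, a sketch using a terms aggregation (I'm assuming the event documents carry a task field holding the experiment ID, and that ES is reachable on localhost:9200):
```python
import requests

# Count documents per task (experiment) in the debug-image events index
query = {
    "size": 0,
    "aggs": {
        "docs_per_task": {
            "terms": {"field": "task", "size": 100}  # top 100 experiments
        }
    },
}
resp = requests.post(
    "http://localhost:9200/events-training_debug_image-xyz/_search",
    json=query,
)
print(resp.json()["aggregations"]["docs_per_task"]["buckets"])
```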
Oh wow! Is it possible to not specify a remote task (if I am working with Task.set_offline(True))?
Is it different from Task.set_offline(True)?
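For reference, my current understanding of the offline flow, a sketch (the session path is illustrative):
```python
from clearml import Task

# Nothing is sent to any server; everything is stored in a local
# session folder instead
Task.set_offline(offline_mode=True)

task = Task.init(project_name="examples", task_name="offline-run")
task.get_logger().report_scalar("loss", "train", value=0.5, iteration=0)
task.close()

# Later, with a server available, the stored session can be imported:
# Task.import_offline_session("/path/to/offline_session.zip")
```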
I ended up dropping omegaconf altogether