Guys the experiments I had running didn't fail, they just waited and reconnected, this is crazy cool
wow if this works that's amazing
Thanks! I will investigate further; I am thinking that the AWS instance might have been stuck for an unknown reason (becoming unhealthy)
I see what I described in https://allegroai-trains.slack.com/archives/CTK20V944/p1598522409118300?thread_ts=1598521225.117200&cid=CTK20V944 :
randomly, one of the two experiments is shown for that agent
I will go for lunch actually, back in ~1h
I want to make sure that an agent has finished uploading its artifacts before marking itself as complete, so that the controller does not try to access these artifacts while they are not yet available
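For reference, something like this is what I have in mind inside the step task itself, assuming `Task.flush(wait_for_uploads=True)` behaves the way I expect (project, task and artifact names below are just placeholders):

```python
from trains import Task  # `from clearml import Task` on newer versions

task = Task.init(project_name="pipeline-demo", task_name="step 1")

# produce and register the artifact (path and name are placeholders)
task.upload_artifact(name="features", artifact_object="features.csv")

# block until every pending upload has actually finished, so the task is only
# marked completed once the artifacts really exist on the server / file store
task.flush(wait_for_uploads=True)
```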
No, I want to launch the second step after the first one is finished and all its artifacts are uploaded
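Something along these lines is what I am after, a rough sketch using the newer clearml `PipelineController` signature and its `parents` argument (task/project names are placeholders, and I am not sure this alone guarantees the artifacts are fully uploaded):

```python
from clearml import PipelineController

pipe = PipelineController(name="two-step pipeline",
                          project="pipeline-demo", version="1.0")

# step_2 is only launched once step_1 has completed
pipe.add_step(name="step_1",
              base_task_project="pipeline-demo", base_task_name="step 1")
pipe.add_step(name="step_2", parents=["step_1"],
              base_task_project="pipeline-demo", base_task_name="step 2")

pipe.start()
```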
Yes! Thanks!
Yes! Not a strong use case though; rather, I wanted to ask if it was supported somehow
I specified torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl and it didn't detect the link; it tried to install the latest version: 1.6.0
Is it because I did not specify --gpu 0 that the agent, by default, pulls one experiment per available GPU?
TimelyPenguin76 That sounds amazing! Will there be a fallback mechanism as well? p3.2xlarge instances are often in short supply; it would be nice to define one resource requirement as first choice (e.g. p3.2xlarge) -> if not available -> use another resource requirement (e.g. g4dn)
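To make the idea concrete, here is a rough standalone sketch of the fallback behaviour I have in mind, using plain boto3 (this is not the actual autoscaler code; the AMI id, region and instance types are made up):

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

# preferred instance type first, fallback(s) after it
INSTANCE_TYPES = ["p3.2xlarge", "g4dn.xlarge"]


def launch_with_fallback(ami_id: str) -> str:
    for instance_type in INSTANCE_TYPES:
        try:
            resp = ec2.run_instances(
                ImageId=ami_id,
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            # capacity shortages surface as InsufficientInstanceCapacity;
            # in that case try the next instance type in the list
            if err.response["Error"]["Code"] != "InsufficientInstanceCapacity":
                raise
    raise RuntimeError("No instance type in the fallback list could be launched")
```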
Ok, in that case it probably doesn't work, because if the default value is 10 secs, it doesn't match what I get in the logs of the experiment: every second the tqdm adds a new line
- trains-agent-1: runs an experiment for a long time (>12h), then picks up a new experiment on top of the one still running
- trains-agent-2: runs only one experiment at a time, normal
- trains-agent-3: runs only one experiment at a time, normal
In total: 4 experiments running on 3 agents
SuccessfulKoala55 Could you please point me to where I could quickly patch that in the code?
So the migration from one server to another + adding new accounts with password worked, thanks for your help!
Hey @<1523701205467926528:profile|AgitatedDove14> , Actually I just realised that I was confused by the fact that when the task is reset, it disappears because of the sorting, making it seem like it was deleted. I think it's a UX issue: when I click on reset:
- The pop-up shows "Deleting 100%"
- The task disappears from the list of tasks because of the sorting
This led me to think that there was a bug and the task was deleted.
Hi SuccessfulKoala55 , there it is > https://github.com/allegroai/clearml-server/issues/100
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why
In my GitHub Action, I should just spin up a dummy ClearML server, then run the task there, connected to that dummy server
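i.e. something like this at the top of the test script, assuming the CI job already started a local clearml-server via docker-compose (host/ports are the server defaults; the access/secret keys are dummy values I would create on that throwaway server):

```python
import os

# point the SDK at the dummy server started earlier in the CI job
os.environ["CLEARML_API_HOST"] = "http://localhost:8008"
os.environ["CLEARML_WEB_HOST"] = "http://localhost:8080"
os.environ["CLEARML_FILES_HOST"] = "http://localhost:8081"
os.environ["CLEARML_API_ACCESS_KEY"] = "ci-dummy-access-key"
os.environ["CLEARML_API_SECRET_KEY"] = "ci-dummy-secret-key"

from clearml import Task

# run the task against the dummy server as a smoke test
task = Task.init(project_name="ci", task_name="github-action smoke test")
```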
Alright, experiment finished properly (all models uploaded). I will restart it to check again, but seems like the bug was introduced after that
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file for limiting the size of the logs: since there is no limit by default, their size will grow forever, which doesn't sound ideal https://docs.docker.com/compose/compose-file/#logging