So I need to have this merging of small configuration files to build the bigger one
This allows me to inject yaml files into other yaml files
so that any error that could arise from communication with the server could be tested
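Something like this is what I mean, roughly (a minimal sketch with PyYAML; the file names are just placeholders):
```python
import yaml  # PyYAML

def merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; values in `override` win."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

# file names are placeholders, just to illustrate the idea
with open("base.yaml") as f:
    config = yaml.safe_load(f) or {}

for fragment in ("server.yaml", "agent.yaml"):
    with open(fragment) as f:
        config = merge(config, yaml.safe_load(f) or {})

with open("merged.yaml", "w") as f:
    yaml.safe_dump(config, f)
```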
I think the best case scenario would be that ClearML maintains a GitHub Action that sets up a dummy clearml-server, so that anyone can use it as a basis to run their tests: they would just have to change the URL of the server to the local one started by the GitHub Action and could seamlessly test all their code, wdyt?
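What I picture the tests doing is roughly this (a sketch only; I am assuming the usual trains/clearml env-var overrides and the default clearml-server ports):
```python
import os

# point the SDK at the dummy server started inside the CI job
# (assuming the standard env-var overrides and default ports 8008/8080/8081)
os.environ["TRAINS_API_HOST"] = "http://localhost:8008"
os.environ["TRAINS_WEB_HOST"] = "http://localhost:8080"
os.environ["TRAINS_FILES_HOST"] = "http://localhost:8081"

from trains import Task

def test_can_create_task():
    # a trivial smoke test against the local dummy server
    task = Task.init(project_name="ci-tests", task_name="smoke-test")
    task.close()
```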
Even if I moved the GitHub workers internally, where they could have access to the prod server, I am not sure I would like that, because it would pile up unnecessary test data in the prod server
Guys, the experiments I had running didn't fail, they just waited and reconnected, this is crazy cool
wow if this works that's amazing
Thanks! I will investigate further, I am thinking that the AWS instance might have been stuck for an unknown reason (becoming unhealthy)
I see what I described in https://allegroai-trains.slack.com/archives/CTK20V944/p1598522409118300?thread_ts=1598521225.117200&cid=CTK20V944 :
randomly, one of the two experiments is shown for that agent
I will go for lunch actually, back in ~1h
I want to make sure that an agent has finished uploading its artifacts before marking itself as complete, so that the controller does not try to access these artifacts while they are not yet available
No, I want to launch the second step after the first one is finished and all its artifacts are uploaded
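Something like this is what I have in mind for the first step (a sketch; I am assuming the SDK exposes wait_on_upload on upload_artifact and wait_for_uploads on flush):
```python
from trains import Task  # `from clearml import Task` on newer versions

task = Task.current_task()

# block until the artifact is actually stored on the fileserver
# (assuming the installed SDK version supports wait_on_upload)
task.upload_artifact("step_one_output", artifact_object="output.csv", wait_on_upload=True)

# flush any pending uploads before the task is marked as completed,
# so the controller never sees a "completed" task with missing artifacts
task.flush(wait_for_uploads=True)
```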
Yes, Thanks!
Yes! Not a strong use case though; rather, I wanted to ask if it was supported somehow
I specified torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl and it didn't detect the link; it tried to install the latest version, 1.6.0
Is it because I did not specify --gpu 0 that the agent, by default, pulls one experiment per available GPU?
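For context, this is roughly how I start the agent versus pinning it to a single GPU (a sketch; the queue name is a placeholder):
```
# current: no GPU pinning
trains-agent daemon --queue default

# pin the agent to GPU 0 only
trains-agent daemon --gpus 0 --queue default
```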
TimelyPenguin76 That sounds amazing! Will there be a fallback mechanism as well? p3.2xlarge instances are often in shortage; it would be nice to define one resource requirement as first choice (e.g. p3.2xlarge) -> if not available -> use another resource requirement (e.g. g4dn)
Ok, in that case it probably doesn't work, because if the default value is 10 secs, it doesn't match what I get in the logs of the experiment: every second tqdm adds a new line
trains-agent-1: runs an experiment for a long time (>12h), then picks a new experiment on top of the long one still running
trains-agent-2: runs only one experiment at a time, normal
trains-agent-3: runs only one experiment at a time, normal
In total: 4 experiments running for 3 agents
SuccessfulKoala55 Could you please point me to where I could quickly patch that in the code?