Yes, I guess that's fine then - Thanks!
I don't have a registry to push my image to. I think I can get around it actually - will it work if I just build the image locally once, then start the agent? Docker would recognise that image locally and just use it, right? I won't need to update that image often anyway
Oh nice, thanks for pointing this out!
Oh, and also use the colors of the series. That would be a killer feature: then I simply need to match the color of the series to the name to check the tags
Hi SuccessfulKoala55, thanks for the idea! The function isn't called even with atexit.register() though; maybe the way the agent kills the task is not supported by atexit
How exactly is the clearml-agent killing the task?
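(To illustrate what I suspect is going on - a minimal sketch, with made-up handler names; the SIGTERM workaround at the end assumes the agent sends a catchable signal rather than SIGKILL:)
import atexit
import signal
import sys

def cleanup():
    # whatever must run before the process dies
    print("cleaning up")

# atexit handlers run on normal interpreter exit, but NOT when the
# process is killed by an unhandled signal (and never on SIGKILL)
atexit.register(cleanup)

# translating SIGTERM into a normal exit makes the atexit handler fire
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))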
Yes AnxiousSeal95, a stopped instance means you don't pay for it, only for its storage, as described in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html . So AgitatedDove14, increasing the IDLE timeout would still make me pay for the instances while they are idle.
Do you get stopped instances instantly when you ask for them?
Well, that's a good question. That's what I observed some time ago, but according to their https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/...
Nevertheless, there might still be some value in that, because it would reduce the startup time by removing the initial setup of the agent and the downloading of the data to the instance - but not as much as I described initially, if stopped instances are bound to the same capacity limitations as newly launched ones
No, I agree, it's probably not worth it
In the comparison the problem will be the same, right? If I choose last/min/max values, it won't tell me the corresponding values for the other metrics. I could switch to graphs, group by metric and look manually for the corresponding values, but that quickly becomes cumbersome as the number of experiments compared grows
Sure 🙂 Opened https://github.com/allegroai/clearml/issues/568
Hi TimelyPenguin76, any chance this was fixed? 🙂
AgitatedDove14, my "uncommitted changes" ends with:
if __name__ == "__main__":
    task = clearml.Task.get_task(clearml.config.get_remote_task_id())
    task.connect(config)
    run()

from clearml import Task
Task.init()
UnevenDolphin73,
task = clearml.Task.get_task(clearml.config.get_remote_task_id())
worked, thanks
AgitatedDove14, so I copy-pasted locally the https://github.com/pytorch-ignite/examples/blob/main/tutorials/intermediate/cifar10-distributed.py script from the pytorch-ignite examples. Then I added a requirements.txt and called clearml-task
to run it on one of my agents. I adapted the script a bit (removed python-fire since it's not yet supported by clearml).
and this works. However, without the trick from UnevenDolphin73, the following won't work (Task.current_task() returns None):
if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()

from clearml import Task
Task.init()
So I guess the problem is that the following snippet:
from clearml import Task
Task.init()
should be added before the if __name__ == "__main__": ?
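In other words, a minimal sketch of the layout I'd expect to work (the project/task names and the config/run placeholders here are mine):
from clearml import Task

config = {"batch_size": 32}  # placeholder config

def run():
    pass  # the actual training entry point

# Task.init() at module level, before the __main__ guard, so the task
# already exists when Task.current_task() is called
Task.init(project_name="examples", task_name="cifar10-distributed")

if __name__ == "__main__":
    task = Task.current_task()  # no longer returns None
    task.connect(config)
    run()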
I just moved one experiment to another project; after moving it I am taken to the new project, where the layout is then reset
Sorry, I refreshed the page and it's gone 🙂
Sorry, I was actually able to fix it (using 1.1.3), not sure what the problem was 🙂
Also, I can simply delete the /elastic_7 folder; I don't use it anymore (I have a remote ES cluster). In that case, I guess I would have enough space?
Ok, I am asking because I often see the autoscaler starting more instances than there are experiments in the queues, so I guess I just need to increase max_spin_up_time_min
I will try with that and keep you updated
Why would it solve the issue? max_spin_up_time_min should be the param defining how long to wait after starting an instance, not polling_interval_time_min, right?
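For reference, this is how I currently read those settings - just a sketch, the values are made up and the parameter names are the ones from the AWS autoscaler example:
# sketch of the relevant autoscaler settings (values are made up)
hyper_params = {
    # how often the autoscaler polls the queues for pending tasks
    "polling_interval_time_min": 5,
    # how long to wait for a newly started instance to come up
    # before giving up and spinning up another one
    "max_spin_up_time_min": 30,
    # how long an idle instance is kept alive before being spun down
    "max_idle_time_min": 15,
}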