SmugDolphin23 Actually adding agent.python_binary didn't work; it was not read by the clearml agent (in the logs dumped by the agent, agent.python_binary = (no value))
And so in the UI, in the Workers & Queues tab, I randomly see one of the two experiments for the worker that is running both experiments
When installed with http://get.docker.com , it works
So it seems like it doesn't copy /root/clearml.conf and it doesn't pass the environment variables (CLEARML_API_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY)
how would it interact with the clearml-server api service? would it be completely transparent?
In the comparison the problem will be the same, right? If I choose last/min/max values, it won’t tell me the corresponding values for the other metrics. I could switch to graphs, group by metric and look manually for the corresponding values, but that quickly becomes cumbersome as the number of compared experiments grows
Oh, and also use the colors of the series. That would be a killer feature. Then I would simply need to match the color of the series to the name to check the tags
to pass secrets to each experiment
It failed as well
(Even if I explicitly do my_task.close() )
Nevertheless there might still be some value in that, because it would allow reducing the startup time by removing the initial setup of the agent + the download of the data to the instance - but not as much as I described initially, if stopped instances are bound to the same capacity limitations as newly launched instances
yes, done! Is there something more to take into account than what I shared?
Yes AnxiousSeal95 , stopped instance meaning you don’t pay for it, only for its storage, as described in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html . So AgitatedDove14 increasing the IDLE timeout would still make me pay for the instance while it is idle.
Do you get stopped instances instantly when you ask for them?
Well that’s a good question, that’s what I observed some time ago, but according to their https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/...
Interesting - I can reproduce easily
Hi AgitatedDove14 , so I ran 3 experiments:
One with my current implementation (using "fork")
One using "forkserver"
One using "forkserver" + the DataLoader optimization
I sent you the results via private message, here are the outcomes:
fork -> 101 mins, low RAM usage (5 GB constant), almost no IO
forkserver -> 123 mins, high RAM usage (16 GB, fluctuating), high IO
forkserver + DataLoader optimization -> 105 mins, high RAM usage (from 28 GB down to 16 GB), high IO
CPU/GPU curves are the same for the 3 experiments...
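For reference, this is roughly what the forkserver + DataLoader optimization run looked like (just a sketch: I'm using persistent_workers=True as a stand-in for the DataLoader optimization, and the dummy dataset, batch size and worker count are placeholders, not my real setup):

import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy dataset standing in for my real one
dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.randint(0, 10, (1024,)))

# switch the multiprocessing start method from the default "fork" to "forkserver"
torch.multiprocessing.set_start_method("forkserver", force=True)

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,
    persistent_workers=True,  # keep worker processes alive between epochs
    pin_memory=True,
)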
(docker was installed with sudo snap install docker)
Could you please point me to the relevant component? I am not familiar with TypeScript unfortunately 😞
There is a pinned GitHub thread at https://github.com/allegroai/clearml/issues/81 , seems to be the right place?
sure, will be happy to debug that 🙂
AgitatedDove14 I am actually considering rolling back to 1.1.0, so 1.3.0 is not really an option for now
not really, because it is in the middle of the controller task; there are other things to be done afterwards (retrieving results, logging new artifacts, creating new tasks, etc.)
awesome 🎉
Maybe then we can extend task.upload_artifact?

def upload_artifact(..., wait_for_upload: bool = False):
    ...
    if wait_for_upload:
        self.flush(wait_for_uploads=True)
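In the meantime, I guess the same effect can be obtained from the caller's side, something like this (the artifact name and object are just placeholders):

from clearml import Task

task = Task.current_task()
task.upload_artifact("predictions", artifact_object={"foo": "bar"})
# block until the upload has actually finished before continuing with the controller logic
task.flush(wait_for_uploads=True)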
So either I specify agent.python_binary: python3.8 in the clearml-agent configuration, as you suggested, or I enforce the task locally to run with python3.8 using task.data.script.binary
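For the second option, this is the kind of thing I have in mind (just a sketch; the project/task names are placeholders and I haven't checked whether an extra update call is needed for the change to be persisted on the server):

from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")
# hint the agent to use python3.8 when it reproduces this task remotely
task.data.script.binary = "python3.8"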
I assume you’re using a self-hosted server?
Yes
Oh nice, thanks for pointing this out!