So it seems like it doesn't copy /root/clearml.conf and it doesn't pass the environment variables (CLEARML_API_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY)
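In case it helps, here is a minimal sketch of the workaround I'm considering: set the credentials programmatically before Task.init instead of relying on clearml.conf or forwarded environment variables (the values below are placeholders):

from clearml import Task

# Placeholder credentials - in practice they would come from wherever the
# container can actually read them (e.g. a mounted secrets file)
Task.set_credentials(
    api_host="https://api.clear.ml",
    key="<CLEARML_API_ACCESS_KEY>",
    secret="<CLEARML_API_SECRET_KEY>",
)

# Task.init can now pick up the credentials set above instead of looking for clearml.conf
task = Task.init(project_name="examples", task_name="inside-container check")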
SuccessfulKoala55 I tried to set up the clearml-agent on a different machine and now I get a different error message in the logs:
Warning: could not locate requested Python version 3.6, reverting to version 3.6
clearml_agent: ERROR: Python executable with version '3.6' defined in configuration file, key 'agent.default_python', not found in path, tried: ('python3.6', 'python3', 'python')
how would it interact with the clearml-server api service? would it be completely transparent?
I have CUDA 11.0 installed, but on another machine with 11.0 installed as well, trains downloads torch for CUDA 10.1. I guess this is because no wheel exists for torch==1.3.1 and CUDA 11.0
In the comparison the problem will be the same, right? If I choose last/min/max values, it won't tell me the corresponding values for the other metrics. I could switch to graphs, group by metric and look manually for the corresponding values, but that quickly becomes cumbersome as the number of experiments compared grows
Oh, and also use the colors of the series. That would be a killer feature. Then I would simply need to match the color of the series to the name to check the tags
to pass secrets to each experiment
I would like to try it, to see if it solves the issue of some dependencies not being found even though they are installed, when using --system-site-packages
It failed as well
(Even if I explicitly do my_task.close())
Nevertheless, there might still be some value in that, because it would allow reducing the startup time by removing the initial setup of the agent and the downloading of the data to the instance - but not as much as I described initially, if stopped instances are bound to the same capacity limitations as newly launched instances
yes, done! Is there something more to take into account than what I shared?
Yes AnxiousSeal95, a stopped instance means you don't pay for it, only for its storage, as described in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html . So AgitatedDove14, increasing the IDLE timeout would still make me pay for the instance while it is idle.
Do you get stopped instances instantly when you ask for them?
Well that's a good question, that's what I observed some time ago, but according to their https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/...
Interesting - I can reproduce easily
Hi AgitatedDove14, so I ran 3 experiments:
One with my current implementation (using "fork")
One using "forkserver"
One using "forkserver" + the DataLoader optimization (sketched below)
I sent you the results via DM, here are the outcomes:
fork -> 101 mins, low RAM usage (5 GB constant), almost no IO
forkserver -> 123 mins, high RAM usage (16 GB, fluctuating), high IO
forkserver + DataLoader optimization -> 105 mins, high RAM usage (from 28 GB down to 16 GB), high IO
CPU/GPU curves are the same for the 3 experiments...
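For reference, what I mean by the "forkserver" + DataLoader optimization is roughly the setup below (a minimal sketch with a stand-in dataset; persistent_workers assumes PyTorch >= 1.7):

import torch
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    # stand-in dataset, purely illustrative
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 10

if __name__ == "__main__":
    # "forkserver" start method + persistent workers: worker processes are
    # created once and reused across epochs instead of being re-forked each epoch
    loader = DataLoader(
        RandomDataset(),
        batch_size=32,
        num_workers=4,
        multiprocessing_context="forkserver",
        persistent_workers=True,
    )

    for batch, labels in loader:
        pass  # training step goes here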
(docker was installed with sudo snap install docker)
Could you please point me to the relevant component? I am not familiar with TypeScript, unfortunately
Usually one or two tags, indeed. Task IDs are not so convenient, but only because they are not displayed on the page, so I have to go back to another page to check the ID of each experiment. Maybe just showing the ID of each experiment on the SCALAR page would already be great, wdyt?
There is a pinned GitHub thread at https://github.com/allegroai/clearml/issues/81 , it seems to be the right place?
sure, will be happy to debug that
Is there any channel where we can see when new self-hosted server versions are published?
You already fixed the problem with pyjwt in the newest versions of clearml / clearml-agent, so all good
AgitatedDove14 I am actually considering rolling back to 1.1.0, so 1.3.0 is not really an option for now
not really, because it is in the middle of the controller task; there are other things to be done afterwards (retrieving results, logging new artifacts, creating new tasks, etc.)
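Something like this, roughly (a minimal sketch of the flow I mean, task names and ids are illustrative):

from clearml import Task

# controller task that launches a child task in the middle of its own logic
controller = Task.init(project_name="examples", task_name="controller")

child = Task.clone(source_task="<template_task_id>", name="child step")
Task.enqueue(child, queue_name="default")
child.wait_for_status()  # block until the child task finishes

# ...and then continue: retrieve results, log new artifacts, create more tasks, etc.
metrics = child.get_last_scalar_metrics()
controller.upload_artifact(name="child_metrics", artifact_object=metrics)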
awesome
Maybe then we can extend task.upload_artifact ?
def upload_artifact(..., wait_for_upload: bool = False):
    ...
    if wait_for_upload:
        self.flush(wait_for_uploads=True)
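In the meantime, the workaround I have in mind is simply calling flush right after the upload (a minimal sketch, names are illustrative):

from clearml import Task

task = Task.init(project_name="examples", task_name="artifact upload")

# returns as soon as the upload is scheduled in the background
task.upload_artifact(name="results", artifact_object={"accuracy": 0.9})

# block until all pending uploads have actually completed
task.flush(wait_for_uploads=True)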
So I cannot ssh to the agent anymore after starting clearml-session on it