Thanks SuccessfulKoala55!
Maybe you could add an option to your docker-compose file for limiting the size of the logs; since there is no limit by default, their size will grow forever, which doesn't sound ideal: https://docs.docker.com/compose/compose-file/#logging
We would be super happy to have the possibility to document experiments (a new tab in the experiments UI) with a Markdown editor!
Does the agent install the nvidia-container-toolkit, so that the GPUs of the instance can be accessed from inside the Docker container running JupyterLab?
I understand, but then why is docker mode a CLI option if we always have to use it for things to work?
Yeah, I just realized that you would also need to specify different subnets, etc… not sure how easy it is 😞 But it would be very valuable; on-demand GPU instances are so hard to spin up nowadays in AWS 😄
The main issue is that task_logger.report_scalar() is not reporting the scalars.
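For context, this is roughly how the call is wired up; just a minimal sketch with placeholder project/task names and a dummy loop, not the actual training code:

```python
# Minimal sketch (placeholder names): report scalars to the current task's logger.
from clearml import Task

task = Task.init(project_name="my_project", task_name="debug_scalars")  # placeholders
task_logger = task.get_logger()

for iteration in range(10):
    # title/series group the curve in the Scalars tab; iteration is the x-axis value.
    task_logger.report_scalar(
        title="loss", series="train", value=1.0 / (iteration + 1), iteration=iteration
    )
```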
Sure yes! As you can see I just added the block
```yaml
logging:
  driver: "json-file"
  options:
    max-size: "200k"
    max-file: "10"
```
to all services. Also in this docker-compose I removed the external binding of the ports for mongo/redis/es
Interesting! Something like that would be cool, yes! I just realized that custom plugins in Mattermost are written in Go, so it could be a good hackday for me to learn Go 😄
AgitatedDove14 yes! I now realise that the ignite event callbacks don't seem to be fired (I tried to print a debug message in a custom Events.ITERATION_COMPLETED handler) and I cannot see it logged
I opened an issue https://github.com/pytorch/ignite/issues/2343 in ignite's repo and a PR https://github.com/pytorch/ignite/pull/2344, could you please have a look? There might be a bug in clearml Task.init in distributed envs.
And I am wondering whether only the main process (rank=0) should attach the ClearMLLogger, or whether all the processes within the node should do that
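For reference, here is a minimal sketch of the rank-0-only variant I mean; the project/task names are placeholders and the import path is the ignite 0.4.x one, so treat it as an assumption rather than a confirmed fix:

```python
# Sketch: attach the ClearMLLogger only on the main process (rank 0).
# The import path below is the ignite 0.4.x one and may differ in other versions.
import ignite.distributed as idist
from ignite.engine import Events
from ignite.contrib.handlers.clearml_logger import ClearMLLogger

def setup_clearml_logging(trainer):
    # Non-main ranks skip ClearML logging entirely.
    if idist.get_rank() != 0:
        return None
    clearml_logger = ClearMLLogger(project_name="my_project", task_name="my_task")  # placeholder names
    clearml_logger.attach_output_handler(
        trainer,
        event_name=Events.ITERATION_COMPLETED,
        tag="training",
        output_transform=lambda loss: {"loss": loss},
    )
    return clearml_logger
```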
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer, and scheduler every epoch.
2. Have a separate thread that periodically pulls the instance metadata and checks if the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME (rough sketch after this list).
3. Have a service that periodically pulls failed experiments with the tag TO_RESUME, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra param.
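Rough sketch of step 2, assuming an AWS spot instance and polling from inside the experiment process; the interval and function name are made up:

```python
# Sketch: background thread that polls the EC2 spot metadata endpoint and tags
# the current ClearML task with TO_RESUME once the instance is marked for stop.
import threading
import time

import requests
from clearml import Task

SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(poll_seconds: int = 30):
    task = Task.current_task()
    while True:
        try:
            # Returns 200 with a JSON payload once a stop/termination is scheduled, 404 otherwise.
            if requests.get(SPOT_ACTION_URL, timeout=2).status_code == 200:
                task.add_tags(["TO_RESUME"])
                break
        except requests.RequestException:
            pass  # ignore transient metadata-service errors
        time.sleep(poll_seconds)

threading.Thread(target=watch_for_interruption, daemon=True).start()
```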
Hi SuccessfulKoala55 , will I be able to update all references to the old s3 bucket using this command?
I had this problem before
Thanks for clarifying! Maybe this could be clarified in the agent logs of the experiments with something like the following?
```
agent.cuda_driver_version = ...
agent.cuda_runtime_version = ...
```
AgitatedDove14 In theory yes there is no downside, in practice running an app inside docker inside a VM might introduce slowdowns. I guess it’s on me to check whether this slowdown is negligible or not
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
You already fixed the problem with pyjwt in the newest version of clearml/clearml-agents, so all good 😄
I was able to fix it by applying for a license and registering it
(by console you mean in the dashboard right? or the terminal?)
Hi SuccessfulKoala55, how can I know whether I'm logged in in this free access mode? I assume I am, since on the login page I only see a login field, not a password field
Guys the experiments I had running didn't fail, they just waited and reconnected, this is crazy cool
With my hack, yes; without it, no