But that was too complicated, I found an easier approach
I asked this question some time ago. I think this is just not implemented, but it shouldn't be difficult to add? I am also interested in such a feature!
Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion
Sorry, it's actually task.update_requirements(["."])
SuccessfulKoala55 Am I doing/saying something wrong regarding the problem of flushing every 5 secs (see my previous message)?
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file for limiting the size of the logs? There is no limit by default, so their size will grow forever, which doesn't sound ideal: https://docs.docker.com/compose/compose-file/#logging
We would be super happy to have the possibility of documenting experiments (new tab in experiments UI) with a markdown editor!
Does the agent install the nvidia-container toolkit, so that GPUs of the instance can be accessed from inside the docker running jupyterlab?
I understand, but then why is docker mode an option of the CLI if we always have to use it for things to work?
yea I just realized that you would also need to specify different subnets, etc… Not sure how easy it is. But it would be very valuable: on-demand GPU instances are so hard to spin up nowadays in AWS
The main issue is that task_logger.report_scalar() is not reporting the scalars
Sure yes! As you can see, I just added the following block to all services:

    logging:
      driver: "json-file"
      options:
        max-size: "200k"
        max-file: "10"

Also, in this docker-compose I removed the external binding of the ports for mongo/redis/es
Interesting! Something like that would be cool yes! I just realized that custom plugins in Mattermost are written in Go, could be a good hackday for me to learn Go
AgitatedDove14 yes! I now realise that the ignite events callbacks seem to not be fired (I tried to print a debug message on a custom Events.ITERATION_COMPLETED) and I cannot see it logged
I opened an issue ( https://github.com/pytorch/ignite/issues/2343 ) in ignite's repo and a PR ( https://github.com/pytorch/ignite/pull/2344 ), could you please have a look? There might be a bug in clearml Task.init in distributed envs
And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
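To make the first of those two options concrete, here is a minimal sketch of creating the logger on the main process only. Whether that is the right choice is still the open question above. It assumes the launcher exports a `RANK` environment variable (as `torchrun` does); `get_rank`, `logger_for_this_process` and `make_logger` are hypothetical names, where `make_logger` stands for whatever builds and attaches the ClearMLLogger.

```python
import os


def get_rank(env=os.environ):
    # Assumption: the launcher exports RANK; default to 0 for single-process runs.
    return int(env.get("RANK", "0"))


def logger_for_this_process(make_logger, env=os.environ):
    """Create the experiment logger on rank 0 only; other ranks get None."""
    return make_logger() if get_rank(env) == 0 else None
```

Each process would call `logger_for_this_process(...)` and skip the logger setup when it gets `None` back.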
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer and scheduler every epoch.
2. Have a separate thread that periodically pulls the instance metadata and checks whether the instance is marked for stop; if so, add a custom tag, e.g. TO_RESUME.
3. Have a service that periodically pulls failed experiments with the TO_RESUME tag from the queue, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra param.
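The polling thread described above could be sketched roughly as follows. This is not the actual glue code, only an illustration under assumptions: it assumes AWS spot instances and the IMDSv1 `spot/instance-action` metadata path, which answers 404 until an interruption is scheduled (instances that enforce IMDSv2 would also need a session token). `marked_for_stop` and `watch_for_stop` are hypothetical helper names.

```python
import threading
import urllib.error
import urllib.request

# Assumption: AWS spot instance, IMDSv1 reachable without a session token.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def marked_for_stop(url=SPOT_ACTION_URL, timeout=2.0):
    """Return True if the metadata endpoint reports a pending interruption."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # 404 (not marked) or endpoint unreachable: treat as "not marked".
        return False


def watch_for_stop(on_stop, check=marked_for_stop, interval=5.0):
    """Poll `check` in a daemon thread; call `on_stop` once when it returns True."""
    stop_event = threading.Event()

    def _loop():
        while not stop_event.is_set():
            if check():
                on_stop()
                return
            # Sleep between polls, waking early if the caller sets stop_event.
            stop_event.wait(interval)

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread, stop_event
```

In the training script, `on_stop` would be something like `lambda: task.add_tags(["TO_RESUME"])`, with `task` being the ClearML task of the experiment.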
Hi SuccessfulKoala55 , will I be able to update all references to the old s3 bucket using this command?
I had this problem before
thanks for clarifying! Maybe this could be clarified in the agent logs of the experiments with something like the following?

    agent.cuda_driver_version = ...
    agent.cuda_runtime_version = ...

AgitatedDove14 In theory yes there is no downside, in practice running an app inside docker inside a VM might introduce slowdowns. I guess it's on me to check whether this slowdown is negligible or not
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
You already fixed the problem with pyjwt in the newest version of clearml/clearml-agents, so all good