Thanks for the hack! The use case is the following: I have a controller that creates training/validation/testing tasks by cloning (so that the parent task id is properly set to the controller). Otherwise I could simply create these tasks with Task.init, but then I would need to manually set the parent task for each one of these tasks, probably with a similar hack, right?
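For readers following along, the cloning approach described above could look roughly like the sketch below. It assumes `Task.clone` accepts a `parent` task id, as in current ClearML SDK versions. The helper name `clone_stage_tasks`, the template task id and the stage names are hypothetical and not from the thread.

```python
from typing import Any, Dict, List


def clone_stage_tasks(template_task_id: str,
                      controller_task_id: str,
                      stages: List[str]) -> Dict[str, Any]:
    """Clone one template task per stage, with the controller set as parent.

    Hypothetical helper: the ids and stage names are placeholders.
    """
    # Imported lazily so this module can be loaded without a configured server.
    from clearml import Task

    clones: Dict[str, Any] = {}
    for stage in stages:
        # Task.clone copies the template; `parent` links the clone to the controller.
        clones[stage] = Task.clone(
            source_task=template_task_id,
            name=f"{stage} (cloned by controller)",
            parent=controller_task_id,
        )
    return clones
```

Inside the controller this would be called as something like `clone_stage_tasks(template_id, controller.id, ["training", "validation", "testing"])`, and the returned tasks would then be enqueued separately.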
This is no coincidence - any data versioning tool you will find is somehow close to how git works (dvc, etc.) since they aim to solve a similar problem. In the end, datasets are just files.
Where clearml-data stands out imo is the straightforward CLI combined with the Pythonic API that allows you to register/retrieve datasets very easily
AgitatedDove14 I made some progress:
In clearml.conf of the agent, I set: sdk.development.report_use_subprocess = false (because I had the feeling that Task._report_subprocess_enabled = False wasn’t taken into account)
I’ve set task.set_initial_iteration(0)
Now I was able to get the following graph after resuming -
Yes, it would be very valuable to be able to tweak that param. Currently it's quite annoying because it's set to 30 mins: when a worker is killed by the autoscaler, I have to wait 30 mins before the autoscaler spins up a new machine, because the autoscaler thinks there are already enough agents available while in reality the agent is down
I ended up dropping omegaconf altogether
But we can easily extend, right?
haa got it, I am on a self hosted server, that’s why I don’t see it
Now I am trying to restart the cluster with docker-compose and specifying the last volume, how can I do that?
Which commit corresponds to RC version? So far we tested with latest commit on master (9a7850b23d2b0e1f2098ab051de58ce806143fff)
Alright, experiment finished properly (all models uploaded). I will restart it to check again, but seems like the bug was introduced after that
Hi SmugDolphin23 thanks for the input! Will try now but that seems hacky: to have it working I have to specify python3.8 two times:
one in the agent config file (agent.default_python is already python3.8, but seems to be ignored) + make sure it is available (using python:3.8 docker image). Is there a way to prevent this redundancy? I.e., if I want to change the python version, can I control it from a single place?
But that was too complicated, I found an easier approach
I asked this question some time ago. I think this is just not implemented, but it shouldn’t be difficult to add? I am also interested in such a feature!
Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion
Sorry, it's actually task.update_requirements(["."])
SuccessfulKoala55 Am I doing/saying something wrong regarding the problem of flushing every 5 secs? (See my previous message)
Thanks SuccessfulKoala55 !
Maybe you could add to your docker-compose file an option for limiting the size of the logs? Since there is no limit by default, their size will grow forever, which doesn't sound ideal: https://docs.docker.com/compose/compose-file/#logging
Yes, it did spin two instances for the same task
We would be super happy to have the possibility of documenting experiments (new tab in experiments UI) with a markdown editor!
Does the agent install the nvidia-container toolkit, so that GPUs of the instance can be accessed from inside the docker running jupyterlab?
I understand, but then why is the docker mode an option of the CLI if we always have to use it so that it works?
yea I just realized that you would also need to specify different subnets, etc… not sure how easy it is 😞 But it would be very valuable, on-demand GPU instances are so hard to spin up nowadays in aws 😄
The main issue is task_logger.report_scalar() not reporting the scalars
Sure yes! As you can see I just added the block
logging:
  driver: "json-file"
  options:
    max-size: "200k"
    max-file: "10"
to all services. Also in this docker-compose I removed the external binding of the ports for mongo/redis/es
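As an illustration only, the two changes described above (the same logging block on every service, no published ports for the internal databases) can be sketched as a small transformation over an already-parsed compose file. The function name `harden_compose`, the service names and the sample ports are assumptions for the example, not taken from the actual docker-compose file.

```python
import copy
from typing import Any, Dict, Iterable

# Logging limits quoted in the message above.
LOGGING_BLOCK: Dict[str, Any] = {
    "driver": "json-file",
    "options": {"max-size": "200k", "max-file": "10"},
}


def harden_compose(
    compose: Dict[str, Any],
    internal_services: Iterable[str] = ("mongo", "redis", "elasticsearch"),
) -> Dict[str, Any]:
    """Return a copy of a parsed compose dict with log limits on every service
    and no host port bindings on the internal services (names are assumed)."""
    internal = set(internal_services)
    result = copy.deepcopy(compose)
    for name, service in result.get("services", {}).items():
        # Every service gets its own copy of the logging block.
        service["logging"] = copy.deepcopy(LOGGING_BLOCK)
        if name in internal:
            # Dropping "ports" removes the external binding only.
            service.pop("ports", None)
    return result


# Hypothetical, heavily trimmed compose structure for demonstration.
example = {
    "services": {
        "apiserver": {"ports": ["8008:8008"]},
        "mongo": {"ports": ["27017:27017"]},
        "redis": {"ports": ["6379:6379"]},
    }
}
hardened = harden_compose(example)
```

The input dict is left untouched, so the original and the hardened versions can be diffed before writing anything back to disk.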