I created a snapshot of both disks
same as the first one described
CostlyOstrich36, this also happens with clearml-agent 1.1.1 on an AWS instance…
Yeah, I assume that training my models using Docker will be slightly slower, so I'd like to avoid it. Apart from that, using Docker is convenient.
You mean "docker" was not installed and it did not throw an error?
Yes, Docker was not installed on the machine.
Yes, you must make sure Docker can mount a persistent folder for you to work in.
OK, it would be nice to have a --user-folder-mounted flag that does the linking automatically.
The workaround I could find for now is to add the following to CONTAINER > SETUP SHELL SCRIPT:
mkdir -p ~/git/credential
chmod 0700 ~/git/credential
git config --global credential.helper 'cache --socket ~/git/credential/socket'
Alright, thanks for the answer! Seems legit then.
Thanks for the hack! The use case is the following: I have a controller that creates training/validation/testing tasks by cloning (so that the parent task ID is properly set to the controller). Otherwise I could simply create these tasks with Task.init, but then I would need to set the parent task manually for each one of them, probably with a similar hack, right?
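As a minimal sketch of that cloning pattern (the project/task/queue names here are assumptions for illustration), Task.clone accepts a parent argument, so the clone's parent can be pointed at the controller explicitly:

from clearml import Task

# Controller task that owns the training/validation/testing children
controller = Task.init(project_name="examples", task_name="controller")

# Existing template task to clone (name is hypothetical)
template = Task.get_task(project_name="examples", task_name="train-template")

# Cloning with parent=... sets the parent task ID on the new task
child = Task.clone(source_task=template, name="train-fold-0", parent=controller.id)
Task.enqueue(child, queue_name="default")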
This is no coincidence: any data versioning tool you will find is somewhat close to how git works (DVC, etc.), since they all aim to solve a similar problem. In the end, datasets are just files.
Where clearml-data stands out, imo, is the straightforward CLI combined with the Pythonic API that lets you register/retrieve datasets very easily.
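To illustrate (dataset and project names are assumptions), the register/retrieve flow with the Python API looks roughly like this:

from clearml import Dataset

# Register a new dataset version
ds = Dataset.create(dataset_name="my-dataset", dataset_project="examples")
ds.add_files(path="data/")   # stage local files
ds.upload()                  # push the files to the configured storage
ds.finalize()                # freeze this version

# Retrieve it later, from any machine
local_path = Dataset.get(dataset_name="my-dataset", dataset_project="examples").get_local_copy()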
AgitatedDove14 I made some progress:
In the agent's clearml.conf, I set sdk.development.report_use_subprocess = false
(because I had the feeling that Task._report_subprocess_enabled = False wasn't taken into account). I've also set task.set_initial_iteration(0).
Now I was able to get the following graph after resuming -
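For reference, a minimal sketch of the resume pattern described above (the project/task names are assumptions; continue_last_task resumes the previous run instead of creating a new task):

from clearml import Task

task = Task.init(project_name="examples", task_name="training",
                 continue_last_task=True)
# Report iterations from 0 instead of offsetting by the previous run
task.set_initial_iteration(0)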
Yes, it would be very valuable to be able to tweak that parameter. Currently it's quite annoying because it's set to 30 minutes: when a worker is killed by the autoscaler, I have to wait 30 minutes before the autoscaler spins up a new machine, because it thinks there are already enough agents available while in reality the agent is down.
I ended up dropping omegaconf altogether
But we can easily extend, right?
Ah, got it. I am on a self-hosted server, that's why I don't see it.
Now I am trying to restart the cluster with docker-compose while specifying the last volume. How can I do that?
Which commit corresponds to the RC version? So far we tested with the latest commit on master (9a7850b23d2b0e1f2098ab051de58ce806143fff).
Alright, the experiment finished properly (all models uploaded). I will restart it to check again, but it seems like the bug was introduced after that.
Hi SmugDolphin23, thanks for the input! I will try now, but that seems hacky: to have it working I have to specify python3.8 twice:
once in the agent config file (agent.default_python is already python3.8, but it seems to be ignored), plus making sure it is available (using the python:3.8 docker image).
Is there a way to prevent this redundancy? I.e., if I want to change the Python version, can I control it from a single place?
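For context, the two places in question, sketched as an agent-side clearml.conf snippet (using the values from this thread):

agent {
    # Python version the agent should use when building the environment
    default_python: "3.8"

    # Default image for docker mode, so that python3.8 is actually available
    default_docker {
        image: "python:3.8"
    }
}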
But that was too complicated, I found an easier approach
I asked this question some time ago; I think this is just not implemented, but it shouldn't be difficult to add? I am also interested in such a feature!
Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion
Sorry, it's actually task.update_requirements(["."])
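For context, a minimal sketch of where that call sits (the project/task names are assumptions):

from clearml import Task

task = Task.init(project_name="examples", task_name="local-package")
# Record "." as the requirement so the agent installs the repo itself
task.update_requirements(["."])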
SuccessfulKoala55, am I doing/saying something wrong regarding the problem of flushing every 5 seconds? (See my previous message.)