There is no need to add creds on the machine, since the EC2 instance has an attached IAM profile that grants access to S3. Boto3 is able to retrieve the files from the S3 bucket
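For reference, something like this minimal sketch is what I mean (bucket/key names are just placeholders); boto3 falls back to the instance profile when no credentials are configured:
import boto3

# No explicit credentials: boto3 picks them up from the EC2 instance profile
s3 = boto3.client("s3")
s3.download_file("my-bucket", "path/to/artifact.pkl", "/tmp/artifact.pkl")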
then print(Task.get_project_object().default_output_destination)
is still the old value
no it doesn't! 3. They select any point that is an improvement over time
Same, it also returns a ProxyDictPostWrite, which is not supported by OmegaConf.create
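A workaround sketch I'm considering (assuming the proxy behaves like a nested dict; connected_config is a placeholder for whatever task.connect(...) returned): recursively copy it into plain dicts before handing it to OmegaConf.create
from omegaconf import OmegaConf

def to_plain_dict(obj):
    # Recursively convert proxy dicts/lists into plain Python containers
    if isinstance(obj, dict):
        return {k: to_plain_dict(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain_dict(v) for v in obj]
    return obj

cfg = OmegaConf.create(to_plain_dict(connected_config))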
To be fully transparent, I did a manual reindexing of the whole ES DB one year ago after it ran out of space; at that point I might have changed the mapping to strict, but I am not sure. Could you please confirm that the mapping is correct?
the deep learning AMI from nvidia (Ubuntu 18.04)
Yes, I would like to update all references to the old bucket unfortunately… I think I’ll simply delete the old S3 bucket, wait for its name to be available again, recreate it on the other AWS account and move the data there. This way I don’t have to mess with clearml data - I am afraid to do something wrong and lose data
Will it freeze/crash/break/stop the ongoing experiments?
Yes I did, I found the problem: docker-compose was using trains-server 0.15 because it didn't see the new version of trains-server. Hence I had trains-server 0.15 running with ES7.
-> I deleted all the containers and it successfully pulled trains-server 0.16. Now everything is running properly 🙂
So it looks like the agent, from time to time, thinks it is not running an experiment
if I want to resume a training on multiple GPUs, I will need to call this function on each process to send the weights to each GPU
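Something along these lines is what I have in mind (a rough PyTorch DDP sketch, names and paths are placeholders): each process loads the checkpoint and maps it onto its own GPU
import torch
import torch.distributed as dist

def load_weights_on_each_rank(model, checkpoint_path):
    # Called in every process after dist.init_process_group()
    local_rank = dist.get_rank() % torch.cuda.device_count()
    # map_location puts the tensors directly on this process's GPU
    state = torch.load(checkpoint_path, map_location=f"cuda:{local_rank}")
    model.load_state_dict(state)
    return model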
Yes, it would be very valuable to be able to tweak that param. Currently it's quite annoying because it's set to 30 mins: when a worker is killed by the autoscaler, I have to wait 30 mins before the autoscaler spins up a new machine, because it thinks there are already enough agents available, while in reality the agent is down
mmmmh I just restarted the experiment and it seems to work now. I am not sure why that happened. From this SO answer, it could be related to the size of the repo. Might be a good idea to clone with --depth 1
in the agents?
Or more generally, try to catch this error and retry a few times?
AgitatedDove14 It was only on comparison as far as I remember
AgitatedDove14 In theory yes there is no downside, in practice running an app inside docker inside a VM might introduce slowdowns. I guess it’s on me to check whether this slowdown is negligible or not
Note: I can verify that post_packages is well picked up by the trains-agent, since in the experiment log I see:
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.post_packages.0 = PyJWT==1.7.1
Yes, I stayed with an older version for a compatibility reason I cannot remember now 😄 - just tested with 1.1.2 and it’s the same
I tried specifying the bucket directly in my clearml.conf, same problem. I guess clearml just reads from the env vars first
Thanks for the help SuccessfulKoala55 , the problem was solved by updating the docker-compose file to the latest version in the repo: https://github.com/allegroai/clearml-server/blob/master/docker/docker-compose.yml
Make sure to do docker-compose down and then docker-compose up -d afterwards, not docker-compose restart
AgitatedDove14 WOW, thanks a lot! I will dig into that 🚀
I ended up dropping omegaconf altogether
Hi AgitatedDove14 , sorry somehow this message got lost 😄
clearml version is the latest at the time, 1.7.1
Yes, I always see the "model uploaded completed" for such stuck tasks. I am using Python 3.8.10
oh seems like it is not synced, thank you for noticing (it will be taken care of immediately)
Thank you!
does not contain a specific wheel for cuda117 for x86, they use the pip default one
Yes, so indeed they don't provide support for earlier CUDA versions on the latest torch versions. But I should still be able to install torch==1.11.0+cu115 even if I have cu117; that is what the clearml-agent was doing before
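Quick way I'd double-check which build actually got installed (the version strings in the comments are just what I'd expect, not actual output):
import torch

print(torch.__version__)          # e.g. '1.11.0+cu115'
print(torch.version.cuda)         # CUDA toolkit the wheel was built against, e.g. '11.5'
print(torch.cuda.is_available())  # a newer driver (cu117) can still run the cu115 wheel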
my agents are all .16 and I install trains 0.16rc2 in each Task being executed by the agent
yes, the only thing I changed is:
install_requires=[
    ...
    "my-dep @ git+
]
to:
install_requires=[
    ...
    "git+
"]
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
If I manually call report_matplotlib_figure, yes. If I don't (just create the figure), no mem leak
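Something along these lines reproduces what I mean (a minimal sketch, not my exact code; the leak only shows up with the explicit report call):
import matplotlib.pyplot as plt
from clearml import Logger

for i in range(1000):
    fig = plt.figure()
    plt.plot(range(10))
    # Commenting out the report call below keeps memory usage flat
    Logger.current_logger().report_matplotlib_figure(
        title="debug", series="leak", iteration=i, figure=fig
    )
    plt.close(fig)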
AgitatedDove14 I see that the default is sample_frequency_per_sec=2., but in the UI I see that there isn't such resolution (i.e. it logs every ~120 iterations, corresponding to ~30 secs). What is the difference with report_frequency_sec=30.?
I have a custom way of reading the config file