Usually one or two tags, indeed, task ids are not so convenient, but only because they are not displayed in the page, so I have to go back to another page to check the ID of each experiment. Maybe just showing the ID of each experiment in the SCALAR page would already be great, wdyt?
but most likely I need to update the perms of /data as well
There is no need to add creds on the machine, since the EC2 instance has an attached IAM profile that grants access to S3. Boto3 is able to retrieve the files from the S3 bucket
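For context, this is roughly what I mean with boto3 (a sketch; bucket and key names are placeholders):

```python
import boto3

# No access key / secret passed: boto3 walks its credentials chain and,
# on EC2, ends up using the instance profile via the metadata service.
s3 = boto3.client("s3")
s3.download_file("my-bucket", "path/to/artifact.bin", "/tmp/artifact.bin")
```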
then print(Task.get_project_object().default_output_destination) is still the old value
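i.e. roughly this check (a sketch; project/task names are placeholders and assume a task already exists):

```python
from clearml import Task

task = Task.get_task(project_name="my_project", task_name="my_task")
# Read back the project's default output destination to see whether the
# update actually took effect.
print(task.get_project_object().default_output_destination)
```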
no it doesn't! 3. They select any point that is an improvement over time
Same, it also returns a ProxyDictPostWrite, which is not supported by OmegaConf.create
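A possible workaround (just a sketch, not ClearML API: the recursive conversion below is my own assumption) is to turn the proxy into plain dicts before handing it to OmegaConf:

```python
from omegaconf import OmegaConf

def to_plain_dict(obj):
    # Rebuild plain dicts/lists so OmegaConf never sees the proxy wrappers.
    if isinstance(obj, dict):
        return {k: to_plain_dict(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain_dict(v) for v in obj]
    return obj

# params = task.connect(my_config)   # returns a ProxyDictPostWrite
# cfg = OmegaConf.create(to_plain_dict(params))
```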
To be fully transparent, I did a manual reindexing of the whole ES DB one year ago after it ran out of space; at that point I might have changed the mapping to strict, but I am not sure. Could you please confirm that the mapping is correct?
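If it helps, one way I could check it is to dump the current mappings straight from Elasticsearch (a sketch; host and port are assumptions, adjust to the actual deployment):

```python
import json
import requests

# Fetch the mapping of every index and look for "dynamic": "strict"
# in the indices of interest.
resp = requests.get("http://localhost:9200/_mapping")
print(json.dumps(resp.json(), indent=2))
```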
the deep learning AMI from nvidia (Ubuntu 18.04)
Yes, I would like to update all references to the old bucket unfortunately… I think I'll simply delete the old S3 bucket, wait for its name to be available again, recreate it on the other AWS account and move the data there. This way I don't have to mess with clearml data - I am afraid to do something wrong and lose data
Will it freeze/crash/break/stop the ongoing experiments?
Yes I did, I found the problem: docker-compose was using trains-server 0.15 because it didn't see the new version of trains-server. Hence I had trains-server 0.15 running with ES7.
-> I deleted all the containers and it successfully pulled trains-server 0.16. Now everything is running properly.
I am already trying with the latest version of pip
So it looks like the agent, from time to time, thinks it is not running an experiment
if I want to resume training on multiple GPUs, I will need to call this function on each process to send the weights to each GPU
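Something like this is what I have in mind (a sketch assuming PyTorch DDP; the checkpoint path and state dict layout are placeholders):

```python
import torch

def load_checkpoint_on_rank(model, checkpoint_path, local_rank):
    # Map the weights saved from cuda:0 onto this process's own GPU so that
    # every rank holds its own copy before DDP wraps the model.
    map_location = {"cuda:0": f"cuda:{local_rank}"}
    state = torch.load(checkpoint_path, map_location=map_location)
    model.load_state_dict(state["model"])
    return model
```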
Yes, it would be very valuable to be able to tweak that parameter. Currently it's quite annoying because it's set to 30 minutes, so when a worker is killed by the autoscaler, I have to wait 30 minutes before the autoscaler spins up a new machine: the autoscaler thinks there are already enough agents available, while in reality the agent is down
mmmmh I just restarted the experiment and it seems to work now. I am not sure why that happened. From this SO post it could be related to the size of the repo. Might be a good idea to clone with --depth 1
in the agents?
Or more generally, try to catch this error and retry a few times?
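Something along these lines, for example (a sketch; the repo URL, destination and retry counts are placeholders):

```python
import subprocess
import time

def shallow_clone(repo_url, dest, retries=3, wait=5):
    # Shallow-clone the repo and retry a few times on transient failures.
    for attempt in range(1, retries + 1):
        try:
            subprocess.run(
                ["git", "clone", "--depth", "1", repo_url, dest],
                check=True,
            )
            return
        except subprocess.CalledProcessError:
            if attempt == retries:
                raise
            time.sleep(wait)  # back off briefly before retrying
```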
And since I ran the task locally with Python 3.9, it used that version in the Docker container
Oh nice, thanks for pointing this out!
and then call task.connect_configuration probably
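i.e. something like this (project/task names and the config contents are placeholders):

```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")
config = {"lr": 0.001, "batch_size": 32}
# Registers the dict as a configuration object on the task; when executed
# remotely, the values edited in the UI are returned instead.
config = task.connect_configuration(config, name="training_config")
```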
I am confused now because I see that in the master branch, the clearml.conf file has the following section:
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
So it states that IAM role using metadata service should be supported, right?
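If so, I guess enabling it would look roughly like this in clearml.conf (a sketch; I'm assuming the setting lives under sdk.aws.s3 as in the default config):

```
sdk {
    aws {
        s3 {
            # Let Boto3 resolve credentials itself: env vars, credential
            # file, or the instance IAM role via the metadata service.
            use_credentials_chain: true
        }
    }
}
```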
Thanks a lot for the solution SuccessfulKoala55! I'll try that if the solution "delete old bucket, wait for its name to be available, recreate it with the other aws account, transfer the data back" fails
I can ssh into the agent and:
source /trains-agent-venv/bin/activate
(trains_agent_venv) pip show pyjwt
Version: 1.7.1
Thanks for the explanations,
Yes that was the case. This is also what I would think, although I double checked yesterday:
1. I create a task on my local machine with trains 0.16.2rc0
2. This task calls task.execute_remotely()
3. The task is sent to an agent running with 0.16
4. The agent installs trains 0.16.2rc0
5. The agent runs the task, clones it and enqueues the cloned task
6. The cloned task fails because it has no hyper-parameters/args section (I can see that in the UI)
7. When I clone the task manually usin...
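For reference, the flow on my local machine is essentially this (queue name and project/task names are placeholders):

```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")
params = {"lr": 0.001}
task.connect(params)  # the args section the cloned task ends up missing
# Stop local execution and enqueue the task for the agent.
task.execute_remotely(queue_name="default", exit_process=True)
```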
should I try to roll back to clearml-server 1.0.2? I am very anxious now…
it actually looks like I don't need such a high number of files opened at the same time
SuccessfulKoala55 I want to avoid writing creds in plain text in the config file