Eureka! It worked like a charm! Awesome, thanks AgitatedDove14!
TimelyPenguin76 That sounds amazing! Will there be a fallback mechanism as well? p3.2xlarge instances are often in short supply, so it would be nice to define one resource requirement as a first choice (e.g. p3.2xlarge) -> if not available -> use another resource requirement (e.g. g4dn)
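Something like this is what I have in mind, just a sketch of the fallback logic with plain boto3 (the instance types and function are hypothetical, not the actual autoscaler code):
```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical preference list; a real autoscaler would take this from its config
INSTANCE_TYPES = ["p3.2xlarge", "g4dn.xlarge"]

def launch_with_fallback(image_id: str) -> str:
    """Try each instance type in order, falling back on capacity errors."""
    ec2 = boto3.client("ec2")
    for instance_type in INSTANCE_TYPES:
        try:
            resp = ec2.run_instances(
                ImageId=image_id,
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as e:
            # Only fall through to the next type when AWS is out of capacity
            if e.response["Error"]["Code"] != "InsufficientInstanceCapacity":
                raise
    raise RuntimeError("No instance type in the fallback list is available")
```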
That would be amazing!
This allows me to inject YAML files into other YAML files
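For anyone reading along, one way to get this kind of injection with plain PyYAML is a custom `!include` tag (a minimal sketch, not necessarily the mechanism referred to above; file names are illustrative):
```python
import os
import yaml

class IncludeLoader(yaml.SafeLoader):
    """SafeLoader with a custom !include tag that parses another YAML file in place."""
    pass

def _include(loader, node):
    # Resolve the included path relative to the including file
    base = os.path.dirname(loader.name)
    path = os.path.join(base, loader.construct_scalar(node))
    with open(path) as f:
        return yaml.load(f, IncludeLoader)

IncludeLoader.add_constructor("!include", _include)

with open("main.yaml") as f:
    config = yaml.load(f, IncludeLoader)
```
With that, `main.yaml` can contain e.g. `section: !include other.yaml` and the nested file is parsed where the tag appears.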
the api-server shows when starting:
clearml-apiserver | [2021-07-13 11:09:34,552] [9] [INFO] [clearml.es_factory] Using override elastic host
clearml-apiserver | [2021-07-13 11:09:34,552] [9] [INFO] [clearml.es_factory] Using override elastic port 9200
...
clearml-apiserver | [2021-07-13 11:09:38,407] [9] [WARNING] [clearml.initialize] Could not connect to ElasticSearch Service. Retry 1 of 4. Waiting for 30sec
clearml-apiserver | [2021-07-13 11:10:08,414] [9] [WARNING] [clearml.initia...
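To narrow this down, I can probe Elasticsearch directly from inside the same docker network (assuming the default `elasticsearch` service name and port from the docker-compose file, which may differ in other setups):
```python
import requests

# Probe the Elasticsearch cluster health endpoint directly; if this fails,
# the apiserver retries above are a symptom, not the cause
resp = requests.get("http://elasticsearch:9200/_cluster/health", timeout=5)
print(resp.status_code, resp.json())
```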
Just tested locally; in the terminal it's the same: with the hack it works, without the hack it doesn't show the logger messages
@<1537605940121964544:profile|EnthusiasticShrimp49> I'll try setting the CUDA version in clearml.conf, thanks for the tip!
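(For reference, if I'm reading the docs right, the relevant entry is `agent.cuda_version` in clearml.conf, e.g. `agent.cuda_version = "11.7"`.)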
@<1523701205467926528:profile|AgitatedDove14> Could you please push the code for that version to GitHub?
Thanks @<1523701087100473344:profile|SuccessfulKoala55>! Do alive workers send a ping to notify the server that they are alive, or does the server infer that they are alive based on the last communication?
Looks like it's a hurray then!
It could be: I am running the ClearML AWS autoscaler in an EC2 instance with IAM roles allowing for creating/deleting instances, but I get:
Warning! exception occurred: An error occurred (UnauthorizedOperation) when calling the RunInstances operation: You are not authorized to perform this operation. Encoded authorization failure message: ...
I suspect that since the agent is running in docker mode, the boto3 lib doesn't automatically get the right permissions from the EC2 instance. To...
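One way to check that theory: a small sketch that asks boto3 which credentials it actually resolved, run from inside the agent's docker container (nothing assumed beyond boto3 itself):
```python
import boto3

session = boto3.Session()
creds = session.get_credentials()
# "iam-role" means the instance-profile credentials were picked up;
# anything else means the container is not seeing the instance metadata
print("credential source:", creds.method if creds else None)
print("caller identity:", boto3.client("sts").get_caller_identity()["Arn"])
```
If the container cannot reach the instance metadata endpoint at all, docker networking or the IMDSv2 hop limit are the usual suspects.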
Ok, this I cannot locate
Interestingly, it works on one machine but not on another
Yeah, I really need that feature; I need to move away from key/secret credentials to IAM roles
but post_packages does not reinstall version 1.7.1
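(In case it matters, the option I mean is `agent.package_manager.post_packages` in clearml.conf, the list of packages the agent installs after the main requirements; if I understand it correctly a pin like `post_packages: ["torch==1.7.1"]` should be accepted there.)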
yes, so it does exit the local process (at least, the command returns), but another process is still running in the background and logs things from time to time, such as:
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
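For completeness, this is what I'd expect to shut that background reporter down, a sketch using the public SDK (not claiming this is the intended fix):
```python
from clearml import Task

task = Task.current_task()
if task is not None:
    # close() flushes outstanding reports and stops the background
    # monitor/reporting threads for this task
    task.close()
```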
with 1.1.1 I get:
User aborted: stopping task (3)
So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117. It just happens that this wheel surprisingly doesn't work on EC2 g5 instances. Either I'll hardcode the correct wheel or I'll upgrade torch to 1.13.0
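If I go the pinning route, something like this before Task.init should do it (a sketch using the public SDK; project/task names are placeholders):
```python
from clearml import Task

# Pin torch explicitly so the agent resolves this exact version
# instead of auto-detecting one from the local environment
Task.add_requirements("torch", "==1.13.0")
task = Task.init(project_name="my-project", task_name="g5-run")
```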
SuccessfulKoala55 I want to avoid writing credentials in plain text in the config file
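(For the ClearML credentials themselves, environment variables such as `CLEARML_API_ACCESS_KEY` and `CLEARML_API_SECRET_KEY` can override the config file, if I'm not mistaken; it's the AWS side that really needs the IAM-role path.)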
yea I just realized that you would also need to specify different subnets, etc... not sure how easy it is. But it would be very valuable; on-demand GPU instances are so hard to spin up in AWS nowadays
SuccessfulKoala55 Here is the trains-elastic error
The reindexing operation showed no errors and copied everything
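(By reindexing I mean the standard Elasticsearch `_reindex` API; a sketch of the kind of call, with placeholder index names rather than the actual ones:)
```python
import requests

# Standard Elasticsearch reindex API call; index names are placeholders
resp = requests.post(
    "http://localhost:9200/_reindex",
    json={"source": {"index": "old_index"}, "dest": {"index": "new_index"}},
    timeout=600,
)
print(resp.json())
```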
The workaround I could find for now is to add the following to CONTAINER > SETUP SHELL SCRIPT:
mkdir -p ~/git/credential
chmod 0700 ~/git/credential
git config --global credential.helper 'cache --socket ~/git/credential/socket'
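(The cache helper keeps the credentials in memory behind that socket rather than on disk, which is why the chmod 0700 on the socket directory matters.)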