Can you send me your configurations? I want to make sure there's nothing we're missing there.
(without the actual keys and secrets of course)
ok that's odd.
Anyway try settingextra_configurations = {"SubnetId": "<subnet-id>"}
instead of:extra_configurations = {'SubnetId': "<subnet-id>"}
Hey SubstantialElk6 ,
I'm assuming you are referring to our helm charts?
If so, then you can set agent.dockerMode
to false ( https://github.com/allegroai/clearml-server-k8s/blob/master/clearml-server-chart/values.yaml#L46 ), and then the docker.sock
configuration will be turned off. Note that this means that your agents will not be running on docker mode 🙂
So the issue was probably the clearml-agent version.
Please try using clearml-agent==0.17.2rc3 and let us know if this solved the issue.
Hey SubstantialElk6 ,
This issue was fixed in the latest clearml-agent version.
Please try using v0.17.2 🙂
I'd suggest that you try what AgitatedDove14 suggested https://clearml.slack.com/archives/CTK20V944/p1614540843119600?thread_ts=1613923591.002100&cid=CTK20V944 . It seems like you're using an older version of the agent somehow.
I think using the glue could be a good solution for you, so it seems like a good idea to try and get it to work.
If you want the agent to run in docker mode, the docker.sock should be exposed. But that's the only reason for this configuration.
We will fix it and remove the deprecated flags.
In any case it shouldn't cause issues with your tasks.. Is it running?
What do you mean by ' not taking effect with the k8s glue '?
Can you try setting the base_docker_image of the specific task you are running to nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true
?
To do so go to the task's execution tab, scroll down and set the base docker section to the above.
Again, assuming you are referring to the helm charts. How are you deploying ClearML?
Good, are we sure that the problem is that the variable isn't set?
Can you please use kubectl describe pod <task-pod-name>
and send me the output?
Seems like the env variable isn't passed for some reason, we'll push a fix for this issue soon, I'll keep you posted 🙂
To check, go to the experiment's page and then to EXECUTIONÂ >Â AGENT CONFIGURATIONÂ >Â BASE DOCKER IMAGE
If it's set to any value, clearing it would solve your problem.
ColossalAnt7 can you try connecting to one of the trains-agent pods and run trains-agent manually using the following command:TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version
Then let us know what happens and if you see the new worker it in the UI
When you open the UI, do you see any projects there?
That's great, from that I understand that the trains-services worker does appear in the UI, is it correct? Did the task run? Did you change the trainsApiHost
under agentservices
in the values.yaml?
Great, let us know how it goes.
Have a great weekend!
Let me know if this solves your problem
Hey LovelyHamster1 ,
This means that for some reason the agent on the instances created fails to run and the instance is terminated.
The credentials could definatly cause that.
Can you try adding the credentials as they appear in your clearml.conf?
To do so, create new credentials from your profile page in the UI, and add the entire section to the extra_trains_conf
section in the following way:
` extra_trains_conf = """
api {
web_server: "<webserver>"
api_server: "<apiserver>"
...
Great.
Note that instead of removing those lines you can override it using the extra_vm_bash_script
For example:extra_vm_bash_script = """ export CLEARML_API_HOST=<api_server> export CLEARML_WEB_HOST=<web_server> export CLEARML_FILES_HOST=<files_server> """
Hey GreasyPenguin14 ,
The docker-compose.yml and this section specifically were updated.
So first please try again with the new version 🙂
Second - this error seems a bit odd, which version of docker-compose are you using?
You can check this using: docker-compose --version
Or - which api-server the UI is actually connecting to? 🙂
Hey LovelyHamster1 ,
If s3 is what you're interested of, then the above would do the trick.
Note that you can attach the IAM using instance profiles. You can read about those here:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html
Once you have an instance profile, you can add it to the autoscaler using the extra_configurations
section in the autoscaler.
Under your resource_configurations
-> some resource name
-> add an ...
By the way, are you editing the values directly? Why not use the values file?
That's the agent-services one, can you check the agent's one?
Can you try removing the port from the webhost?
Did you change anything under the agent's value?
In case you didn't - please try editing the agent.clearmlWebHost
and set it to the value of your webserver (use the same one you used for the agent services).
This might solve your issue.
I understand, but for some reason you are getting an error about the clearml webserver. try changing the value in the values.yaml file for the agent.clearmlWebHost to the same value you filled manually for the agent-services Web host