Great, let us know how it goes.
Have a great weekend!
Hey JitteryCoyote63 !
Can you please update us what permissions did you end up using for the autoscaler?
Were the above enough?
Thanks!
Just making sure, you changed both the agent one and the agent-services one?
That's great - from that I understand that the trains-services worker does appear in the UI, is that correct? Did the task run? Did you change trainsApiHost under agentservices in the values.yaml?
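Just for reference, you can also set it without editing the file by passing it on the helm command line - a rough sketch, assuming the chart exposes agentservices.trainsApiHost and that the api-server service listens on port 8008 in-cluster (adjust the release, chart and service names to your setup):
helm upgrade <release-name> <chart> --set agentservices.trainsApiHost=http://<apiserver-service>:8008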
Or - which api-server the UI is actually connecting to? 🙂
Hey GreasyPenguin14 ,
The docker-compose.yml and this section specifically were updated.
So first please try again with the new version 🙂
Second - this error seems a bit odd, which version of docker-compose are you using?
You can check this using: docker-compose --version
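If it turns out to be an old version, upgrading is usually just a matter of downloading a newer binary - a sketch assuming Linux; pick whichever release you want from the docker/compose GitHub releases page (1.29.2 here is just an example):
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version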
ColossalAnt7 can you try connecting to one of the trains-agent pods and running trains-agent manually using the following command:
TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version
Then let us know what happens and if you see the new worker in the UI.
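In case it helps, connecting to the pod is just a kubectl exec - a rough sketch, assuming the agent pods live in a namespace called trains and have bash available (adjust to your deployment):
kubectl get pods -n trains    # find the trains-agent pod name
kubectl exec -it <trains-agent-pod-name> -n trains -- /bin/bash
# then, inside the pod, run the command above:
TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version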
When you open the UI, do you see any projects there?
Hey JitteryCoyote63 ,
Autoscaler was tested with full ec2 permissions.
I believe you only need the following:
ec2:StartInstances
ec2:StopInstances
ec2:DescribeInstances
But there might be some others we're missing.
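If it helps, this is roughly how those permissions look as an IAM policy created with the AWS CLI - just a sketch with the actions listed above (the policy name is arbitrary, and as said you may need to add more actions):
cat > trains-autoscaler-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:StartInstances", "ec2:StopInstances", "ec2:DescribeInstances"],
      "Resource": "*"
    }
  ]
}
EOF
aws iam create-policy --policy-name trains-autoscaler --policy-document file://trains-autoscaler-policy.json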
WackyRabbit7 - I think you asked this question before, do you have some more input you can share here?
To check, go to the experiment's page and then to EXECUTION > AGENT CONFIGURATION > BASE DOCKER IMAGE.
If it's set to any value, clearing it would solve your problem.
Sure, ping me if it's still happening.
Hey LovelyHamster1 ,
Any chance the task you are trying to run has a base docker defined in it?
Hey SubstantialElk6 ,
I'm assuming you are referring to our helm charts?
If so, then you can set agent.dockerMode to false ( https://github.com/allegroai/clearml-server-k8s/blob/master/clearml-server-chart/values.yaml#L46 ), and the docker.sock configuration will be turned off. Note that this means your agents will not be running in docker mode 🙂
If you want the agent to run in docker mode, the docker.sock should be exposed - but that's the only reason for this configuration.
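For example, you can flip it without editing values.yaml by setting it on the helm command line - a sketch, with the release and chart names left as placeholders for your setup:
helm upgrade <release-name> <chart> --set agent.dockerMode=false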
I'd suggest that you try what AgitatedDove14 suggested https://clearml.slack.com/archives/CTK20V944/p1614540843119600?thread_ts=1613923591.002100&cid=CTK20V944 . It seems like you're using an older version of the agent somehow.
I think using the glue could be a good solution for you, so it seems like a good idea to try and get it to work.
If the configurations and hyper params still appear properly in the task, there's no need to rerun the wizard. Just make sure you're using the updated trains repo.
Subnet isn't supported as-is in the autoscaler, but you can add it using extra_configurations in the following way:
extra_configurations = {'SubnetId': <subnet-id>}
If it does appear in the UI faster, then it's only a matter of waiting. If you still don't see the instance, I'd suggest you ssh to the instance and investigate a bit what's going on.
Can you send me your configurations? I want to make sure there's nothing we're missing there.
(without the actual keys and secrets of course)
BTW, is there any specific reason for not upgrading to clearml? 🙂
Can you check which trains version appears under the package requirements for the autoscaler?
Also, can you send the entire log?
Hey WackyRabbit7 ,
Is this the only error you have there?
Can you verify the credentials in the task seem ok and that it didn't disappear as before?
Also, I understand that the Failed parsing task parameter ... warnings no longer appear, correct?
As an example you can ssh to it and try running trains-agent manually to see if it's installed and if it fails for some reason.
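A rough sketch of what I mean, assuming an Ubuntu-based AMI and that you have the key pair the autoscaler launches the instances with (the log path is the standard cloud-init one):
ssh -i <key-pair.pem> ubuntu@<instance-public-ip>
which trains-agent && trains-agent --version    # is it installed at all?
tail -n 100 /var/log/cloud-init-output.log      # did the startup script fail somewhere?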
Seems like the env variable isn't passed for some reason, we'll push a fix for this issue soon, I'll keep you posted 🙂
Can you try setting the base_docker_image of the specific task you are running to nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true ?
To do so, go to the task's execution tab, scroll down and set the base docker section to the above.
Good, are we sure that the problem is that the variable isn't set?
Can you please use kubectl describe pod <task-pod-name> and send me the output?
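If you want to zoom in on the environment variables specifically, something like this can help - assuming the task pod has a single container and runs in the trains namespace (adjust as needed):
kubectl describe pod <task-pod-name> -n trains | grep -A 15 'Environment:'
kubectl get pod <task-pod-name> -n trains -o jsonpath='{.spec.containers[0].env}'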
Great.
Note that instead of removing those lines, you can override them using the extra_vm_bash_script.
For example:
extra_vm_bash_script = """
export CLEARML_API_HOST=<api_server>
export CLEARML_WEB_HOST=<web_server>
export CLEARML_FILES_HOST=<files_server>
"""