Can you check which trains version appears under the package requirements for the autoscaler?
Hey SubstantialElk6 ,
I'm assuming you are referring to our helm charts?
If so, then you can set agent.dockerMode to false ( https://github.com/allegroai/clearml-server-k8s/blob/master/clearml-server-chart/values.yaml#L46 ), and the docker.sock configuration will be turned off. Note that this means your agents will not be running in docker mode 🙂
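For reference, a minimal sketch of a values override for that (key names follow the values.yaml linked above, but double-check against your chart version):

agent:
  # disable docker mode, so the chart doesn't mount the host's docker.sock into the agent pods
  dockerMode: false

Then pass it with -f when installing/upgrading the chart.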
Hey JitteryCoyote63 ,
Autoscaler was tested with full ec2 permissions.
I believe you only need the following:
ec2:StartInstances, ec2:StopInstances, ec2:DescribeInstances
But there might be some others we're missing.
WackyRabbit7 - I think you asked this question before, do you have some more input you can share here?
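In the meantime, here's a rough sketch of a minimal IAM policy with the actions listed above plus ec2:RunInstances (which the autoscaler uses to spin instances up, as mentioned further down in this thread); treat it as a starting point rather than a verified list:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:StartInstances",
        "ec2:StopInstances",
        "ec2:DescribeInstances",
        "ec2:RunInstances"
      ],
      "Resource": "*"
    }
  ]
}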
OK, so first, since you have many installations in your bash script, it does make sense that installation takes a long time (note that the agent will only start running after all the installations are done).
For the sake of debugging, I'd suggest removing all the packages (other than the specific trains-agent you're using) and trying again. Add those packages to the task you are trying to run instead, and you should see the instance come up much faster.
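For example, one way to move the packages from the bash script to the task itself (a sketch using the trains SDK; the package names and versions here are just placeholders):

# call add_requirements before Task.init so the packages end up in the task's installed packages,
# and the agent installs them in the task's own environment instead of the instance bash script
from trains import Task

Task.add_requirements("pandas")
Task.add_requirements("scikit-learn", "0.24.1")

task = Task.init(project_name="examples", task_name="my task")

You can also just edit the installed packages of the (draft) task directly in the UI.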
I'd suggest that you try what AgitatedDove14 suggested https://clearml.slack.com/archives/CTK20V944/p1614540843119600?thread_ts=1613923591.002100&cid=CTK20V944 . It seems like you're using an older version of the agent somehow.
I think using the glue could be a good solution for you, so it seems like a good idea to try and get it to work.
ColossalAnt7 can you try connecting to one of the trains-agent pods and running trains-agent manually using the following command:
TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version
Then let us know what happens and if you see the new worker in the UI
Hey SubstantialElk6 ,
Can you show us the top output you get when using the template-yaml instead of overrides-yaml?
To check, go to the experiment's page and then to EXECUTION > AGENT CONFIGURATION > BASE DOCKER IMAGE
If it's set to any value, clearing it would solve your problem.
Hey LovelyHamster1 ,
Any chance the task you are trying to run has a base docker defined in it?
Subnet isn't supported as-is in the autoscaler, but you can add it using extra_configurations in the following way:
extra_configurations = {'SubnetId': <subnet-id>}
Hey LovelyHamster1 ,
This means that for some reason the agent on the created instances fails to run, and the instance is terminated.
The credentials could definitely cause that.
Can you try adding the credentials as they appear in your clearml.conf?
To do so, create new credentials from your profile page in the UI, and add the entire section to the extra_trains_conf
section in the following way:
extra_trains_conf = """
api {
    web_server: "<webserver>"
    api_server: "<apiserver>"
    ...
}
"""
Let me know if this solves your problem
Great.
Note that instead of removing those lines, you can override them using the extra_vm_bash_script
For example:
extra_vm_bash_script = """
export CLEARML_API_HOST=<api_server>
export CLEARML_WEB_HOST=<web_server>
export CLEARML_FILES_HOST=<files_server>
"""
Hey ColossalAnt7 ,
What version of trains-agent are you using?
You can try upgrading to the latest RC version, this issue should be fixed there:
pip install trains-agent==0.16.2rc1
Seems like the env variable isn't passed for some reason, we'll push a fix for this issue soon, I'll keep you posted 🙂
We will fix it and remove the deprecated flags.
In any case it shouldn't cause issues with your tasks. Is it running?
Sure, ping me if it's still happening.
Or - which api-server the UI is actually connecting to? 🙂
Sure, we're using RunInstances, you can see the call itself here: https://github.com/allegroai/trains/blob/master/trains/automation/aws_auto_scaler.py#L163
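For reference, roughly what that call looks like (a simplified sketch, not the exact autoscaler code; the values below are placeholders):

import boto3

# placeholder values, normally taken from the autoscaler's resource configuration
ami_id = "ami-0123456789abcdef0"
instance_type = "t3.medium"
extra_configurations = {"SubnetId": "subnet-0123456789abcdef0"}

ec2 = boto3.client("ec2", region_name="us-east-1")

# extra_configurations is merged into the RunInstances call,
# which is how settings like SubnetId get passed through
ec2.run_instances(
    ImageId=ami_id,
    InstanceType=instance_type,
    MinCount=1,
    MaxCount=1,
    **extra_configurations,
)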
As an example, you can SSH into it and try running trains-agent manually to see if it's installed and if it fails for some reason.
Those are different credentials.
You should have the AWS info under cloud_credentials_key, cloud_credentials_secret and cloud_credentials_region.
What's added to the extra_vm_bash_script are the trains key and secret from your profile page in the UI.
I suggest you use the wizard again to run the task; this will make sure all the data is where it should be.
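To make it concrete, a rough sketch of how the two sets of credentials end up in the autoscaler configuration (field names follow the ones above, values are placeholders; passing the trains key and secret through the agent's environment variables is just one common way to do it, assuming that's what your script does):

cloud_credentials_key = "<AWS_ACCESS_KEY_ID>"
cloud_credentials_secret = "<AWS_SECRET_ACCESS_KEY>"
cloud_credentials_region = "<aws-region>"

extra_vm_bash_script = """
export TRAINS_API_ACCESS_KEY=<trains_access_key_from_profile_page>
export TRAINS_API_SECRET_KEY=<trains_secret_key_from_profile_page>
"""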
I understand, but for some reason you are getting an error about the clearml webserver. Try changing the value of agent.clearmlWebHost in the values.yaml file to the same value you filled in manually for the agent-services web host.
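For example, something like this in the values.yaml (a sketch; replace the placeholder with your actual web server address):

agent:
  # should match the web host you set for agent-services
  clearmlWebHost: "http://<your-clearml-webserver-address>"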
Hey LovelyHamster1 ,
If S3 is what you're interested in, then the above would do the trick.
Note that you can attach the IAM role using instance profiles. You can read about those here:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html
Once you have an instance profile, you can add it to the autoscaler using the extra_configurations section in the autoscaler: under your resource_configurations -> some resource name -> add an ... (see the example below).
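For example (the profile name here is just a placeholder; IamInstanceProfile is the standard EC2 RunInstances parameter, and extra_configurations is passed straight into that call):
extra_configurations = {'IamInstanceProfile': {'Name': '<your-instance-profile-name>'}}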
Again, assuming you are referring to the helm charts. How are you deploying ClearML?
When you open the UI, do you see any projects there?
If the configurations and hyper params still appear properly in the task, there's no need to rerun the wizard. Just make sure you're using the updated trains repo.
That's the agent-services one, can you check the agent's one?
Actually I removed the key pair, as you said it wasn't a must in the newer versions
It isn't a must, but if you are using one, it should be in the same region
If you want the agent to run in docker mode, the docker.sock should be exposed. But that's the only reason for this configuration.