If you want the agent to run in docker mode, the docker.sock should be exposed. But that's the only reason for this configuration.
Sure, we're using RunInstances, you can see the call itself here: https://github.com/allegroai/trains/blob/master/trains/automation/aws_auto_scaler.py#L163
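To give you an idea, here's a rough boto3 sketch of that kind of call (the values below are placeholders for illustration only, not what the autoscaler actually passes - it fills everything in from your resource configuration):

import boto3

# placeholder values for illustration only
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",            # AMI from the resource configuration
    InstanceType="g4dn.xlarge",        # instance type from the resource configuration
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",             # optional key pair
    SecurityGroupIds=["sg-xxxxxxxx"],
)
instance_id = response["Instances"][0]["InstanceId"]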
Also, can you send the entire log?
Can you try removing the port from the webhost?
Can you send me your configurations? I want to make sure there's nothing we're missing there.
(without the actual keys and secrets of course)
If it does appear in the UI faster, then it's only a matter of waiting. If you still don't see the instance, I'd suggest you ssh into it and investigate a bit to see what's going on
OK, that's odd.
Anyway, try setting extra_configurations = {"SubnetId": "<subnet-id>"}
instead of extra_configurations = {'SubnetId': "<subnet-id>"}
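For context, this is roughly where it sits in the autoscaler configuration (just a sketch - the resource name and the other keys here are placeholders, keep whatever you already have):

resource_configurations {
    my_resource {
        instance_type = "g4dn.xlarge"
        ami_id = "<ami-id>"
        availability_zone = "<availability-zone>"
        extra_configurations = {"SubnetId": "<subnet-id>"}
    }
}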
You can try overriding the following in your values.yaml under the agent section: agentVersion: "==0.16.2rc1"
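i.e. something along these lines in your values.yaml (a sketch, keep the rest of your agent section as is):

agent:
  agentVersion: "==0.16.2rc1"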
Let me know if this solves your problem
Hey JitteryCoyote63 ,
The autoscaler was tested with full EC2 permissions.
I believe you only need the following: ec2:StartInstances, ec2:StopInstances, ec2:DescribeInstances
But there might be some others we're missing.
WackyRabbit7 - I think you asked this question before, do you have some more input you can share here?
Actually I removed the key pair, as you said it wasn't a must in the newer versions
It isn't a must, but if you are using one, it should be in the same region
Sure, ping me if it's still happening.
Hey LovelyHamster1 ,
Any chance the task you are trying to run has a base docker defined in it?
security_group_ids = ["<sec_group_id>"]
(note that I had a typo: it's the ID, not the name, I don't want to misguide you!)
That's the agent-services one, can you check the agent's one?
Make sure you're testing it on the same computer the autoscaler is running on
I understand, but for some reason you are getting an error about the clearml webserver. Try changing the value in the values.yaml file for agent.clearmlWebHost to the same value you filled in manually for the agent-services web host
Searching this error it seems it could be many things.
Either wrong credentials or a wrong region (different from the one for your key pair).
It could also be that your computer clock is wrong (see example https://github.com/mitchellh/vagrant-aws/issues/372#issuecomment-87429450 ).
I suggest you search it online and see if it solves the issue, I think it requires some debugging on your end.
BTW, is there any specific reason for not upgrading to clearml? 🙂
Hey LovelyHamster1 ,
If S3 is what you're interested in, then the above would do the trick.
Note that you can attach the IAM role using instance profiles. You can read about those here:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html
Once you have an instance profile, you can add it to the autoscaler using the extra_configurations section in the autoscaler.
Under your resource_configurations -> some resource name -> add an ... (see the sketch below).
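For example, something along these lines (a sketch - the profile name is a placeholder, and this assumes your autoscaler version passes extra_configurations straight through to the instance launch call):

resource_configurations {
    my_resource {
        extra_configurations = {"IamInstanceProfile": {"Name": "<instance-profile-name>"}}
    }
}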
Did you change anything under the agent's value?
In case you didn't, please try editing agent.clearmlWebHost and setting it to the value of your webserver (use the same one you used for the agent-services).
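Something like this (a sketch - use the exact address you already filled in for agent-services):

agent:
  clearmlWebHost: "http://<your-webserver-address>"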
This might solve your issue.
Hey ColossalAnt7 ,
What version of trains-agent are you using?
You can try upgrading to the latest RC version, this issue should be fixed there: pip install trains-agent==0.16.2rc1
Hey JitteryCoyote63 !
Can you please update us what permissions did you end up using for the autoscaler?
Were the above enough?
Thanks!
If the configurations and hyperparams still appear properly in the task, there's no need to rerun the wizard. Just make sure you're using the updated trains repo
To check, go to the experiment's page and then to EXECUTION > AGENT CONFIGURATION > BASE DOCKER IMAGE
If it's set to any value, clearing it would solve your problem.
Hey SubstantialElk6 ,
I'm assuming you are referring to our helm charts?
If so, then you can set agent.dockerMode to false ( https://github.com/allegroai/clearml-server-k8s/blob/master/clearml-server-chart/values.yaml#L46 ), and the docker.sock configuration will be turned off. Note that this means your agents will not be running in docker mode 🙂
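i.e. something like this in your values.yaml:

agent:
  dockerMode: false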
Good, are we sure that the problem is that the variable isn't set?
Can you please run kubectl describe pod <task-pod-name> and send me the output?
SubstantialElk6 - As a side note, since docker is about to be deprecated as a Kubernetes container runtime, sometime in the near future we plan to switch to another runtime. This actually means that the entire docker.sock issue will not be relevant very soon 🙂