As an example, you can ssh to it and try running trains-agent manually to see if it's installed and whether it fails for some reason.
ok that's odd.
Anyway, try setting:
extra_configurations = {"SubnetId": "<subnet-id>"}
instead of:
extra_configurations = {'SubnetId': "<subnet-id>"}
Probably something's wrong with the instance. Which AMI did you use? The default one?
Can you send me your configurations? I want to make sure there's nothing we're missing there.
(without the actual keys and secrets of course)
So the issue was probably the clearml-agent version.
Please try using clearml-agent==0.17.2rc3 and let us know if this solved the issue.
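For example:
pip install clearml-agent==0.17.2rc3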
Can you try removing the port from the webhost?
Hey WackyRabbit7 ,
Is this the only error you have there?
Can you verify the credentials in the task seem ok and that it didn't disappear as before?
Also, I understand that the Failed parsing task parameter ... warnings no longer appear, correct?
If it does appear in the UI, then it's only a matter of waiting. If you still don't see the instance, I'd suggest you ssh to the instance and investigate a bit what's going on.
If the configurations and hyperparams still appear properly in the task, there's no need to rerun the wizard. Just make sure you're using the updated trains repo.
Hey LovelyHamster1 ,
If S3 is what you're interested in, then the above would do the trick.
Note that you can attach the IAM role using instance profiles. You can read about those here:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html
Once you have an instance profile, you can add it to the autoscaler using the extra_configurations section in the autoscaler.
Under your resource_configurations -> some resource name -> add an ...
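For illustration, a rough sketch of what that could look like (the resource name and the other fields are placeholders; IamInstanceProfile is the standard EC2 RunInstances parameter):
resource_configurations {
    my_resource {
        instance_type: "g4dn.xlarge"
        ...
        extra_configurations = {"IamInstanceProfile": {"Name": "<instance-profile-name>"}}
    }
}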
Sure, ping me if it's still happening.
If you want the agent to run in docker mode, the docker.sock should be exposed. But that's the only reason for this configuration.
Hey LovelyHamster1 ,
This means that for some reason the agent on the created instances fails to run and the instance is terminated.
The credentials could definitely cause that.
Can you try adding the credentials as they appear in your clearml.conf?
To do so, create new credentials from your profile page in the UI, and add the entire section to the extra_trains_conf
section in the following way:
extra_trains_conf = """
api {
    web_server: "<webserver>"
    api_server: "<apiserver>"
    ...
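For reference, a full api section typically looks something like this (values are placeholders, take the real ones from the credentials you created in your profile page):
extra_trains_conf = """
api {
    web_server: "https://<your-webserver>"
    api_server: "https://<your-apiserver>"
    files_server: "https://<your-files-server>"
    credentials {
        "access_key" = "<access-key>"
        "secret_key" = "<secret-key>"
    }
}
"""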
What do you mean by 'not taking effect with the k8s glue'?
Hey JitteryCoyote63 ,
The autoscaler was tested with full EC2 permissions.
I believe you only need the following: ec2:StartInstances, ec2:StopInstances, ec2:DescribeInstances
But there might be some others we're missing.
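As a starting point, a minimal IAM policy with just those actions would look something like this (you may well need to add more actions, as noted):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:DescribeInstances"
            ],
            "Resource": "*"
        }
    ]
}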
WackyRabbit7 - I think you asked this question before, do you have some more input you can share here?
Seems like the env variable isn't passed for some reason, we'll push a fix for this issue soon, I'll keep you posted 🙂
Searching for this error, it seems it could be many things.
Either wrong credentials or a wrong region (different from the one for your key-pair).
It could also be that your computer clock is wrong (see example https://github.com/mitchellh/vagrant-aws/issues/372#issuecomment-87429450 ).
I suggest you search it online and see if it solves the issue, I think it requires some debugging on your end.
Hey LovelyHamster1 ,
Any chance the task you are trying to run has a base docker defined in it?
Sure, we're using RunInstances, you can see the call itself here: https://github.com/allegroai/trains/blob/master/trains/automation/aws_auto_scaler.py#L163
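Very roughly, it boils down to a boto3 call along these lines (a simplified sketch, not the actual trains code; the extra_configurations dict from your config gets merged into the call):
import boto3

# EC2 client in the region configured for the autoscaler
ec2 = boto3.client("ec2", region_name="<region>")

# whatever you set in the autoscaler config
extra_configurations = {"SubnetId": "<subnet-id>"}

ec2.run_instances(
    ImageId="<ami-id>",
    InstanceType="<instance-type>",
    MinCount=1,
    MaxCount=1,
    **extra_configurations,  # e.g. SubnetId, IamInstanceProfile, etc.
)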
Good, are we sure that the problem is that the variable isn't set?
Can you please use kubectl describe pod <task-pod-name> and send me the output?
BTW, is there any specific reason for not upgrading to clearml? 🙂
I'd suggest that you try what AgitatedDove14 suggested https://clearml.slack.com/archives/CTK20V944/p1614540843119600?thread_ts=1613923591.002100&cid=CTK20V944 . It seems like you're using an older version of the agent somehow.
I think using the glue could be a good solution for you, so it seems like a good idea to try and get it to work.
Hey SubstantialElk6 ,
I'm assuming you are referring to our helm charts?
If so, then you can set agent.dockerMode to false ( https://github.com/allegroai/clearml-server-k8s/blob/master/clearml-server-chart/values.yaml#L46 ), and then the docker.sock configuration will be turned off. Note that this means that your agents will not be running in docker mode 🙂
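For example, something along these lines should do it (the release name and chart path are placeholders for your own deployment):
helm upgrade <release-name> <path-to>/clearml-server-chart --set agent.dockerMode=false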
Hey GreasyPenguin14 ,
The docker-compose.yml and this section specifically were updated.
So first please try again with the new version 🙂
Second - this error seems a bit odd, which version of docker-compose are you using?
You can check this using: docker-compose --version
That's the agent-services one, can you check the agent's one?
Hey SubstantialElk6 ,
Can you show us the top output you get when using the template-yaml instead of overrides-yaml?
Again, assuming you are referring to the helm charts. How are you deploying ClearML?
Those are different credentials.
You should have the AWS info under: cloud_credentials_key, cloud_credentials_secret and cloud_credentials_region.
And the stuff added to the extra_vm_bash_script are the trains key and secret from your profile page in the UI.
I suggest you use the wizard again to run the task, this will make sure all the data is where it should be.
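If you do end up editing the configuration manually, those AWS fields would look roughly like this (placeholder values only):
cloud_credentials_key = "<aws-access-key-id>"
cloud_credentials_secret = "<aws-secret-access-key>"
cloud_credentials_region = "<aws-region>"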