Hey SubstantialElk6 ,
I'm assuming you are referring to our helm charts?
If so, then you can set agent.dockerMode
to false ( https://github.com/allegroai/clearml-server-k8s/blob/master/clearml-server-chart/values.yaml#L46 ), and then the docker.sock
configuration will be turned off. Note that this means that your agents will not be running on docker mode 🙂
Subnet isn't supported as-is in the autoscaler, but you can add it using extra_configurations , like so: extra_configurations = {'SubnetId': '<subnet-id>'}
Also, can you send the entire log?
BTW, is there any specific reason for not upgrading to clearml? 🙂
Ok, so first, since you have many installations in your bash script, it does make sense that the installation would take a long time (note that the agent will only start running after all installations are done).
For the sake of debugging, I'd suggest removing all the packages (other than the specific trains-agent version you're using) and trying again. Add those packages to the task you are trying to run instead, and you should see the instance come up much faster.
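If it helps, here's one way to attach packages to the task itself from code (a minimal sketch assuming the trains SDK's Task.add_requirements; the package names and project/task names are just placeholders, and you can also simply edit the installed packages section of the task in the UI):
```
from trains import Task

# Must be called before Task.init() so the packages are recorded on the task's requirements.
Task.add_requirements("torch", "1.7.1")  # hypothetical package and version
Task.add_requirements("pandas")

task = Task.init(project_name="examples", task_name="my task")  # placeholder names
```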
Hey DeliciousBluewhale87 ,
It seems like this log is the log of a task that was pulled by the agent running on the clearml-services pod, is this the case? Where did you find the above log?
Also - can you please send us the list of all the running pods in the namespace? I want to make sure the other agents are up.
Hey LovelyHamster1 ,
Any chance the task you are trying to run has a base docker defined in it?
security_group_ids = ["<sec_group_id>"]
(note that I had a typo: it's the id, not the name, don't want to misguide you!)
Can you send me your configurations? I want to make sure there's nothing we're missing there.
(without the actual keys and secrets of course)
Sure, we're using RunInstances . You can see the call itself here: https://github.com/allegroai/trains/blob/master/trains/automation/aws_auto_scaler.py#L163
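If it saves you opening the link, the call is roughly of this shape (a simplified sketch using boto3 directly, not the actual autoscaler code; the real call at the linked line passes the full resource configuration):
```
import boto3

ec2 = boto3.client("ec2", region_name="<region>")  # placeholder region

# Simplified: the autoscaler fills these from the resource configuration,
# plus whatever you put under extra_configurations.
ec2.run_instances(
    ImageId="<ami-id>",
    InstanceType="<instance-type>",
    MinCount=1,
    MaxCount=1,
)
```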
Hey SubstantialElk6 ,
You can see the bash script that installs the container here: https://github.com/allegroai/clearml-agent/blob/master/clearml_agent/glue/k8s.py#L61
You are correct that it does run apt-get update in order to install some stuff.
You can override this entire list of commands by adding another bash script as a string using the container_bash_script
argument. Make sure you add it to the example script (should be added to the initialization https://github.com/allegr...
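For example, something along these lines could work (a rough sketch only: the K8sIntegration constructor arguments, the placeholders like {task_id}, and the final agent-execution line should be checked against k8s_glue_example.py and the default CONTAINER_BASH_SCRIPT in k8s.py for your clearml-agent version):
```
from clearml_agent.glue.k8s import K8sIntegration

# Hypothetical override of the default install commands. Keep the final
# agent-execution line (and its placeholders) from the default script,
# otherwise the container will never actually run the task.
my_container_bash_script = """
export DEBIAN_FRONTEND=noninteractive
apt-get update
apt-get install -y git python3-pip
python3 -m pip install -U pip clearml-agent
{extra_bash_init_cmd}
python3 -m clearml_agent execute --full-monitoring --require-queue --id {task_id}
"""

k8s = K8sIntegration(container_bash_script=my_container_bash_script)
k8s.k8s_daemon("default")  # serve the "default" queue, same as the example script
```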
Hey GreasyPenguin14 ,
The docker-compose.yml and this section specifically were updated.
So first please try again with the new version 🙂
Second - this error seems a bit odd, which version of docker-compose are you using?
You can check this using: docker-compose --version
Hey LovelyHamster1 ,
If s3 is what you're interested in, then the above would do the trick.
Note that you can attach the IAM using instance profiles. You can read about those here:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html
Once you have an instance profile, you can add it to the autoscaler using the extra_configurations section.
Under your resource_configurations -> some resource name -> add an ...
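For reference, here's a sketch of how that part of the autoscaler configuration could look (the resource name and all values are placeholders; IamInstanceProfile follows the EC2 RunInstances parameter format, and the other keys follow the aws_autoscaler example):
```
# Placeholder values - replace with your own resource definition.
resource_configurations = {
    "my_gpu_resource": {
        "instance_type": "g4dn.4xlarge",
        "is_spot": False,
        "availability_zone": "us-east-1b",
        "ami_id": "<ami-id>",
        "ebs_device_name": "/dev/sda1",
        "ebs_volume_size": 100,
        "ebs_volume_type": "gp2",
        # Passed through to the EC2 RunInstances call:
        "extra_configurations": {
            "IamInstanceProfile": {"Name": "<instance-profile-name>"},
        },
    }
}
```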
Make sure you're testing it on the same computer the autoscaler is running on
Hey JitteryCoyote63 ,
Autoscaler was tested with full ec2 permissions.
I believe you only need the following: ec2:StartInstances , ec2:StopInstances and ec2:DescribeInstances
But there might be some others we're missing.
WackyRabbit7 - I think you asked this question before, do you have some more input you can share here?
Good, are we sure that the problem is that the variable isn't set?
Can you please use kubectl describe pod <task-pod-name> and send me the output?
Seems like the env variable isn't passed for some reason, we'll push a fix for this issue soon, I'll keep you posted 🙂
What do you mean by ' not taking effect with the k8s glue '?
Can you try setting the base_docker_image of the specific task you are running to nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true ?
To do so go to the task's execution tab, scroll down and set the base docker section to the above.
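If it's more convenient, you can also set it from code when creating the task (a small sketch assuming the SDK's set_base_docker method and placeholder project/task names; the UI route above works just as well):
```
from clearml import Task  # or `from trains import Task` on older setups

task = Task.init(project_name="examples", task_name="my task")  # placeholder names
# Same value as the base docker section in the UI:
task.set_base_docker("nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true")
```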
ok that's odd.
Anyway, try setting extra_configurations = {"SubnetId": "<subnet-id>"} instead of extra_configurations = {'SubnetId': "<subnet-id>"}
Hey WackyRabbit7 ,
Is this the only error you have there?
Can you verify the credentials in the task seem ok and that they didn't disappear as before?
Also, I understand that the Failed parsing task parameter ...
warnings no longer appear, correct?
Hey JitteryCoyote63 !
Can you please update us on which permissions you ended up using for the autoscaler?
Were the above enough?
Thanks!
Let me know if this solves your problem
I understand, but for some reason you are getting an error about the clearml webserver. Try changing the agent.clearmlWebHost value in the values.yaml file to the same value you filled in manually for the agent-services web host.
For example, you can ssh into it and try running trains-agent manually to see if it's installed and whether it fails for some reason.
If it does appear in the UI faster, then it's only a matter of waiting. If you still don't see the instance, I'd suggest sshing into it and investigating a bit to see what's going on.
Can you check which trains version appears under the package requirements for the autoscaler?
Probably something's wrong with the instance. Which AMI did you use? The default one?