Actually I removed the key pair, as you said it wasn't a must in the newer versions
It isn't a must, but if you are using one, it should be in the same region
Hey LovelyHamster1,
This means that for some reason the agent fails to run on the newly created instances, and the instances are then terminated.
The credentials could definitely cause that.
Can you try adding the credentials as they appear in your clearml.conf?
To do so, create new credentials from your profile page in the UI, and add the entire section to the extra_trains_conf
section in the following way:
extra_trains_conf = """
api {
    web_server: "<webserver>"
    api_server: "<apiserver>"
    ...
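For reference, the full section would look roughly like this (all values here are placeholders, copy the actual ones from the credentials dialog in your profile page):
extra_trains_conf = """
api {
    web_server: "<webserver>"
    api_server: "<apiserver>"
    files_server: "<fileserver>"
    credentials {
        "access_key" = "<your-access-key>"
        "secret_key" = "<your-secret-key>"
    }
}
"""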
Hey DeliciousBluewhale87,
It seems like this log is from a task that was pulled by the agent running on the clearml-services pod. Is this the case? Where did you find the above log?
Also - can you please send us the list of all the running pods in the namespace? I want to make sure the other agents are up.
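Something like this should do it (replace the placeholder with the namespace you deployed into):
kubectl get pods -n <your-namespace>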
If it does appear in the UI faster, then it's only a matter of waiting. If you still don't see the instance, I'd suggest you ssh into it and investigate a bit to see what's going on.
So the issue was probably the clearml-agent version.
Please try using clearml-agent==0.17.2rc3 and let us know if this solved the issue.
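i.e.:
pip install clearml-agent==0.17.2rc3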
Can you try setting the base_docker_image of the specific task you are running to nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true ?
To do so, go to the task's execution tab, scroll down, and set the base docker section to the above.
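If it's more convenient, the same thing can be set from code with the SDK's Task.set_base_docker; a minimal sketch (project and task names are placeholders, and you should adjust to the clearml version you have installed):
from clearml import Task

# placeholder project/task names, use your own
task = Task.init(project_name="examples", task_name="my task")
# the agent will use this image (plus the extra docker arguments) when executing the task
task.set_base_docker("nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true")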
Did you change anything under the agent's values?
In case you didn't, please try editing the agent.clearmlWebHost
and setting it to the value of your webserver (use the same one you used for the agent services).
This might solve your issue.
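In the values.yaml that would look something like this (the address is a placeholder, use your actual webserver URL):
agent:
  clearmlWebHost: "http://<your-webserver-address>:8080"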
Just making sure, you changed both the agent one and the agent-services one?
By the way, are you editing the values directly? Why not use the values file?
Great, let us know how it goes.
Have a great weekend!
I understand, but for some reason you are getting an error about the clearml webserver. Try changing the agent.clearmlWebHost value in the values.yaml file to the same value you filled in manually for the agent-services web host.
Make sure you're testing it on the same computer the autoscaler is running on
security_group_ids = ["<sec_group_id>"]
(note that I had a typo there: it's the id, not the name, don't want to misguide you!)
Hey SubstantialElk6,
You can see the bash script that installs the container here: https://github.com/allegroai/clearml-agent/blob/master/clearml_agent/glue/k8s.py#L61 .
You are correct that it does run apt-get update in order to install some required packages.
You can override this entire list of commands by adding another bash script as a string using the container_bash_script
argument. Make sure you add it to the example script (should be added to the initialization https://github.com/allegr...
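A rough sketch of what such an override could look like, assuming you start from the k8s glue example script (the argument name comes from the message above, but the exact constructor signature may differ between versions, so please verify against the example):
# hypothetical override using the container_bash_script argument
from clearml_agent.glue.k8s import K8sIntegration

my_container_bash_script = """
apt-get update
apt-get install -y git gcc
"""

k8s_integration = K8sIntegration(container_bash_script=my_container_bash_script)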
You can try overriding the following in your values.yaml under the agent
section:
agentVersion: "==0.16.2rc1"
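So the relevant part of the values.yaml would read something like (assuming the standard chart layout, where this key sits under the agent section):
agent:
  agentVersion: "==0.16.2rc1"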
Hey GreasyPenguin14,
The docker-compose.yml and this section specifically were updated.
So first please try again with the new version 🙂
Second - this error seems a bit odd, which version of docker-compose are you using?
You can check this using: docker-compose --version
Hey LovelyHamster1,
Any chance the task you are trying to run has a base docker defined in it?
Hey SubstantialElk6,
Can you show us the top output you get when using the template-yaml instead of overrides-yaml?
Hey ColossalAnt7,
What version of trains-agent are you using?
You can try upgrading to the latest RC version, this issue should be fixed there:
pip install trains-agent==0.16.2rc1
Subnet isn't supported as-is in the autoscaler, but you can add it using extra_configurations
in the following way:
extra_configurations = {'SubnetId': '<subnet-id>'}
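extra_configurations should accept additional key-value pairs the same way; since SubnetId is a boto3 run_instances parameter, other run_instances parameters are worth trying too, for example (EbsOptimized here is just an illustration, verify it fits your setup):
extra_configurations = {'SubnetId': '<subnet-id>', 'EbsOptimized': True}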
What do you mean by ' not taking effect with the k8s glue '?
SubstantialElk6 - As a side-note, since docker is about to be deprecated, sometime in the near future we plan to switch to another runtime. This actually means that the entire docker.sock issue will not be relevant very soon 🙂
That's the agent-services one, can you check the agent's one?
I waited 20 mins, refreshing the logs every 2 mins.
Sounds like more than enough
Probably something's wrong with the instance. Which AMI did you use? The default one?
Searching for this error, it seems it could be many things.
Either wrong credentials or a wrong region (different from the one for your key-pair).
It could also be that your computer clock is wrong (see an example here: https://github.com/mitchellh/vagrant-aws/issues/372#issuecomment-87429450 ).
I suggest you search for it online and see if any of the suggested fixes solve the issue; I think it requires some debugging on your end.