Hey DeliciousBluewhale87,
It seems like this is the log of a task that was pulled by the agent running on the clearml-services pod - is that the case? Where did you find this log?
Also - can you please send us the list of all the running pods in the namespace? I want to make sure the other agents are up.
Can you try removing the port from the webhost?
ok that's odd.
Anyway, try setting extra_configurations = {"SubnetId": "<subnet-id>"}
instead of: extra_configurations = {'SubnetId': "<subnet-id>"}
Also, can you send the entire log?
I waited 20 mins, refreshing the logs every 2 mins.
Sounds like more than enough
BTW, is there any specific reason for not upgrading to clearml? 🙂
Just making sure, you changed both the agent one and the agent-services one?
You can try overriding the following in your values.yaml, under the agent section:
agentVersion: "==0.16.2rc1"
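i.e., something like this in the file (a minimal sketch, just showing the nesting under agent):
agent:
  agentVersion: "==0.16.2rc1"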
By the way, are you editing the values directly? Why not use the values file?
Did you change anything under the agent's values?
In case you didn't, please try editing agent.clearmlWebHost
and setting it to your webserver address (use the same one you used for the agent services).
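For example, in values.yaml (a rough sketch; the address is a placeholder for your actual webserver):
agent:
  clearmlWebHost: "http://<your-webserver-address>"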
This might solve your issue.
Hey JitteryCoyote63!
Can you please update us on which permissions you ended up using for the autoscaler?
Were the above enough?
Thanks!
That's great - from that I understand the trains-services worker does appear in the UI, is that correct? Did the task run? Did you change the trainsApiHost
under agentservices
in the values.yaml?
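Something like this (a minimal sketch; the address is a placeholder for your actual API server):
agentservices:
  trainsApiHost: "http://<your-apiserver-address>"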
Great, let us know how it goes.
Have a great weekend!
Can you send me your configurations? I want to make sure there's nothing we're missing there.
(without the actual keys and secrets of course)
Make sure you're testing it on the same computer the autoscaler is running on
If it does appear in the UI eventually, then it's only a matter of waiting. If you still don't see the instance, I'd suggest you ssh into it and investigate a bit what's going on.
What do you mean by 'not taking effect with the k8s glue'?
Can you try setting the base_docker_image of the specific task you are running to nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true ?
To do so, go to the task's Execution tab, scroll down, and set the base docker section to the above.
So the issue was probably the clearml-agent version.
Please try using clearml-agent==0.17.2rc3 and let us know if this solved the issue.
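If you're deploying the agent through the helm chart, one way to do that (a sketch, assuming the same agentVersion override mentioned earlier applies here) would be:
agent:
  agentVersion: "==0.17.2rc3"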
Good, are we sure that the problem is that the variable isn't set?
Can you please use kubectl describe pod <task-pod-name>
and send me the output?
SubstantialElk6 - As a side note, since Docker is about to be deprecated as a Kubernetes container runtime, we plan to switch to another runtime sometime in the near future. This actually means that the entire docker.sock issue will not be relevant very soon 🙂
Hey WackyRabbit7,
Is this the only error you have there?
Can you verify that the credentials in the task seem OK and that they didn't disappear as before?
Also, I understand that the "Failed parsing task parameter ..."
warnings no longer appear, correct?
Probably something's wrong with the instance. Which AMI did you use? The default one?
Hey SubstantialElk6,
This issue was fixed in the latest clearml-agent version.
Please try using v0.17.2 🙂
Searching for this error, it seems it could be many things:
either wrong credentials or a wrong region (different from the one for your key pair).
It could also be that your computer's clock is wrong (see this example: https://github.com/mitchellh/vagrant-aws/issues/372#issuecomment-87429450 ).
I suggest you search for it online and see what solves the issue; I think it requires some debugging on your end.
security_group_ids = ["<sec_group_id>"]
(note that I had a typo earlier - it's the id, not the name; don't want to misguide you!)