Sure, we're using RunInstances; you can see the call itself here: https://github.com/allegroai/trains/blob/master/trains/automation/aws_auto_scaler.py#L163
Hey DeliciousBluewhale87,
It seems like this is the log of a task that was pulled by the agent running on the clearml-services pod - is that the case? Where did you find this log?
Also - can you please send us the list of all the running pods in the namespace? I want to make sure the other agents are up.
Hey SubstantialElk6,
I'm assuming you are referring to our helm charts?
If so, then you can set agent.dockerMode to false ( https://github.com/allegroai/clearml-server-k8s/blob/master/clearml-server-chart/values.yaml#L46 ), and the docker.sock configuration will be turned off. Note that this means your agents will not be running in docker mode 🙂
Again, assuming you are referring to the helm charts. How are you deploying ClearML?
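For reference, the relevant values.yaml excerpt would look something like this (the agent.dockerMode key is taken from the linked chart; the surrounding structure is assumed and other keys are omitted):

```yaml
# clearml-server-chart values.yaml excerpt (assumed layout based on the
# linked file) - disables docker mode / the docker.sock mount for agents
agent:
  dockerMode: false
```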
We will fix it and remove the deprecated flags.
In any case it shouldn't cause issues with your tasks. Is it running?
Or rather - which api-server is the UI actually connecting to? 🙂
I waited 20 mins, refreshing the logs every 2 mins.
Sounds like more than enough
Let me know if this solves your problem
Hey SubstantialElk6,
This issue was fixed in the latest clearml-agent version.
Please try using v0.17.2 🙂
BTW, is there any specific reason for not upgrading to clearml? 🙂
So the issue was probably the clearml-agent version.
Please try using clearml-agent==0.17.2rc3 and let us know if this solved the issue.
Hey GreasyPenguin14,
The docker-compose.yml and this section specifically were updated.
So first, please try again with the new version 🙂
Second - this error seems a bit odd; which version of docker-compose are you using?
You can check this using: docker-compose --version
Can you check which trains version appears under the package requirements for the autoscaler?
What do you mean by ' not taking effect with the k8s glue '?
security_group_ids = ["<sec_group_id>"] (note that I had a typo: it's the ID, not the name - don't want to misguide you!)
Seems like the env variable isn't passed for some reason; we'll push a fix for this issue soon, I'll keep you posted 🙂
Just making sure, you changed both the agent one and the agent-services one?
Sure, ping me if it's still happening.
Can you try removing the port from the webhost?
Probably something's wrong with the instance. Which AMI did you use? The default one?
subnet isn't supported as-is in the autoscaler, but you can add it using extra_configurations the following way:
extra_configurations = {'SubnetId': <subnet-id>}
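To illustrate what this does, here is a minimal sketch (not the actual autoscaler code; all names and IDs are placeholders) of how an extra_configurations entry ends up as an additional keyword in the RunInstances launch specification:

```python
# Hedged sketch: extra_configurations keys are merged on top of the base
# launch specification before the boto3 RunInstances call. All values below
# are placeholders, not real AWS resource ids.
base_launch_spec = {
    "ImageId": "ami-00000000000000000",  # placeholder AMI id
    "InstanceType": "m5.xlarge",
    "MinCount": 1,
    "MaxCount": 1,
}

extra_configurations = {"SubnetId": "subnet-00000000"}  # placeholder subnet id

# Extra keys take effect simply by being added to the final spec, which
# would then be passed along the lines of ec2_client.run_instances(**launch_spec)
launch_spec = {**base_launch_spec, **extra_configurations}
```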
That's the agent-services one, can you check the agent's one?
Make sure you're testing it on the same computer the autoscaler is running on.
Searching this error it seems it could be many things.
Either wrong credentials or a wrong region (different than the one for your key-pair).
It could also be that your computer clock is wrong (see this example: https://github.com/mitchellh/vagrant-aws/issues/372#issuecomment-87429450 ).
I suggest you search it online and see if it solves the issue, I think it requires some debugging on your end.
Did you change anything under the agent's value?
In case you didn't - please try editing the agent.clearmlWebHost and set it to the value of your webserver (use the same one you used for the agent services).
This might solve your issue.
Good, are we sure that the problem is that the variable isn't set?
Can you please use kubectl describe pod <task-pod-name> and send me the output?
Hey SubstantialElk6,
You can see the bash script that sets up the container here: https://github.com/allegroai/clearml-agent/blob/master/clearml_agent/glue/k8s.py#L61
You are correct that it does run apt-get update in order to install some packages.
You can override this entire list of commands by passing another bash script as a string via the container_bash_script argument. Make sure you add it to the example script (should be added to the initialization https://allegr...
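For illustration, a minimal sketch of such an override (the package list below is an assumption, not the agent's actual defaults, and the usage line is hypothetical):

```python
# Hedged sketch: a replacement setup script for the k8s glue container.
# The real default script lives in clearml_agent/glue/k8s.py (linked above);
# the commands here are illustrative only.
container_bash_script = "\n".join([
    "export DEBIAN_FRONTEND=noninteractive",
    "apt-get update",
    "apt-get install -y git python3-pip",  # assumed prerequisites
    "python3 -m pip install -U pip",
])

# Hypothetical usage - pass the string when constructing the glue, e.g.:
# K8sIntegration(container_bash_script=container_bash_script, ...)
```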
If the configurations and hyperparams still appear properly in the task, there's no need to rerun the wizard. Just make sure you're using the updated trains repo.