Actually I removed the key pair, as you said it wasn't a must in the newer versions
It isn't a must, but if you are using one, it should be in the same region
If the configurations and hyperparams still appear properly in the task, there's no need to rerun the wizard. Just make sure you're using the updated trains repo
Make sure you're testing it on the same computer the autoscaler is running on
Sure, we're using RunInstances, you can see the call itself here: https://github.com/allegroai/trains/blob/master/trains/automation/aws_auto_scaler.py#L163
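For context, that call boils down to a boto3 run_instances request. A minimal sketch of what such a call looks like (all values here are placeholders, not the autoscaler's actual parameters):
```
# Minimal sketch of an EC2 RunInstances call via boto3.
# All values are placeholders - not what the autoscaler actually passes.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="g4dn.xlarge",        # placeholder instance type
    MinCount=1,
    MaxCount=1,
    # KeyName is optional - as noted above, a key pair isn't a must
)
print(response["Instances"][0]["InstanceId"])
```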
Great, let us know how it goes.
Have a great weekend!
Or - which api-server the UI is actually connecting to? 🙂
When you open the UI, do you see any projects there?
ColossalAnt7 can you try connecting to one of the trains-agent pods and running trains-agent manually using the following command: TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version
Then let us know what happens and if you see the new worker in the UI
You can try overriding the following in your values.yaml under the agent section: agentVersion: "==0.16.2rc1"
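For example, a minimal sketch of the override (assuming the agent block sits at the top level of your values.yaml, as the key above suggests):
```
# values.yaml - only the override shown here, keep the rest of your values as-is
agent:
  agentVersion: "==0.16.2rc1"
```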
Hey ColossalAnt7 ,
What version of trains-agent are you using?
You can try upgrading to the latest RC version, this issue should be fixed there: pip install trains-agent==0.16.2rc1
To check, go to the experiment's page and then to EXECUTION > AGENT CONFIGURATION > BASE DOCKER IMAGE
If it's set to any value, clearing it would solve your problem.
Hey LovelyHamster1 ,
Any chance the task you are trying to run has a base docker defined in it?
Hey DeliciousBluewhale87 ,
It seems like this log is the log of a task that was pulled by the agent running on the clearml-services pod, is this the case? Where did you find the above log?
Also - can you please send us the list of all the running pods in the namespace? I want to make sure the other agents are up.
By the way, are you editing the values directly? Why not use the values file?
Did you change anything under the agent's value?
In case you didn't, please try editing agent.clearmlWebHost and setting it to the value of your webserver (use the same one you used for the agent services).
This might solve your issue.
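A sketch of what that override could look like in values.yaml (the address below is a placeholder, use your actual web server value):
```
# values.yaml - placeholder address, replace with your real web server URL
agent:
  clearmlWebHost: "http://<your-webserver-address>:8080"
```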
That's the agent-services one, can you check the agent's one?
Hey LovelyHamster1 ,
If S3 is what you're interested in, then the above will do the trick.
Note that you can attach the IAM using instance profiles. You can read about those here:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html
Once you have an instance profile, you can add it to the autoscaler using the extra_configurations
section in the autoscaler.
Under your resource_configurations -> some resource name -> add an ...
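Just as an illustration (the resource name and exact nesting here are assumptions on my side; the IamInstanceProfile block follows the EC2 RunInstances API):
```
# Sketch only - "my_gpu_resource" and the surrounding structure are illustrative,
# use your existing resource entry and values.
resource_configurations {
    my_gpu_resource {
        instance_type: "g4dn.xlarge"
        ami_id: "ami-0123456789abcdef0"
        extra_configurations {
            IamInstanceProfile {
                Arn: "arn:aws:iam::<account-id>:instance-profile/<profile-name>"
            }
        }
    }
}
```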
Hey LovelyHamster1 ,
This means that for some reason the agent on the instances created fails to run and the instance is terminated.
The credentials could definitely cause that.
Can you try adding the credentials as they appear in your clearml.conf?
To do so, create new credentials from your profile page in the UI, and add the entire section to the extra_trains_conf section in the following way:
` extra_trains_conf = """
api {
web_server: "<webserver>"
api_server: "<apiserver>"
...
Hey WackyRabbit7 ,
Is this the only error you have there?
Can you verify the credentials in the task seem ok and that it didn't disappear as before?
Also, I understand that the Failed parsing task parameter ... warnings no longer appear, correct?
Hey SubstantialElk6 ,
Can you show us the top output you get when using the template-yaml instead of overrides-yaml?
I'd suggest that you try what AgitatedDove14 suggested https://clearml.slack.com/archives/CTK20V944/p1614540843119600?thread_ts=1613923591.002100&cid=CTK20V944 . It seems like you're using an older version of the agent somehow.
I think using the glue could be a good solution for you, so it seems like a good idea to try and get it to work.
SubstantialElk6 - As a side note, since Docker is about to be deprecated as a Kubernetes container runtime, we plan to switch to another runtime sometime in the near future. This actually means that the entire docker.sock issue will not be relevant very soon 🙂
Seems like the env variable isn't passed for some reason, we'll push a fix for this issue soon, I'll keep you posted 🙂
Can you try setting the base_docker_image of the specific task you are running to nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true ?
To do so go to the task's execution tab, scroll down and set the base docker section to the above.
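If it's more convenient to do it from code, something like this may also work (a sketch; it assumes the set_base_docker method is available in the trains/clearml SDK version you have installed):
```
from trains import Task  # `from clearml import Task` on newer versions

# Assumes set_base_docker is available in your SDK version
task = Task.get_task(task_id="<your-task-id>")
task.set_base_docker("nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true")
```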
We will fix it and remove the deprecated flags.
In any case it shouldn't cause issues with your tasks. Is it running?
Hey SubstantialElk6 ,
This issue was fixed in the latest clearml-agent version.
Please try using v0.17.2 🙂
So the issue was probably the clearml-agent version.
Please try using clearml-agent==0.17.2rc3 and let us know if this solved the issue.
Again, I'm assuming you are referring to the Helm charts. How are you deploying ClearML?