Make sure you're testing it on the same computer the autoscaler is running on
We will fix it and remove the deprecated flags.
In any case it shouldn't cause issues with your tasks. Is it running?
So the issue was probably the clearml-agent version.
Please try using clearml-agent==0.17.2rc3 and let us know if this solved the issue.
Hey WackyRabbit7 ,
Is this the only error you have there?
Can you verify the credentials in the task seem ok and that it didn't disappear as before?
Also, I understand that the "Failed parsing task parameter ..." warnings no longer appear, correct?
That's the agent-services one, can you check the agent's one?
ColossalAnt7 can you try connecting to one of the trains-agent pods and running trains-agent manually using the following command:

TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version

Then let us know what happens and if you see the new worker in the UI.
Sure, we're using RunInstances; you can see the call itself here: https://github.com/allegroai/trains/blob/master/trains/automation/aws_auto_scaler.py#L163
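If it helps, here's a rough sketch of that kind of call (not the actual autoscaler code, and all the values are placeholders; the real call is at the link above):

import boto3

# Placeholder region, AMI, instance type and key pair, for illustration only
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-12345678",
    InstanceType="g4dn.xlarge",
    KeyName="my-key-pair",
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])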
Hey GreasyPenguin14 ,
The docker-compose.yml and this section specifically were updated.
So first please try again with the new version 🙂
Second, this error seems a bit odd. Which version of docker-compose are you using?
You can check this using: docker-compose --version
If the configurations and hyperparameters still appear properly in the task, there's no need to rerun the wizard. Just make sure you're using the updated trains repo.
Hey SubstantialElk6 ,
You can see the bash script that installs the container here: https://github.com/allegroai/clearml-agent/blob/master/clearml_agent/glue/k8s.py#L61
You are correct that it does run apt-get update in order to install some packages.
You can override this entire list of commands by adding another bash script as a string using the container_bash_script argument. Make sure you add it to the example script (should be added to the initialization https://github.com/allegr...
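Something along these lines should do it (a hedged sketch; the script contents and queue name are placeholders, so check the example script for the exact usage):

from clearml_agent.glue.k8s import K8sIntegration

# Placeholder override for the default container setup commands
my_setup_script = """
apt-get update
apt-get install -y git
"""

k8s = K8sIntegration(container_bash_script=my_setup_script)
k8s.k8s_daemon("my_queue")  # placeholder queue name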
OK, so first, since you have many installations in your bash script, it does make sense that installation would take a long time (note that the agent will only start running after all installations are done).
So for the sake of debugging, I'd suggest removing all the packages (other than the specific trains-agent version you're using) and trying again. Add those packages to the task you are trying to run instead, and you should see the instance much faster.
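If you prefer doing that from code rather than editing the task's packages in the UI, Task.add_requirements should also work (a hedged sketch; the package name and task names are placeholders):

from trains import Task

# Must be called before Task.init so the requirement is registered with the task
Task.add_requirements("some-package")
task = Task.init(project_name="examples", task_name="my task")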
Great.
Note that instead of removing those lines, you can override them using the extra_vm_bash_script setting.
For example:

extra_vm_bash_script = """
export CLEARML_API_HOST=<api_server>
export CLEARML_WEB_HOST=<web_server>
export CLEARML_FILES_HOST=<files_server>
"""
Probably something's wrong with the instance. Which AMI did you use? The default one?
Hey JitteryCoyote63 !
Can you please update us what permissions did you end up using for the autoscaler?
Were the above enough?
Thanks!
I'd suggest trying what AgitatedDove14 suggested here: https://clearml.slack.com/archives/CTK20V944/p1614540843119600?thread_ts=1613923591.002100&cid=CTK20V944 . It seems like you're somehow using an older version of the agent.
I think using the glue could be a good solution for you, so it seems like a good idea to try and get it to work.
To check, go to the experiment's page and then to EXECUTION > AGENT CONFIGURATION > BASE DOCKER IMAGE
If it's set to any value, clearing it would solve your problem.
Sure, ping me if it's still happening.
Hey LovelyHamster1 ,
This means that for some reason the agent fails to run on the newly created instances, and the instances are terminated.
The credentials could definitely cause that.
Can you try adding the credentials as they appear in your clearml.conf?
To do so, create new credentials from your profile page in the UI, and add the entire section to the extra_trains_conf section in the following way:
extra_trains_conf = """
api {
    web_server: "<webserver>"
    api_server: "<apiserver>"
    ...
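For reference, the full section generated in the UI usually looks something like this (all the values below are placeholders):

extra_trains_conf = """
api {
    web_server: "<webserver>"
    api_server: "<apiserver>"
    files_server: "<fileserver>"
    credentials {
        "access_key" = "<access-key>"
        "secret_key" = "<secret-key>"
    }
}
"""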
If it does appear in the UI faster, then it's only a matter of waiting. If you still don't see the instance, I'd suggest you SSH into it and investigate a bit what's going on.
Hey SubstantialElk6 ,
This issue was fixed in the latest clearml-agent version.
Please try using v0.17.2 🙂
Hey LovelyHamster1 ,
Any chance the task you are trying to run has a base docker defined in it?
Seems like the env variable isn't passed for some reason. We'll push a fix for this issue soon; I'll keep you posted 🙂
Can you try setting the base_docker_image of the specific task you are running to nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true ?
To do so go to the task's execution tab, scroll down and set the base docker section to the above.
By the way, are you editing the values directly? Why not use the values file?
OK, that's odd.
Anyway, try setting:

extra_configurations = {"SubnetId": "<subnet-id>"}

instead of:

extra_configurations = {'SubnetId': "<subnet-id>"}
Hey LovelyHamster1 ,
If S3 is what you're interested in, then the above should do the trick.
Note that you can attach the IAM role using an instance profile. You can read about those here:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html
Once you have an instance profile, you can add it to the autoscaler using the extra_configurations section in the autoscaler.
Under your resource_configurations -> some resource name -> add an ...
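Something along these lines should work (the profile name here is a placeholder, and IamInstanceProfile is the corresponding RunInstances parameter):

extra_configurations = {"IamInstanceProfile": {"Name": "my-instance-profile"}}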
Did you change anything under the agent's value?
In case you didn't, please try editing agent.clearmlWebHost and setting it to the value of your webserver (use the same one you used for the agent services).
This might solve your issue.
Hey JitteryCoyote63 ,
The autoscaler was tested with full EC2 permissions.
I believe you only need the following:

ec2:StartInstances
ec2:StopInstances
ec2:DescribeInstances

But there might be some others we're missing.
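If you want to scope a policy down to just those, a minimal IAM policy document would look something like this (an untested sketch; as noted, you may well need to add more actions):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:DescribeInstances"
            ],
            "Resource": "*"
        }
    ]
}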
WackyRabbit7 - I think you asked this question before, do you have some more input you can share here?
Searching for this error, it seems it could be many things.
Either wrong credentials, or a wrong region (different from the one for your key-pair).
It could also be that your computer's clock is wrong (see this example: https://github.com/mitchellh/vagrant-aws/issues/372#issuecomment-87429450 ).
I suggest you search for it online and see if anything solves the issue; I think it requires some debugging on your end.