Make sure you're testing it on the same computer the autoscaler is running on
That's great. From that I understand that the trains-services worker does appear in the UI, correct? Did the task run? Did you change the trainsApiHost under agentservices in the values.yaml?
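For reference, a minimal sketch of that part of the values.yaml (the address is just a placeholder, point it at your own api-server):
agentservices:
  trainsApiHost: "http://<your-apiserver-address>"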
BTW, is there any specific reason for not upgrading to clearml? 🙂
Actually I removed the key pair, as you said it wasn't a must in the newer versions
It isn't a must, but if you are using one, it should be in the same region
Hey WackyRabbit7 ,
Is this the only error you have there?
Can you verify the credentials in the task seem ok and that it didn't disappear as before?
Also, I understand that the Failed parsing task parameter ... warnings no longer appear, correct?
If it does appear in the UI faster, then it's only a matter of waiting. If you still don't see the instance, I'd suggest you ssh into the instance and investigate a bit what's going on.
For example, you can ssh into it and try running trains-agent manually to see if it's installed and whether it fails for some reason.
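Something along these lines (placeholder key and address, and assuming an Ubuntu-based AMI and the default queue):
ssh -i <your-key.pem> ubuntu@<instance-public-ip>
# once inside, check the agent is installed and try starting it
trains-agent --help
trains-agent daemon --queue default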
Also, can you send the entire log?
If the configurations and hyperparams still appear properly in the task, there's no need to rerun the wizard. Just make sure you're using the updated trains repo.
Just making sure, you changed both the agent one and the agent-services one?
Can you check which trains version appears under the package requirements for the autoscaler?
I waited 20 mins, refreshing the logs every 2 mins.
Sounds like more than enough
OK, so first: since you have many installations in your bash script, it does make sense that installation would take a long time (note that the agent will only start running after all the installations are done).
So for the sake of debugging I'd suggest removing all the packages (other than the specific trains-agent version you're using) and trying again. Add those packages to the task you are trying to run instead, and you should see the instance much faster.
security_group_ids = ["<sec_group_id>"]
(note that I had a typo earlier - it's the id, not the name; don't want to misguide you!)
ColossalAnt7 can you try connecting to one of the trains-agent pods and running trains-agent manually using the following command:
TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version
Then let us know what happens and whether you see the new worker in the UI.
Probably something's wrong with the instance. Which AMI did you use? The default one?
By the way, are you editing the values directly? Why not use the values file?
Did you change anything under the agent's values?
In case you didn't, please try editing agent.clearmlWebHost and setting it to the value of your webserver (use the same one you used for the agent services). This might solve your issue.
I understand, but for some reason you are getting an error about the clearml webserver. Try changing the value of agent.clearmlWebHost in the values.yaml file to the same value you filled in manually for the agent-services web host.
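For reference, a minimal sketch of that part of the values.yaml (placeholder address; use the exact web host value you entered for agent-services):
agent:
  clearmlWebHost: "http://<your-webserver-address>"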
Can you try removing the port from the webhost?
Or - which api-server the UI is actually connecting to? 🙂
Great, let us know how it goes.
Have a great weekend!
Sure, ping me if it's still happening.
Hey JitteryCoyote63 !
Can you please update us what permissions did you end up using for the autoscaler?
Were the above enough?
Thanks!
Those are different credentials.
You should have the AWS info under cloud_credentials_key, cloud_credentials_secret and cloud_credentials_region.
The values added to extra_vm_bash_script are the trains key and secret from your profile page in the UI.
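Roughly, the autoscaler configuration should end up looking like this (a sketch with placeholder values; the export lines assume you're passing the trains credentials to the agent as environment variables):
cloud_credentials_key = "<aws-access-key-id>"
cloud_credentials_secret = "<aws-secret-access-key>"
cloud_credentials_region = "<aws-region>"
extra_vm_bash_script = "export TRAINS_API_ACCESS_KEY=<key-from-your-profile-page>; export TRAINS_API_SECRET_KEY=<secret-from-your-profile-page>"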
I suggest you use the wizard again to run the task, this will make sure all the data is where it should be.
You can try overriding the following in your values.yaml under the agent section:
agentVersion: "==0.16.2rc1"
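i.e. something along these lines (sketch only; keep the rest of your agent section as you have it):
agent:
  agentVersion: "==0.16.2rc1"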
Hey GreasyPenguin14 ,
The docker-compose.yml and this section specifically were updated.
So first please try again with the new version 🙂
Second, this error seems a bit odd. Which version of docker-compose are you using?
You can check this using: docker-compose --version