I am running trains-server on AWS with your AMI (instance type t3.large).
The server runs very well and works amazingly!
That is, until we start running more trainings in parallel (around 20).
Then the UI becomes very slow and often gets timeouts.
Can upgrading the instance type help here? Or is there some limit on parallel runs?
I reproduced the hang with this code...
But for now only with my env; when I tried to create a new env with only the packages this code needs, it doesn't get stuck.
So maybe the problem is a conflict between packages?
SuccessfulKoala55 it still gets stuck on the same line... is it supposed to be like this?
Yes, it looks like this... I just wanted to understand whether it should be this slow, or whether I did something wrong.
So I asked my boss and DevOps, and they said for now we can use the root user inside the docker image...
WOW.. Thanks 💯
I tried your solution, but since my path points to a YAML file,
and task.set_configuration_object(name=name, config_text=my_params) uploads it in a different format than task.connect_configuration(path, name=name), it is not working for me 😞
(even when I am using config_type='yaml')
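For context, roughly what I am trying to do (a minimal sketch; trains 0.16+ assumed, and the file name config.yaml and configuration name my_config are just placeholders): upload the raw YAML text so the stored object keeps the same format that connect_configuration(path, name=name) would have produced.
```python
# Minimal sketch (trains >= 0.16 assumed; "config.yaml" and "my_config" are placeholders).
# The raw YAML text is uploaded as-is, so the stored object has the same format
# that task.connect_configuration(path, name=name) would have created.
from trains import Task  # the package is called "clearml" in newer releases

task = Task.init(project_name="examples", task_name="yaml config upload")

with open("config.yaml") as f:
    yaml_text = f.read()  # keep the original YAML formatting, do not parse it

task.set_configuration_object(name="my_config", config_text=yaml_text, config_type="yaml")
```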
Ok, looks like it is starting the training...
Thanks 💯
Not a very good one; they just installed everything under the user and used --user for pip.
It really does not matter inside a docker container; the only reason one might want to do that is if you are mounting other drives and you want to make sure they are not accessed as the "root" user, but with user id 1000.
This sounds like a good reason haha 😄
Let me check if we can hack something...
Thanks 🙏
From the UI it will, since it gets the temp file from there.
I mean from the code (let's say remotely).
I have one computer with 4 GPUs and would like to create a queue over the GPUs...
For now the project runs without a queue.
My configs hold relative paths to the data (and it can take time to change all of them), so I prefer to work with relative paths if possible...
Sure, I'd love to do it when I have more time 🙂
I haven't had time to debug it yet... I will update when I have more time...
Thanks 🙏
Thanks, I will upgrade the server for now and let you know.
I haven't tried trains-agent yet; does it support using AWS Batch?
For now we are using AWS Batch to run those experiments,
because this way we don't have to keep machines waiting for the jobs.
Hi SuccessfulKoala55 and AgitatedDove14,
Thanks for the quick reply.
I'm not sure I understand your use-case - do you always want to change the contents of the file in your code? Why not change it before connecting?
Changing the file before connecting only makes sense when I am running locally and the file exists. Remotely, I must first get the file with connect_configuration(path, name=name) before reading it.
"local_path" is ignored, path is a temp file, and the c...
Thanks!! You are the best...
I will give it a try when the runs finish.
Hi CumbersomeCormorant74,
This is a server we installed.
The server version is 0.17.
We checked with Chrome and Firefox.
Thanks, ophir
Hi SuccessfulKoala55, thanks for the reply...
So for now, if I'd like to upgrade to the latest trains-server but on another machine and keep all the data,
what is the best practice?
Thanks again 🙂
Yes, this is what we are doing 👍