Thanks, I will upgrade my instance type and then add more workers. Where do I need to configure that?
Hi AppetizingMouse58, I had around 200GB when I started the migration; now I have 169GB.
And yes, it looks like it is growing: it was 9.4GB and is now 9.5GB.
I am running trains-server on AWS with your AMI (instance type t3.large).
The server runs very well and works amazingly!
That is, until we started running more training jobs in parallel (around 20).
Then the UI becomes very slow and often times out.
Can upgrading the instance type help here, or is there some limit on parallel runs?
For now we are using AWS Batch to run those experiments,
because this way we don't have to keep machines idle waiting for jobs.
Thanks!! you are the best..
I will give it a try when the runs finish.
Thanks. I am basing my Docker image on https://github.com/facebookresearch/detectron2/blob/master/docker/Dockerfile
Hi SuccessfulKoala55, thanks for the reply..
So for now, if I'd like to upgrade to the latest trains-server,
but on another machine, and keep all the data,
what is the best practice?
Thanks again 🙂
I did it just because FAIR did it in the detectron2 Dockerfile.
Hi AgitatedDove14,
Sorry for the late response, it was late in my country 🙂.
This is what I am getting:
appuser@219886f802f0:~$ sudo su root
root@219886f802f0:/home/appuser# whoami
root
So for now I am leaving this issue...
Thanks a lot 🙏 🙌
OK, it looks like it is starting the training...
Thanks 💯
So I asked my boss and DevOps, and they say that for now we can use the root
user inside the Docker image...
Thanks, I will make sure that all the Python packages are installed as root..
I will let you know if it works.
SuccessfulKoala55 Thanks 🙏 I will give it a try tomorrow 🙂
I tried without yaml.dump(my_params_dict),
will try with it..
So the file was not the same as the one connect_configuration uploaded.
Thanks
I updated to the new version 0.16.1 a few weeks ago, and it worked using elastic_upgrade.py.
Yes, it looks like this.. I just wanted to understand whether it should be this slow, or whether I did something wrong.
AgitatedDove14 Maybe I need to change something in apiserver.conf
to increase the number of workers?
` [2021-01-24 17:02:25,660] [8] [INFO] [trains.service_repo] Returned 200 for queues.get_all in 2ms
[2021-01-24 17:02:25,674] [8] [INFO] [trains.service_repo] Returned 200 for queues.get_next_task in 8ms
[2021-01-24 17:02:26,696] [8] [INFO] [trains.service_repo] Returned 200 for events.add_batch in 36ms
[2021-01-24 17:02:26,742] [8] [INFO] [trains.service_repo] Returned 200 for events.add_batch in 78ms
[2021-01-24 17:02:27,169] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_al...
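To judge whether the API server itself is the slow part, the response times in log lines like the ones above can be summarized per endpoint. This is only a minimal sketch using the Python standard library, not something shipped with trains-server: the names `LINE_RE` and `summarize` are hypothetical, and the regex assumes the log format stays exactly as shown in the lines above.

```python
import re
from collections import defaultdict

# Assumed format, taken from the log lines above:
# [2021-01-24 17:02:25,660] [8] [INFO] [trains.service_repo] Returned 200 for queues.get_all in 2ms
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[\d+\] \[\w+\] \[trains\.service_repo\] "
    r"Returned (?P<status>\d+) for (?P<endpoint>[\w.]+) in (?P<ms>\d+)ms"
)


def summarize(lines):
    """Return {endpoint: (count, max_ms, mean_ms)}; non-matching lines are skipped."""
    durations = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            durations[match.group("endpoint")].append(int(match.group("ms")))
    return {
        endpoint: (len(values), max(values), sum(values) / len(values))
        for endpoint, values in durations.items()
    }


# The four complete log lines quoted above.
sample = [
    "[2021-01-24 17:02:25,660] [8] [INFO] [trains.service_repo] Returned 200 for queues.get_all in 2ms",
    "[2021-01-24 17:02:25,674] [8] [INFO] [trains.service_repo] Returned 200 for queues.get_next_task in 8ms",
    "[2021-01-24 17:02:26,696] [8] [INFO] [trains.service_repo] Returned 200 for events.add_batch in 36ms",
    "[2021-01-24 17:02:26,742] [8] [INFO] [trains.service_repo] Returned 200 for events.add_batch in 78ms",
]

for endpoint, (count, worst, mean) in sorted(summarize(sample).items()):
    print(f"{endpoint}: calls={count} max={worst}ms mean={mean:.1f}ms")
```

If the per-call times stay in the tens of milliseconds, as they do in this sample, the slowdown is more likely to come from requests waiting for a free worker than from any single slow call.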
SuccessfulKoala55 it is still stuck on the same line.. should it be like this?