OHH nice, I thought it was just some kind of job queue for machines that are already up and running
Hi SuccessfulKoala55, thanks for the reply.
So for now, if I'd like to upgrade to the latest trains-server but on another machine and keep all the data,
what is the best practice?
Thanks again 🙂
For now we are using AWS Batch for running those experiments,
because this way we don't have to keep machines up and waiting for jobs.
My docker image already has my project on it, so I know where to mount. Maybe the agent moves/creates a copy of my project somewhere else?
SuccessfulKoala55 Thanks 🙏 I will give it a try tomorrow 🙂
The hang is still happening in trains==0.15.2rc0
Thanks, I just want to avoid giving the credentials to every user.
If it's not possible, I will do it.
I tried your solution, but since my path points to a YAML file,
and task.set_configuration_object(name=name, config_text=my_params)
uploads it in a different format than task.connect_configuration(path, name=name),
it's not working for me 😞
(even when I am using config_type='yaml')
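For reference, this is roughly what I am comparing (a minimal sketch; the project/task names and YAML path are placeholders):
```python
from pathlib import Path
from trains import Task

task = Task.init(project_name="my_project", task_name="yaml_config_test")

# Variant 1: read the YAML file myself and upload the raw text;
# config_type='yaml' should tag it so the UI treats it as YAML
my_params = Path("configs/params.yaml").read_text()
task.set_configuration_object(name="my_config", config_text=my_params, config_type="yaml")

# Variant 2: let the task track the file directly (the format I expected)
task.connect_configuration("configs/params.yaml", name="my_config")
```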
Hey... Thanks for checking with me.
I didn't have time yet but will check it and let you know..
I am trying to reproduce it with a little example
AgitatedDove14 Hi, sorry for the long delay.
I tried to use 0.16 instead of 0.13.1.
I didn't have time to debug it (I am overwhelmed with work right now).
But it doesn't work the same as 0.13.1. I am still getting some hanging in my eval process.
I don't know if it's just slower or really stuck, since I killed it and moved back to 0.13.1 until my busy period passes.
Thanks
I am running trains-server on AWS with your AMI (instance type t3.large)
The server runs very well, and works amazingly!
Until we started running more trainings in parallel (around 20).
Then the UI starts to be very slow and we often get timeouts.
Can upgrading the instance type help here? Or is there some limit on parallel runs?
Not a very good one; they just installed everything under the user and used --user for pip.
It really does not matter inside a docker; the only reason one might want to do that is if you are mounting other drives and want to make sure they are not accessed as the "root" user, but with user id 1000.
This sounds like a good reason haha 😄
Let me check if we can hack something...
Thanks 🙏
Hi AgitatedDove14, thanks for the answer.
I think the upload progress reporting (for files over 5MB) was added after version 0.17.
That's what I thought...
I think it could be helpful to add it to the conf, since 5MB is really small and my files are ~300MB, meaning 60 messages for each upload.
Another option is maybe to configure it as a Task.init() parameter.
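For illustration, the conf option could look something like this (the key name below is hypothetical, only sketching the suggestion):
```
# trains.conf -- hypothetical key, just illustrating the suggestion above
sdk {
    storage {
        log {
            # report upload progress every 100MB instead of every 5MB
            report_upload_chunk_size_mb: 100
        }
    }
}
```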
I think both are OK 🙂
Hey, I tested it and it looks like it works, but it still takes a lot of time (mainly in the second run of the code; it's part of my eval process)
Yes this is what we are doing 👍
Hi AppetizingMouse58, I had around 200GB when I started the migration; now I have 169GB.
And yes, it looks like it is growing: it was 9.4GB and now it's 9.5GB.
SuccessfulKoala55 and AppetizingMouse58, thank you very much!!
I have a follow-up question:
Could this fix cause any harm in a future clearml-server upgrade?
Or what is the best practice for upgrading after doing it?
Sure, I'd love to do it when I have more time 🙂
Is it possible to know in advance where the Agent will clone the code?
Or to run a link command just before the execution of the code?
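For example, something like this at the top of the task script, instead of needing the clone path in advance (a minimal sketch; the mounted drive path is a placeholder):
```python
from pathlib import Path

# The script itself runs from wherever the agent cloned the repo,
# so the repo root can be resolved at runtime instead of in advance.
repo_root = Path(__file__).resolve().parent

# Link the mounted data drive into the cloned copy of the project
target = Path("/mnt/data")   # placeholder for the mounted drive
link = repo_root / "data"
if not link.exists():
    link.symlink_to(target)
```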