AgitatedDove14 When I did trains-agent init
it says there's already a config file, and when I open it, it begins with # TRAINS SDK configuration file and looks a little different from the config file you sent. How should I play this?
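(As far as I understand, trains and trains-agent share the same ~/trains.conf, and trains-agent init reuses the existing SDK file and adds agent settings to it rather than writing a separate file. A rough sketch of the layout, with hypothetical values:)

```
# ~/trains.conf -- shared by the trains SDK and trains-agent (sketch, not a full file)
api {
    web_server: "http://localhost:8080"    # server addresses here are hypothetical
    api_server: "http://localhost:8008"
}
sdk {
    # TRAINS SDK configuration section
}
agent {
    # settings read by trains-agent (filled in by trains-agent init)
}
```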
AgitatedDove14 Quick update: Apparently the base (template) code we ran (with the main model) about 2 weeks to a month ago did show installed packages, but now it doesn't. Nothing changed in the trains settings or the trains-server settings, so I wonder what could cause that?
AgitatedDove14 I run docker-compose up for trains-server on my server.
On that server I run my own docker container (which contains all my code and packages), and inside it I also run the trains-agent daemon --gpus all command.
How can I make trains-agent run the python that I normally run (located in /home/user/miniconda/envs/36/bin/python)?
I tried editing the trains-agent conf and changed python_binary=/home/user/miniconda/envs36/bin/python but it didn't...
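(One thing worth double-checking, in case it explains the "but it didn't...": in the agent's conf that key lives under the agent section, so a bare top-level python_binary=... line may not be picked up. A sketch of the fragment, using the interpreter path from the message above:)

```
# trains.conf fragment (sketch) -- point trains-agent at a specific interpreter
agent {
    python_binary: "/home/user/miniconda/envs/36/bin/python"
}
```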
Which panel?
https://demoapp.trains.allegro.ai/workers-and-queues/workers
My screen looks the same as that one (except that under available workers I only have trains-services)
AgitatedDove14 Sadly I cannot send the console log because it is on a different computer (and on a different, closed network). But according to the log it manages to clone the repository and execute the right .py file, and then it crashes on the line where I import trains.
The experiment has no packages identified under installed packages, so obviously that is the problem, but as I stated in my previous comment I am trying to link it to run on /home/user/miniconda/envs36/bin/python, or am I missin...
AgitatedDove14 So I managed to get trains-agent to access the git repository, but now I'm facing another issue:
The trains-server is running on a remote server (which I ssh into); on that server I have my own docker container, which is where I write the code, and it is also in this container that I run the trains-agent commands.
Now the trains-agent is running my code, but it is unable to import trains inside the code (and potentially more packages).
Any idea?
On the server I have both python (2.7) and python3,...
I manually sent one to the queue; it started running but failed, and apparently trains can't access my git repository.
I tried docker-compose -f down, doing export TRAINS_AGENT_GIT_USER=(my_user) export TRAINS_AGENT_GIT_PASS=(my_pass) and then docker-compose -f up, but I get the same error
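(A note on where those variables land: TRAINS_AGENT_GIT_USER / TRAINS_AGENT_GIT_PASS are read by the trains-agent process itself, so exporting them and restarting the server's docker-compose would not hand them to an agent running elsewhere. An alternative is putting the credentials in the agent's conf; a sketch with placeholder values:)

```
# trains.conf fragment (sketch) -- git credentials for trains-agent
agent {
    git_user: "my_user"    # placeholders, not real credentials
    git_pass: "my_pass"
}
```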
I also tried setting package_manager.system_site_packages=true, which didn't work
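(In case the nesting matters: that flag sits under the agent section in trains.conf, so the full path is agent.package_manager.system_site_packages. A sketch:)

```
# trains.conf fragment (sketch) -- let the agent's virtualenv see system site-packages
agent {
    package_manager {
        system_site_packages: true
    }
}
```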
Sadly that didn't do the trick. I wonder why I don't have the installed packages? I guarantee no one has deleted them, and it's a bit weird since I can run my code normally; it's just that trains doesn't discover the packages for some reason.
AgitatedDove14 Yes, that's exactly what I have when I create the UniformParameterRange(), but it's still not found as a hyper-parameter.
I am using the learning rate and the other parameters in the model when I train, by calling keras.optimizers.Adam(...) with all the Adam configs.
Wish I could've sent you the code, but it's on another network not exposed to the public..
I'm completely lost
Edit:
It looks like this:
` opt = Adam(**configs['training_configuration']['optim...
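(One way to picture the "not found as a hyper-parameter" symptom: the optimizer resolves each requested range by exact name against the parameters the base task exposes, and a config dict that is only read locally, without being connected to the task, exposes nothing to match. A toy sketch of that lookup, with hypothetical parameter names and no trains server involved:)

```python
# Toy illustration only; the parameter names below are hypothetical.
# This dict stands in for the parameters a base task would expose after
# connecting its config dict; if it was never connected, the dict is empty.
base_task_parameters = {"lr": 0.001, "beta_1": 0.9}

def has_hyper_parameter(name: str, params: dict) -> bool:
    # Exact-string match, mirroring how a requested range name is resolved.
    return name in params

print(has_hyper_parameter("lr", base_task_parameters))             # True
print(has_hyper_parameter("learning_rate", base_task_parameters))  # False: name mismatch
print(has_hyper_parameter("lr", {}))                               # False: nothing exposed
```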
It's supposed to be running; how can I double-check?
(I'm using my own Trains server)
Now it's running and I do see gpu:gpuall in the available workers, but running the script still produces the "Could not find requested hyper-parameters..." error.
Also, the optimizer tasks are still in draft (except the main one which was created for them).
Regarding step #5, I'm not sure how to check it. What I see in the UI are 5 drafts (concurrent_tasks is set to 5) and the "main" task in charge of them, plus clones of the original base experiment with different configurations (although they're not really clones: only the configs are cloned; the artifacts, output model and results aren't).
And for the things to check: yup, it's like that and still the same error