I manually sent one to the queue; it started running but failed, and apparently trains can't access my git repository
I tried docker-compose -f down, then export TRAINS_AGENT_GIT_USER=(my_user) and export TRAINS_AGENT_GIT_PASS=(my_pass),
and then docker-compose -f up but I get the same error
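One thing worth ruling out here is whether the exported variables are visible to the process that needs them, since environment variables are only inherited by child processes of the shell that exported them. A minimal sketch, assuming the two variable names from the message above; missing_git_credentials is a hypothetical helper, not part of trains-agent:

```python
import os

# Variable names taken from the message above.
REQUIRED_VARS = ("TRAINS_AGENT_GIT_USER", "TRAINS_AGENT_GIT_PASS")


def missing_git_credentials(env=None):
    """Return the names of credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_git_credentials()
    if missing:
        print("Not visible in this process:", ", ".join(missing))
    else:
        print("Both git credential variables are set")
```

Running it from the same shell (or inside the same container) that starts the agent shows what that process would actually inherit.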
ShaggyHare67 are you saying the problem is that trains fails to discover the packages in the manual execution?
Go to the Workers & Queues page (right side panel, 3rd icon from the top)
It's supposed to be running, how can I double check?
(I'm using my own Trains server)
ShaggyHare67
Now the trains-agent is running my code but it is unable to import trains ...
So what you are saying is that you spin the 'trains-agent' inside a docker, but in venv mode?
On the server I have both python (2.7) and python3,
Hmm make sure that you run the agent with python3 trains-agent
this way it will use the python3 for the experiments
More detailed instructions:
https://github.com/allegroai/trains-agent#installing-the-trains-agent
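To double check which interpreter is actually running a script, and whether trains is importable from it, something like the following can be run by hand or dropped at the top of the experiment script. It is a minimal sketch that should behave the same under python 2.7 and python3; interpreter_report and can_import are hypothetical helper names, not part of trains:

```python
import sys


def can_import(name):
    """Return True if the module can be imported by this interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


def interpreter_report(module_names=("trains",)):
    """Describe the running interpreter and which modules it can import."""
    return {
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "modules": {name: can_import(name) for name in module_names},
    }


if __name__ == "__main__":
    report = interpreter_report()
    print("executable:", report["executable"])
    print("version:", report["version"])
    for name, ok in report["modules"].items():
        print(name, "importable" if ok else "NOT importable")
```

If the reported executable is not the interpreter you expect, the missing import is most likely an environment issue rather than a problem in the code.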
Things to check:
- Task.connect is called before the dictionary is actually used. Just in case, do configs['training_configuration'] = Task.connect(configs['training_configuration'])
- Add print(configs['training_configuration']) after the Task.connect call, to make sure the parameters were passed correctly
ShaggyHare67 in the HPO the learning rate should be (based on the above): General/training_config/optimizer_params/Adam/learning_rate
Notice the "General" prefix (notice it is case sensitive)
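To make the naming convention concrete, here is a small sketch of how a nested dictionary maps to slash-separated parameter names with a section prefix. It only mirrors the convention quoted above and is not the trains implementation; flatten_params is a hypothetical helper, the sketch assumes the whole dictionary (with a training_config key) was connected, and the values are illustrative:

```python
def flatten_params(config, section="General", sep="/"):
    """Map a nested dict to {'Section/key/subkey': value} entries."""
    flat = {}

    def walk(node, prefix):
        for key, value in node.items():
            path = prefix + sep + str(key)
            if isinstance(value, dict):
                walk(value, path)
            else:
                flat[path] = value

    walk(config, section)
    return flat


# Illustrative values only.
configs = {
    "training_config": {
        "optimizer_params": {"Adam": {"learning_rate": 0.1, "decay": 0}}
    }
}
names = flatten_params(configs)
wanted = "General/training_config/optimizer_params/Adam/learning_rate"
print(wanted in names)          # True
print(wanted.lower() in names)  # False: the lookup is case sensitive
```

Printing the flattened names next to the names passed to the optimizer makes a wrong prefix or a case mismatch easy to spot.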
I see in the UI are 5 drafts
What's the status of these 5 experiments? draft ?
ShaggyHare67 I'm just making sure I understand the setup:
1. First "manual" run of the base experiment. It creates an experiment in the system, and you see all the hyper parameters under the General section.
2. trains-agent is running on a machine.
3. The HPO example is executed with the above HP as optimization parameters.
4. HPO creates clones of the original experiment, with different configurations (verified in the UI).
5. trains-agent executes said experiments, and they are not completed.
But it seems the parameters are not being changed.
Correct?
Now it's running and I do see the gpu:gpuall in the available workers, running the script still produces the "Could not find request hyper-parameters..."
And also the optimizers are still on draft (except the main one which was created for them)
AgitatedDove14 Quick update: Apparently the base (template) code we ran (with the main model) somewhere between 2 weeks and 1 month ago did show installed packages, but now it doesn't. Nothing changed in the trains settings / trains-server settings, so I wonder what could cause that?
AgitatedDove14 When I ran trains-agent init it said there's already an init file, and when I open it, it begins with # TRAINS SDK configuration file and looks a little different from the config file you sent. How should I handle this?
(BTW: draft means they are in edit mode, i.e. before execution, then they should be queued (i.e. pending) then running then completed)
Yes, this seems like the problem, you do not have an agent (trains-agent) connected to your server.
The agent is responsible for pulling the experiments and executing them.
pip install trains-agent
trains-agent init
trains-agent daemon --gpus all
I also tried editing package_manager.system_site_packages=true which didn't work
Sadly that didn't do the trick, I wonder how come I don't have the installed packages?
I guarantee no one has deleted them, and it's a bit weird since I can run my code normally, it's just that trains doesn't discover them for some reason.
AgitatedDove14 I run docker-compose up for trains-server on my server.
On my server I run my own docker (that contains all my code and my packages), and that is also where I run the trains-agent daemon --gpus all command.
How can I make the trains-agent run the python that I normally run (located in /home/user/miniconda/envs/36/bin/python)?
I tried editing the trains-agent conf and changed python_binary=/home/user/miniconda/envs36/bin/python but it didn't solve it.
I also tried editing package_manager.system_site_packages=true which didn't work
AgitatedDove14 So I managed to get trains-agent to access the git but now I'm facing another issue:
The trains-server is running on a remote server (which I ssh to). On that server I have my own docker, which is where I write the code, and I also run the trains-agent commands in this docker.
Now the trains-agent is running my code but it is unable to import trains inside the code (and potentially more packages).
Any idea?
On the server I have both python (2.7) and python3, maybe it is automatically running the python command (and not python3) so it doesn't have the package?
Also, will the code run by trains be executed on the server (where trains-server is running) or inside my docker (where trains-agent is running)?
Note that I can't have access to root on the server (only in my docker), so changing stuff like re-installing, etc, is not possible
What should have happened is the experiments should have been pending (i.e. in a queue)
(Not sure why they are not).
You can manually send them for execution: right click on an experiment in the table, select enqueue and select the default queue (this is the one the trains-agent pulls from by default)
ShaggyHare67 notice that the services queue is designed to run CPU based tasks like monitoring etc.
For the actual training you need to run your trains-agent on a GPU machine.
Did you run trains-agent init? It will walk you through the configuration (git user/pass included).
If you want to manually add them, you can see an example of the configuration file in the link below.
You can find it at ~/trains.conf
https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf
So obviously that is the problem
Correct.
ShaggyHare67 how come the "installed packages" are now empty ?
They should be automatically filled when executing locally?!
Any chance someone mistakenly deleted them?
Regarding the python environment, trains-agent creates a new clean venv for every experiment. If you need to, you can set in your trains.conf:
agent.package_manager.system_site_packages: true
https://github.com/allegroai/trains-agent/blob/de332b9e6b66a2e7c6736d12614de9870eff48bc/docs/trains.conf#L55
This will cause the newly created venv to inherit the packages from the system, meaning it should have the trains package if it is already installed.
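As an illustration of what "inheriting the packages from the system" means, the sketch below builds two throwaway environments with Python's standard-library venv module and reads back the flag recorded in pyvenv.cfg. This is only an analogy for the setting above, and it is an assumption that the agent's own environment creation behaves similarly; create_env and venv_inherits_system_packages are hypothetical helper names:

```python
import os
import tempfile
import venv


def create_env(env_dir, system_site_packages):
    """Create a bare virtual environment (no pip) in env_dir."""
    builder = venv.EnvBuilder(
        system_site_packages=system_site_packages, with_pip=False
    )
    builder.create(env_dir)


def venv_inherits_system_packages(env_dir):
    """Read pyvenv.cfg and report whether system site-packages are visible."""
    with open(os.path.join(env_dir, "pyvenv.cfg")) as cfg:
        for line in cfg:
            key, _, value = line.partition("=")
            if key.strip() == "include-system-site-packages":
                return value.strip().lower() == "true"
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for flag in (False, True):
            env_dir = os.path.join(tmp, "env_{}".format(flag))
            create_env(env_dir, flag)
            print(flag, venv_inherits_system_packages(env_dir))
```

An environment created with the flag off cannot see packages installed for the base interpreter, which matches the "clean venv" behaviour described above.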
What do you think?
AgitatedDove14 Yes that's exactly what I have when I create the UniformParameterRange() but it's still not found as a hyper parameter
I am using the learning rate and the other parameters in the model when I train, by calling keras.optimizers.Adam(...) with all the Adam configs
Wish I could've sent you the code but it's on another network not exposed to the public..
I'm completely lost
Edit:
It's looking like this:
opt = Adam(**configs['training_configuration']['optimizer_params']['Adam'])
model.compile(optimizer=opt, ........more params......)
Configs:
....more params....
training_configuration:
    optimizer_params:
        Adam:
            learning_rate: 0.1
            decay: 0
.....more params....
and at the beginning of the code I do task.connect(configs['training_configuration'], name="Train")
and I do see the right params under Train in the UI
later, in the hparams script, I do: UniformParameterRange('Train/optimizer_params/Adam/learning_rate', ....the rest of the min max step params.....)
(with the rest of the code like in the example)
The thing is, on each of the drafts in the UI, I do see it's updating the right parameter under Train/optimizer_params/Adam/learning_rate with the step and everything. But the script says it can't find the hyper parameter, and it also finishes really quickly, so I know it's not really doing anything
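Conceptually, each trial amounts to taking the connected dictionary and replacing one value addressed by its slash-separated name. The sketch below shows that idea in plain Python so the path and the section prefix can be checked by hand. It is not how trains applies overrides internally; apply_override is a hypothetical helper and the values are the ones from the config excerpt above:

```python
import copy


def apply_override(config, name, value, section="Train", sep="/"):
    """Return a copy of config with the parameter addressed by name replaced.

    Raises KeyError if the section prefix or any key on the path is missing.
    """
    parts = name.split(sep)
    if len(parts) < 2 or parts[0] != section:
        raise KeyError("expected a name starting with {!r}".format(section + sep))
    updated = copy.deepcopy(config)
    node = updated
    for key in parts[1:-1]:
        node = node[key]
    if parts[-1] not in node:
        raise KeyError(parts[-1])
    node[parts[-1]] = value
    return updated


training_configuration = {
    "optimizer_params": {"Adam": {"learning_rate": 0.1, "decay": 0}}
}
trial = apply_override(
    training_configuration, "Train/optimizer_params/Adam/learning_rate", 0.01
)
print(trial["optimizer_params"]["Adam"])  # {'learning_rate': 0.01, 'decay': 0}
print(training_configuration["optimizer_params"]["Adam"]["learning_rate"])  # 0.1
```

The override only has an effect if the training code builds the optimizer from the dictionary after it has been connected, as in the Adam(**configs[...]) line above.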
Hi ShaggyHare67 ,
Yes, the trains.conf created by trains-agent is basically an extension of the one trains uses (specifically it adds a section for the agent)
I'm assuming you are running the agent on the same development machine.
I guess the easiest is to rename the trains.conf to trains.conf.old and run trains-agent init
(No need to worry, the trains package supports it, so the new configuration file that will be generated will work just fine)
Regarding step #5 I'm not sure how to check it. What I see in the UI are 5 drafts (concurrent_tasks is set to 5) and the "main" task in charge of them, and there are clones of the original base experiment with different configurations (although they're not really clones, only the configs are cloned; the artifacts, output model and results aren't cloned)
And for the things to check - Yup it's like that and still the same error
AgitatedDove14 Sadly I cannot send the console log because it is on a different computer (and on a different, closed network). But in the log it is able to clone the repository and execute the right py file, and then it crashes on the line where I import trains.
The experiment has no packages identified under installed packages. So obviously that is the problem, but as I've stated in my previous comment I am trying to link it to run on /home/user/miniconda/envs36/bin/python, or am I missing something?
ShaggyHare67 could you send the console log that trains-agent outputs when you run it?
Now the trains-agent is running my code but it is unable to import trains
Do you have the package "trains" listed under "installed packages" in your experiment?
Which panel?
https://demoapp.trains.allegro.ai/workers-and-queues/workers
My screen is the same as this one (except in the available workers I only have trains-services)