ShaggyHare67 notice that the services queue is designed to run CPU-based tasks (monitoring etc.). For the actual training you need to run your trains-agent on a GPU machine.
Did you run trains-agent init ? It will walk you through the configuration, git user/pass included.
If you want to manually add them, you can see an example of the configuration file in the link below.
You can find it on
More detailed instructions:
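In case it helps, the git credentials end up in the agent section of trains.conf; a rough sketch of the relevant fragment (exact key layout may vary by version, values here are placeholders):

```
agent {
    # Git credentials the agent uses when cloning your repository
    git_user: "my_git_user"
    git_pass: "my_git_password"
}
```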
Hi ShaggyHare67 ,
Yes, the trains.conf created by trains-agent is basically an extension of the trains one (specifically, it adds a section for the agent).
I'm assuming you are running the agent on the same development machine.
I guess the easiest is to rename the trains.conf to trains.conf.old and run
(No need to worry, the trains package supports it, so the new configuration file that will be generated will work just fine.)
So obviously that is the problem
ShaggyHare67 how come the "installed packages" are now empty? They should be automatically filled when executing locally.
Any chance someone mistakenly deleted them?
Regarding the python environment: trains-agent creates a new clean venv for every experiment. If you need, you can set in your
This will cause the newly created venv to inherit the packages from the system, meaning it should have the trains package if it is already installed.
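That setting lives under the agent's package_manager section of trains.conf; a rough sketch (key layout may vary by version):

```
agent {
    package_manager {
        # Let each per-experiment venv see packages installed system-wide
        system_site_packages: true
    }
}
```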
What do you think?
My screen is the same as this one (except in the available workers I only have trains-services)
is running my code but it is unable to import
What you are saying is you spin the 'trains-agent' inside a docker, but in venv mode?
On the server I have both python (2.7) and python3,
Hmm, make sure that you run the agent with python3 (i.e. python3 trains-agent ...); this way it will use python3 for the experiments.
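One quick way to double check which interpreter the experiment actually ends up with is to print it from the training script itself; a minimal stdlib-only sketch:

```python
import sys

def interpreter_info():
    """Return the path and major version of the interpreter running this script."""
    return sys.executable, sys.version_info.major

# Drop this at the top of the training script: if the agent scheduled it
# under python3, the major version printed here will be 3, not 2.
path, major = interpreter_info()
print(f"running under {path} (python {major})")
```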
Yes, this seems like the problem, you do not have an agent (trains-agent) connected to your server.
The agent is responsible for pulling the experiments and executing them.
pip install trains-agent
trains-agent init
trains-agent daemon --gpus all
AgitatedDove14 Yes, that's exactly what I have when I create the UniformParameterRange(), but it's still not found as a hyper parameter.
I am using the learning rate and the other parameters in the model when I train, by calling Adam(...) with all the Adam configs.
Wish I could've sent you the code but it's on another network not exposed to the public..
I'm completely lost
It's looking like this:
opt = Adam(**configs['training_configuration']['optimizer_params']['Adam'])
model.compile(optimizer=opt, ........more params......)
and at the beginning of the code I do
task.connect(configs['training_configuration'], name="Train") and I do see the right params under Train in the UI
later, in the hparams script, I do:
UniformParameterRange('Train/optimizer_params/Adam/learning_rate', ....the rest of the min max step params.....)
(with the rest of the code like in the example)
The thing is, on each of the drafts in the UI I do see it updating the right parameter under
Train/optimizer_params/Adam/learning_rate with the step and everything. But the script says it can't find the hyper parameter, and it also finishes really quickly, so I know it's not really doing anything.
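For what it's worth, the naming convention at play can be sketched in plain python: connecting a nested dict under name="Train" exposes it as '/'-separated parameter paths, which is why the HPO range has to use the exact string 'Train/optimizer_params/Adam/learning_rate'. The flatten below is illustrative only, not the trains implementation:

```python
def flatten(section, config, sep="/"):
    """Flatten a nested config dict into {'section/key/...': value} pairs,
    mimicking how a connected dict shows up as hyper-parameter names."""
    out = {}
    for key, value in config.items():
        path = section + sep + key
        if isinstance(value, dict):
            out.update(flatten(path, value, sep))
        else:
            out[path] = value
    return out

configs = {"optimizer_params": {"Adam": {"learning_rate": 0.001, "beta_1": 0.9}}}
params = flatten("Train", configs)
print(params["Train/optimizer_params/Adam/learning_rate"])  # 0.001
```

If any segment of that path differs from what the UI shows (section name, key casing), the optimizer will not find the parameter to mutate.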
ShaggyHare67 I'm just making sure I understand the setup:
1. First "manual" run of the base experiment: it creates an experiment in the system, and you see all the hyper parameters under the General section.
2. trains-agent is running on a machine.
3. The HPO example is executed with the above HP as optimization parameters.
4. HPO creates clones of the original experiment, with different configurations (verified in the UI).
5. trains-agent executes said experiments, and they are not completed. But it seems the parameters are not being changed.
Things to check:
Is Task.connect called before the dictionary is actually used? Just in case, do
print(configs['training_configuration']) after the Task.connect call, making sure the parameters were passed correctly.
What should have happened is the experiments should have been pending (i.e. in a queue)
(Not sure why they are not).
You can manually send them for execution: right click on an experiment in the table, select "Enqueue", and select the default queue (this is the one trains-agent will pull from, by default).
I manually sent one to the queue; it started running but failed, and apparently trains can't access my git repository.
I tried docker-compose -f down, doing
export TRAINS_AGENT_GIT_USER=(my_user)
export TRAINS_AGENT_GIT_PASS=(my_pass)
and then docker-compose -f up, but I get the same error
Now it's running and I do see the gpu:gpuall in the available workers, but running the script still produces the "Could not find request hyper-parameters..." error.
And also the optimizers are still in draft (except the main one which was created for them).
AgitatedDove14 So I managed to get trains-agent to access the git but now I'm facing another issue:
trains-server is running on a remote server (which I ssh to); on that server I have my own docker, which is where I write the code, and also on this docker I do
trains-agent is running my code but it is unable to import trains inside the code (and potentially more packages).
On the server I have both python (2.7) and python3; maybe it is automatically running the python command (and not python3), so it doesn't have the package?
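A quick stdlib-only way to check, from inside the experiment's script, whether the venv the agent built can actually see the package (has_package is a hypothetical helper, not part of trains):

```python
import importlib.util

def has_package(name):
    """Return True if 'name' is importable in the current environment."""
    return importlib.util.find_spec(name) is not None

# Run inside the failing experiment (or the agent's venv) to see whether
# 'trains' is actually visible there:
print("trains importable:", has_package("trains"))
```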
Also, will the code run by trains be executed on the server (where trains-server is running) or inside my docker (where trains-agent is running)?
Note that I don't have root access on the server (only in my docker), so changing stuff like re-installing etc. is not possible.
AgitatedDove14 I run the docker-compose up for trains-server on my server.
On my server I run my own docker (that contains all my code and my packages) and there I also run the trains-agent daemon --gpus all command.
How can I make trains-agent run the python that I normally run? (located in
I tried editing the trains-agent conf and changed python_binary=/home/user/miniconda/envs36/bin/python but it didn't solve it.
I also tried editing package_manager.system_site_packages=true, which didn't work.
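For reference, both settings mentioned above sit in the agent section of trains.conf; a rough sketch of how they are usually laid out (the interpreter path is the one from the message above, and the exact key layout may vary by version):

```
agent {
    # Interpreter the agent uses when building experiment venvs
    python_binary: "/home/user/miniconda/envs36/bin/python"

    package_manager {
        # Let the experiment venv inherit system-installed packages
        system_site_packages: true
    }
}
```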
AgitatedDove14 Sadly I can not send the console log because it is on a different computer (and on a different, closed network). But in the log it is able to clone the repository and execute the right py file, and then it crashes on the line where I import trains.
The experiment has no packages identified under installed packages. So obviously that is the problem, but as I've stated in my previous comment I am trying to link it to run on /home/user/miniconda/envs36/bin/python, or am I missing something?
Sadly that didn't do the trick. I wonder why I don't have the installed packages?
I guarantee no one has deleted them, and it's a bit weird since I can run my code normally; it's just that trains doesn't discover them for some reason.
AgitatedDove14 Quick update: apparently the base (template) code we ran (with the main model) ~2 weeks to 1 month ago did show installed packages, but now it doesn't. Nothing changed in the trains settings / trains-server settings, so I wonder what could cause that?
ShaggyHare67 could you send the console log trains-agent outputs when you run it?
Do you have the package "trains" listed under "installed packages" in your experiment?
Regarding step #5, I'm not sure how to check it. What I see in the UI are 5 drafts (concurrent_tasks is set to 5) and the "main" task in charge of them; there are clones of the original base experiment with different configurations (although they're not really clones: only the configs are cloned, while the artifacts, output model and results aren't).
And for the things to check: yup, it's like that, and still the same error.