ShaggyHare67 I'm just making sure I understand the setup:
First "manual" run of the base experiment. It creates an experiment in the system, you see all the hyper parameters under General section. trains-agent
running on a machine HPO example is executed with the above HP as optimization paamateres HPO creates clones of the original experiment, with different configurations (verified in the UI) trains-agent executes said experiments, aand they are not completed.But it seems the paramete...
FlatOctopus65
In my local environment
pipeline_package
is installed in development mode
In order to install the package you need to specify the git repo of the package, this is how the pipeline would know where to bring it from.
Either install it locally with `pip install git+https://github.com/...` or add it to the packages
argument of the Pipeline wrapper: packages=["git+https://github.com/..."]
wdyt?
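For example, a minimal sketch assuming the PipelineDecorator interface (the repo URL and package/function names are placeholders):
```python
from clearml import PipelineDecorator

# `packages` tells the agent to pip-install the package straight from git
# when it builds the step's environment (URL below is a placeholder)
@PipelineDecorator.component(
    return_values=["result"],
    packages=["git+https://github.com/your-org/pipeline_package.git"],
)
def step_one():
    from pipeline_package import do_something  # hypothetical module/function
    return do_something()
```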
Hi ShaggyHare67 ,
Yes, the trains.conf created by trains-agent
is basically an extension of the trains configuration
(specifically, it adds a section for the agent)
I'm assuming you are running the agent on the same development machine.
I guess the easiest is to rename the trains.conf to trains.conf.old and run trains-agent init
(No need to worry, the trains package supports it, so the new configuration file that will be generated will work just fine)
(fyi: once we have a solid idea here, please open a GitHub issue for the feature request, I'll try to see if we can push it forward for the next RC)
ConfusedPig65 could you send the full log (console) of this execution?
@<1564422644407734272:profile|DistressedCoyote60> could you open a GitHub issue on it in clearml-agent, just so we know of the problem and fix it for next version ?
You can always log it manually:
from clearml import InputModel
input_model = InputModel.import_model(weights_url='/tmp/keras_example/weight.6.hdf5')
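And if you also want it registered as the task's input model, a sketch along these lines should work (project/task names are placeholders):
```python
from clearml import Task, InputModel

task = Task.init(project_name="examples", task_name="manual model logging")  # placeholder names

# import the local weights file and register it as this task's input model
input_model = InputModel.import_model(weights_url='/tmp/keras_example/weight.6.hdf5')
task.connect(input_model)
```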
Hi WickedGoat98
"Failed uploading to //:8081/files_server:"
Seems like the problem. What do you have defined as files_server in the trains.conf?
So I have a task that just loads a model, but I don't see it as an artifact in the UI
You should see it under Artifacts, Input model if you are calling Keras load function (or similar)
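As a rough sketch of what I mean (assuming automatic framework logging is enabled; the weights path is just an example):
```python
from clearml import Task
from tensorflow import keras

task = Task.init(project_name="examples", task_name="load keras model")  # placeholder names

# calling the Keras load function after Task.init should be picked up
# automatically and show up under Artifacts -> Input Model
model = keras.models.load_model('/tmp/keras_example/weight.6.hdf5')
```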
The first pipeline step is calling init
GiddyPeacock64 Is this enough to track all the steps?
I guess my main question is: is every step in the pipeline an actual Task/Job, or is it a single small function?
Kubeflow is great for simple DAGs but when you need to build more complex logic it is usually a bit limited
(for example the visibility into what's going on inside each step is missing so you cannot make a decision based on that).
WDYT?
(BTW: draft means they are in edit mode, i.e. before execution, then they should be queued (i.e. pending) then running then completed)
GiddyPeacock64 Are you sending the jobs from JupyterLab Kale extension ?
EDIT:
Is the pipeline step itself calling Task.init?
PricklyJellyfish35
Do you mean the original OmegaConf, before the overrides ? or the configuration files used to create the OmegaConf ?
Yes, this seems like the problem, you do not have an agent (trains-agent) connected to your server.
The agent is responsible for pulling the experiments and executing them:
pip install trains-agent
trains-agent init
trains-agent daemon --gpus all
ShaggyHare67 notice that the services queue is designed to run CPU-based tasks like monitoring, etc.
For the actual training you need to run your trains-agent
on a GPU machine.
Did you run trains-agent init?
It will walk you through the configuration (git user/pass included).
If you want to manually add them, you can see an example of the configuration file in the link below.
You can find it at ~/trains.conf
https://github.com/allegroai/trains-agent/blob/master/docs/tr...
See the log:
Collecting keras-contrib==2.0.8
File was already downloaded c:\users\mateus.ca\.clearml\pip-download-cache\cu0\keras_contrib-2.0.8-py3-none-any.whl
So it did download it, but it failed to pass it along correctly?!
Can you try with clearml-agent==1.5.3rc2?
Things to check:
- Task.connect is called before the dictionary is actually used
- Just in case, do configs['training_configuration'] = Task.connect(configs['training_configuration'])
- Add print(configs['training_configuration']) after the Task.connect call, to make sure the parameters were passed correctly
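Roughly what I mean (a minimal sketch; the config keys/values and project/task names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="connect config")  # placeholder names

configs = {
    "training_configuration": {"batch_size": 32, "learning_rate": 0.001},
}

# connect the dictionary *before* it is used, and keep the returned object,
# so values overridden from the UI are reflected at runtime
configs["training_configuration"] = task.connect(configs["training_configuration"])
print(configs["training_configuration"])
```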
would those containers best be started from something in services mode?
Yes as long as the machine has enough cpu/ram
Notice that the services mode will start a second parallel Task after the first one is done setting up the env. If you run with CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL and use containers that have git/python/clearml-agent preinstalled, the overhead should be minimal.
or is it possible to get no-overhead with my approach of worker-inside-docker?
No, do not do that, see above e...
Go to the workers & queues, page right side panel 3rd icon from the top
What should have happened is the experiments should have been pending (i.e. in a queue)
(Not sure why they are not).
You can manually send them for execution: right click on an experiment in the table, select Enqueue, and select the default queue (this will be the one the trains-agent pulls from, by default)
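If it is easier to do from code instead of the UI, something like this should also work (the task ID is a placeholder, and on older setups the import would be from trains instead of clearml):
```python
from clearml import Task  # `from trains import Task` on older setups

# fetch the draft experiment and push it into the default queue
task = Task.get_task(task_id="aabbccdd")  # placeholder ID
Task.enqueue(task, queue_name="default")
```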
im not running in docker mode though
Hmm, that might be the first issue. It cannot skip venv creation; it can however use a pre-existing venv (but it will change it every time it installs a missing package)
So setting CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 in non-docker mode has no effect
Maybe it's the Azure upload that has a weird size bug?!
So why is it trying to upload to "//:8081/files_server:" ?
What do you have in the trains.conf on the machine running the experiment ?
I see in the UI there are 5 drafts
What's the status of these 5 experiments? draft ?
And voila, full trace including Git and uncommitted changes, Python packages, and the ability to change arguments from the UI
Hmm you mean like overrides ?
Maybe store both before/after resolving ?
(Although that might be confusing, as the pre-resolve version should actually be read-only)
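For example, something along these lines could store both versions (a sketch assuming OmegaConf and task.connect_configuration; project/task names and config values are placeholders):
```python
from clearml import Task
from omegaconf import OmegaConf

task = Task.init(project_name="examples", task_name="omegaconf configs")  # placeholder names
cfg = OmegaConf.create({"lr": 0.1, "scaled_lr": "${lr}"})

# raw (pre-resolve) composition, interpolations left as-is
task.connect_configuration(OmegaConf.to_container(cfg, resolve=False), name="OmegaConf (raw)")
# resolved version, with all interpolations evaluated
task.connect_configuration(OmegaConf.to_container(cfg, resolve=True), name="OmegaConf (resolved)")
```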