This will fix it; the issue is the "no default value" that breaks the casting:
```python
@PipelineDecorator.component(cache=False)
def step_one(my_arg=""):
    ...
```
Hi UpsetCrocodile10
execute them and return scalars.
This should be a good start (I hope 🙂):
```python
for child in children:
    # put the Task into an execution queue
    Task.enqueue(child, queue_name='my_queue_here')
    # wait for the task to finish
    child.wait_for_status(status=['completed'])
    # reload all the metrics
    child.reload()
    # get the metrics
    print(child.get_last_scalar_metrics())
```
UpsetCrocodile10
Does this method expect my_train_func to be in the same file as ...
As long as you import it and you can pass it, it should work.
Child exp gets aborted immediately ...
It seems it cannot find the file "main.py"; it assumes all code is part of a single repository. Is that the case? What do you have under the "Execution" tab for the experiment?
You cannot change the 8008 port; it has to be 8008 externally (i.e. from the client side).
You can, however, use subdomains, but only these will work: api.mydomain.com, app.mydomain.com, files.mydomain.com
@<1523704157695905792:profile|VivaciousBadger56>
Is the idea here the following? You want to use inversion-of-control such that I provide a function f to a component that takes the above dict as an input. Then I can do whatever I like inside the function f and return a different dict as output. If the output dict of f changes, the component is rerun; otherwise, the old output of the component is used?
Yes, exactly! This way you...
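For example, a minimal sketch of that caching behavior (the names here are illustrative, not from your code):
```python
from clearml.automation.controller import PipelineDecorator

# With cache=True the component's previous output is reused as long as its
# code and input arguments are unchanged; a changed input dict (e.g. the
# output of your function `f`) triggers a re-run of the component.
@PipelineDecorator.component(cache=True, return_values=["stats"])
def compute_stats(config: dict) -> dict:
    # `config` is the dict produced upstream; the returned dict is what
    # downstream steps (and the cache) see.
    return {"rows": config.get("rows", 0), "processed": True}
```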
Is the code in this "other" repo downloaded to the agent's machine? Or is the component's code pushed to the machine on which the repository is?
Yes this repo is downloaded into the agent, so your code has access to it
...mean? Is it not possible that I call code that is somewhere else on my local computer and/or in my code base? That makes things a bit complicated if my current repository is not somehow available to the agent.
I guess you can ignore this argument for the sake of simple discussion. If you need access to extra files/functions, just make sure you point the repo argument to their repo, and the agent will make sure your code is running from the repo root, with all the repo files under i...
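For illustration, a sketch of pointing a component at another repo (the URL and module names below are placeholders, not real projects):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(
    repo="https://github.com/your-org/your-utils.git",  # placeholder repo URL
    repo_branch="main",
)
def step_with_extra_files(path: str):
    # the agent clones the repo above and runs this step from its root,
    # so imports resolve against that repo's files
    from my_utils import helpers  # hypothetical module inside that repo
    return helpers.process(path)
```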
In terms of creating dynamic pipelines and cyclic graphs, the decorator approach seems the most powerful to me.
Yes that is correct, the decorator approach is the most powerful one, I agree.
If you are using the "default" queue for the agent, notice you might need to run the agent with --services-mode to allow for multiple pipeline components on the same machine.
Well it is there, do you have it in your docker-compose as well?
https://github.com/allegroai/trains-server/blob/master/docker-compose.yml#L55
For setting up trains-server I would recommend the docker-compose; it is very easy to set up, and you just need a single fixed compute instance. Details: https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md
With regards to the "low prio clusters", are you asking how they could be connected with the trains-agent, or if running code that uses trains will work on them?
Hi @<1546665666675740672:profile|AttractiveFrog67>
- Make sure you stored the model's checkpoint (either pass output_uri=True in Task.init, or manually upload)
- When you call Task.init pass continue_last_task=True
- Now you can do last_checkpoint = task.models["output"][-1].get_local_copy() and all you need is to load last_checkpoint
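Putting it together, something like this rough sketch (project/task names are placeholders):
```python
from clearml import Task

task = Task.init(
    project_name="my_project",
    task_name="my_training",
    continue_last_task=True,   # continue into the previous task
    output_uri=True,           # upload new checkpoints as well
)

# fetch the latest output model checkpoint registered on the task
last_checkpoint = task.models["output"][-1].get_local_copy()
# now load `last_checkpoint` with your framework, e.g. torch.load(last_checkpoint)
```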
I would like to start off by saying that I absolutely love clearml.
@<1547028031053238272:profile|MassiveGoldfish6> thank you for saying that! 🙂
Is it possible to download individual files from a dataset without downloading the entire dataset? If so, how do you do that?
Well, by default files are packaged into multiple zip files. You can control the size of the zip file for finer granularity, but at the end when you download, you are downloading the entire packaged ...
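For example, a sketch of controlling the archive chunk size at upload time (the chunk_size argument is in MB; worth double-checking against your clearml version):
```python
from clearml import Dataset

# names are placeholders
ds = Dataset.create(dataset_name="my_dataset", dataset_project="datasets")
ds.add_files("data/")
ds.upload(chunk_size=100)  # package files into ~100 MB zip chunks
ds.finalize()
```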
Hi SlimyElephant79
As you can imagine, wandb's tracking code would be present across the code modules and I was hoping for a structured approach that would help me transition to ClearML's experiment tracking.
Do you guys have a layer in between that does the reporting, or is the codebase riddled with direct reporting calls? If the latter, then I guess search and replace? Or maybe a module that "converts" wandb calls to clearml calls? wdyt?
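For example, a hypothetical shim module (the shim itself and its names are made up for illustration; the clearml calls are standard):
```python
from clearml import Task

_task = Task.init(project_name="migration", task_name="example")
_logger = _task.get_logger()


def log(metrics: dict, step: int = 0):
    """Drop-in stand-in for wandb.log({...}) that reports to ClearML."""
    for name, value in metrics.items():
        _logger.report_scalar(title=name, series=name, value=value, iteration=step)
```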
Can you please elaborate on the latter point? My jupyterhub's fully containerized and allows users to select their own containers (from a list I built) at launch, and launch multiple containers at the same time. Not sure I follow how toes are stepped on.
Definitely a great start, usually it breaks on memory / GPU-mem where too many containers on the same machine are eating each others GPU ram (that cannot be virtualized)
It's a good abstraction for monitoring the state of the platform and callbacks, if this is what you are after.
If you just need "simple" cron, then you can always just loop/sleep 🙂
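For example, a minimal loop/sleep sketch (the Task ID and queue name are placeholders):
```python
import time
from clearml import Task

TEMPLATE_TASK_ID = "<template-task-id>"  # placeholder

while True:
    # clone a template task and enqueue it once a day
    cloned = Task.clone(source_task=TEMPLATE_TASK_ID, name="nightly run")
    Task.enqueue(cloned, queue_name="default")
    time.sleep(24 * 60 * 60)  # sleep for a day
```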
I'm guessing this is done through code-server?
correct
I'm currently rolling a JupyterHub instance (multiuser, with codeserver inside) on the same machine as clearml-server. That's where tasks are executed etc. so, all browser dev env.
Yeah, the idea with clearml-session each user can self serve themselves the container that works best for them. With a jupyterhub they start to step on each other's toes very quickly ...
But this config should almost never need to change!
Exactly the idea 🙂
notice the password (initially random) is also fixed on your local machine, for the exact same reason
How can I make it such that any update to the upstream database ...
What do you mean "upstream database"?
ok, but this happens on my local machine, not in the agent
resource monitoring is always running in the background, even on local machines. (of course you can turn it off)
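e.g., a sketch of turning it off when creating the task (via the auto_resource_monitoring flag):
```python
from clearml import Task

# project/task names are placeholders
task = Task.init(
    project_name="my_project",
    task_name="no_monitoring",
    auto_resource_monitoring=False,  # disable the background resource monitor
)
```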
Basically the idea is that you create the pipeline once (say, a debug run), then once you see it is running, you have a Task of your pipeline in the system (with any custom logic you added). With a Task in the system you can always clone/modify and launch externally (i.e. from code/UI). Make sense?
PipelineController creates another Task in the system, that you can later clone and enqueue to start a process (usually queuing it on the "services" queue)
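For example, a rough sketch of relaunching it from code (the controller Task ID is a placeholder):
```python
from clearml import Task

# clone the registered pipeline controller Task and enqueue the clone
pipeline_task = Task.clone(source_task="<pipeline-controller-task-id>")
Task.enqueue(pipeline_task, queue_name="services")  # usual queue for controllers
```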
BTW: is this on the community server or self-hosted (aka docker-compose)?
Any idea where that could come from? Could we turn off the local logging as well - in these kinds of runs we don't need it?
It is supposed to create it automatically... I tested with other examples (clearml version 1.7.3rc1) and everything seems to work.
What am I missing? How do we recreate the issue? Can you verify it is still not working with the latest RC?
Hi @<1546303293918023680:profile|MiniatureRobin9>
This is the "regular" message when calling Dataset.get
without an alias.
This means the Dataset is not registered on the Task itself, just give it a name (i.e. pass the alias
argument to get
)
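For example, a minimal sketch (names are placeholders):
```python
from clearml import Dataset

# passing `alias` registers the dataset on the current Task,
# which silences the message above
ds = Dataset.get(
    dataset_project="datasets",
    dataset_name="my_dataset",
    alias="training_data",
)
local_path = ds.get_local_copy()
```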
Hi ConvolutedChicken69
assuming you are running the agent in venv mode you can do something like:
```
$ CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1 clearml-agent daemon --queue default
```
This will basically only clone the code and use the default python the clearml-agent itself is using.
Does that help?
BTW:
it gets an error as it can't find it with pip.
What's the error? How come the package cannot be installed?
Hi LazyLeopard18
I remember someone deploying, specifically on the Azure k8s (can't remember now how they call it).
What exactly is the feedback you are after?