GrotesqueDog77 one issue with this design: in order to run a sub-component, the call must be made from the parent component. Does that make sense?
```
def step_one(data):
    return data

def step_two(path):
    model = path  # placeholder, the original snippet returned an undefined `model`
    return model

def both_steps():
    path = step_one("stuff")
    return step_two(path)

def pipeline():
    both_steps()
```
Which would make `both_steps` a component and `step_one` and `step_two` sub-components.
wdyt?
If you take a look here, the returned objects are automatically serialized and stored on the files server or object storage, and also deserialized when passed to the next step.
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
You can of course do the same manually
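Doing it manually could look roughly like the sketch below. It is not ClearML code: `store_artifact`, `load_artifact` and the temporary `ARTIFACT_DIR` are hypothetical stand-ins for the files server or object storage, and only illustrate the serialize, store, then deserialize round trip between two steps.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the files server / object storage.
ARTIFACT_DIR = Path(tempfile.mkdtemp(prefix="artifacts_"))


def store_artifact(name, obj):
    """Serialize a step's return value and return the path it was written to."""
    path = ARTIFACT_DIR / f"{name}.pkl"
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


def load_artifact(path):
    """Deserialize a previously stored return value."""
    with open(path, "rb") as f:
        return pickle.load(f)


def step_one(data):
    # Returns plain Python objects (here, a dict holding a list of strings).
    return {"paths": [f"/images/{data}_{i}.png" for i in range(3)]}


def step_two(payload):
    # Receives the deserialized output of step_one.
    return len(payload["paths"])


stored = store_artifact("step_one_output", step_one("stuff"))
result = step_two(load_artifact(stored))
```

Note that only the strings travel between the steps here; the files they point to still have to be reachable from wherever the second step runs.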
But every agent is a different pod, so I do not know how to properly share the folder with images.
Can I conclude that Kubernetes is running the agents?
Yes, but I'm not sure that they need to have separate tasks
Hmm okay I need to check if this can be easily done
(BTW, the downside of that, you can only cache a component, not a sub-component)
ContemplativePuppy11
yes, nice move. my question was to make sure that the steps are not run in parallel because each one builds upon the previous one
if they are "calling" one another (or passing data) then the pipeline logic will deduce they cannot run in parallel 🙂 basically it is automatic
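As a rough illustration of that idea (not the actual ClearML pipeline logic), the sketch below derives an execution order purely from which step consumes which output. The `consumes` mapping and the step names are hypothetical; the ordering uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline description: each step lists the steps whose output it consumes.
consumes = {
    "step_one": [],
    "step_two": ["step_one"],   # uses the return value of step_one
    "report": ["step_two"],     # uses the return value of step_two
    "unrelated": [],            # shares no data with the other steps
}


def execution_waves(graph):
    """Group steps into waves; steps in the same wave have no data dependency on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(consumes)
```

Steps that pass data to one another always end up in different waves, so they cannot run in parallel, while `unrelated` is free to run alongside `step_one`.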
so my takeaway is that if the funcs are class methods the decorators won't break, right?
In theory, but the idea of the decorator is that it tracks the return value so it "knows" how t...
you should have a gpu argument there, set it to true
when I run it on my laptop...
Then yes, you need to set the default_output_uri
in your laptop's clearml.conf (just like you set it on the k8s glue)
Make sense ?
No worries, just found it. Thanks!
I'll make sure to follow up on the GitHub issue for better visibility 🙂
but does that mean I have to unpack all the dictionary values as parameters of the pipeline function?
I was just suggesting a hack 🙂 the fix itself is transparent (I'm expecting it to be pushed tomorrow), basically it will make sure the sample pipeline will work as expected.
regardless and out of curiosity, if you only have one dict passed to the pipeline function, why not use named arguments ?
"sub nodes" inside a pipeline, in my opinion, make it much more useful, in the sense that all the steps are visible.
Yeah I really like this idea... continuing this thread, would it also make sense to have a Task object per "sub-node" and run the sub-nodes as subprocess of the parent Node? I'm thinking this sounds like a combination of both local pipeline execution and remote pipeline execution.
wdyt?
GrotesqueDog77 when you say "the second issue" , do you mean the fact that both step 1 and step 2 should have access to the same filesystem?
how would I get an agent to launch in the same instance of my clearml server
Actually that is my point, you do not have to spin the agent on the clearml-server instance. We added the services agent as part of the docker-compose for easier deployment; that said, you can always manually SSH to the server, or run on any other machine, like you would spin any other clearml-agent.
Does that make sense ?
Sounds good to me, adding it to the to do list, probably should not be very complicated to add 🙂
GrotesqueDog77 this should just work: decorate the functions with `@PipelineDecorator.component` and call the functions one after the other:
`paths = step_one()`
`step_two(paths)`
ClearML will make sure it serializes the strings and passes them to step two (of course step two should actually run on a machine with access to the same folder, but this is another issue 🙂 )
If this is the case, then you have to set a shared PV for the pods, this way they can actually have a persistent cache, which would also be shared.
BTW: a single function call might not be a perfect match for a pipeline component; the overhead of starting a node might not be negligible, as it needs to install the required python packages, bring the code, etc.
Ohh, clearml is designed so that you should not worry about that: `download_dataset = StorageManager.get_local_copy()`
This is cached, meaning the machine that runs that line the second time will not re-download the path.
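To make the caching behaviour concrete, here is a minimal sketch of a download cache keyed by URL. It is an assumption-laden toy, not how `StorageManager` is implemented: `cached_local_copy`, `fake_download`, the example URL and the temporary `CACHE_DIR` are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical local cache folder on the machine running the step.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="cache_"))
download_count = 0


def fake_download(url):
    """Hypothetical stand-in for a real (slow) remote download."""
    global download_count
    download_count += 1
    return f"contents of {url}".encode()


def cached_local_copy(url):
    """Return a local path for `url`, downloading only if it is not cached yet."""
    key = hashlib.sha256(url.encode()).hexdigest()
    target = CACHE_DIR / key
    if not target.exists():
        target.write_bytes(fake_download(url))
    return target


first = cached_local_copy("s3://bucket/dataset.zip")
second = cached_local_copy("s3://bucket/dataset.zip")  # served from the cache
```

The second call returns the same local path without touching the remote storage, which is why a dedicated download step can become redundant when every step runs on a machine with a warm cache.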
This means step 1 is redundant, no?
Usually when data is passed between components it is automatically uploaded as artifact to the Task (stored on the files server or object storage etc.) then downloaded and passed to the next steps.
How large is the data that you are wo...
because a step can be constructed with multiple sub-components, but not all of them might be added to the UI graph
Just to make sure I fully understand: when we decorate with @sub_node we want that to also appear in the UI graph (and have its own Task / metrics etc.)
correct?
Makes total sense!
Interesting, you are defining the sub-component inside the function, I like that, this makes the code closer to how this is executed!
BTW:
Error response from daemon: cannot set both Count and DeviceIDs on device request.
Googling it points to a docker issue (which makes sense considering):
https://github.com/NVIDIA/nvidia-docker/issues/1026
What is the host OS?
Hi GrotesqueDog77
What do you mean by share resources? Do you mean compute or storage?
DistressedGoat23 check this example:
https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
`aSearchStrategy = RandomSearch`
It will collect everything on the main Task
This is a crucial point for using clearml HPO, since comparing dozens of experiments in the UI and searching for the best is just not manageable.
You can of course do that (notice you can actually order them by scalars they report, and even do ...
Hi GleamingSeagull15
Try adjusting:
None
to 30 sec
It will reduce the number of log reports (i.e. API calls)
AbruptWorm50 my apologies, I think I misled you. Yes, you can pass generic arguments to the optimizer class, but specifically for optuna this is disabled (not sure why)
Specifically to your case, the way it works is:
your code logs to tensorboard, clearml catches the data and moves it to the Task (on clearml-server), optuna optimization is running on another machine, trial values are manually updated (i.e. the clearml optimization pulls the Task reported metric from the server and updat...
there is almost zero overhead if your docker container already has everything (including the agent) preinstalled and you set it with CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
it then should basically just run the code.
There is some overhead, but it should be negligible.
I think the limit is a few GB, I'm not sure, I'll have to check
And yes the oldest experiments will be deleted first (with the exception of published experiments, they will be deleted last)
Okay we have located the issue, thanks guys! We will push a patch release hopefully later today
Hi IntriguedRat44
Sorry, I missed this message...
I'm assuming you are running in manual mode (i.e. not through the agent), in that case we do not change the CUDA_VISIBLE_DEVICES.
What do you see in the resource monitoring? Is it a single GPU or multiple GPUs?
(Check the :monitor:gpu in the Scalar tab under results.)
Also, what's the Trains/ClearML version you are using, and the OS?
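Since CUDA_VISIBLE_DEVICES is left untouched in manual mode, one option is to set it yourself at the top of the script. The sketch below assumes you want to expose only GPU 0; `restrict_visible_gpus` is a hypothetical helper, and the variable has to be set before the deep learning framework initializes CUDA for it to take effect.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA-based libraries started afterwards."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this before importing / initializing the framework that uses CUDA.
visible = restrict_visible_gpus([0])
```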
you can also just create a venv and run the tests there (with the latest python package) ?
Ohh then YES!
the Task will be closed by the process, and since the process is inside the Jupyter and the notebook kernel is running, it is still running