
wdym 'executed on different machines'?
The assumption is that you have machines (i.e. clearml-agents) connected to clearml, which would be running all the different components of the pipeline. Think out-of-the-box scale-up. Each component will become a standalone Job and the data will be passed (i.e. stored and loaded) automatically on the clearml-server (can be configured to be external object storage as well). This means if you have a step that needs a GPU, it will be launched on a GPU machine...
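For illustration, a minimal sketch (queue names and functions are hypothetical, assuming agents are listening on those queues) of how each pipeline component becomes a standalone job on its own queue:

from clearml.automation.controller import PipelineDecorator

# each component runs as a standalone Task on the queue it is assigned to;
# return values are stored on the clearml-server (or external object storage)
# and loaded automatically by the next step
@PipelineDecorator.component(return_values=["dataset_path"], execution_queue="cpu_queue")
def prepare_data():
    return "/tmp/dataset"

@PipelineDecorator.component(return_values=["model_path"], execution_queue="gpu_queue")
def train(dataset_path):
    print(f"training on {dataset_path}")
    return "/tmp/model"

@PipelineDecorator.pipeline(name="demo pipeline", project="demo", version="0.1")
def run_pipeline():
    data = prepare_data()
    train(data)

if __name__ == "__main__":
    run_pipeline()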
Yes (Mine isn't and it is working)
Hi WorriedParrot51 , what do you mean by "call get_parameters_as_dict() from agent" ?
Do you mean like change the trains-agent to run the task differently?
Or inside your code while the trains agent runs it?
From the code itself (regardless of how you run it) you can always call and get the current state's parameters (i.e. from the backend if running with trains-agent, or copied from the code, if running manually):
task.get_parameters_as_dict()
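A minimal sketch (project and parameter names are just examples):

from clearml import Task

task = Task.init(project_name="demo", task_name="params example")
task.connect({"lr": 0.001, "batch_size": 32})

# returns the current parameters: taken from the backend when executed by an agent,
# or the values connected in the code when running manually
params = task.get_parameters_as_dict()
print(params)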
Hi CrookedAlligator14
or is underlying data also accessible?
What do you mean by "underlying data" ?
that's the entire repo link ? not something like https://github.com/ ... ?
Hmm, it might be sub-sampling on large scalar plots (so that we do not "kill" the ui), but I remember that it only happens above 50k samples. (when you zoom in, do you still get the 0.5 values?)
should reload the reported scalars
Exactly (notice it also understands when the last report of scalars was, so it should automatically increase the iterations, i.e. you will not accidentally overwrite previously reported scalars).
and the task needs to reload last checkpoints only, right?
Correct
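A hedged sketch of what that looks like (project/task names are placeholders); continue_last_task is the relevant Task.init argument:

from clearml import Task

# continue_last_task reuses the previous Task instead of creating a new one;
# reported scalars keep incrementing from the last reported iteration,
# so earlier reports are not overwritten
task = Task.init(
    project_name="demo",
    task_name="training",
    continue_last_task=True,  # or continue_last_task="<previous task id>"
)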
We didn't figure out the best way of continuing for both the grid and optuna. Can you suggest something?
That is a good point, not sure if we have a GH issue for that, but wo...
What's the Windows version, python version, clearml version, you are using ?
the hack doesn't work if conda is not installed
Of course conda needs to be installed, it is using a pre-existing conda env, no?! What am I missing?
Ideally it would just pull an experiment from a dedicated HPO queue and run it inplace
And the assumption is the code is also there ?
I basically moved the Task.init() call below the imports
Okay that is odd, can you copy paste the before/after of the import, so we can fix that?!
Great, you can test directly from the master:
pip3 install -U git+
It might be broken for me, as I said the program works without the offline mode but gets interrupted and shows the results from above with offline mode.
How could I reproduce this issue ?
But there might be another issue in between of course - any idea how to debug?
I think I missed this one, what exactly is the issue ?
So would this pseudo code solve the issue
import os

def pipeline_creator():
    pipeline_a_id = os.system("python3 create_pipeline_a.py")
    print(f"pipeline_a_id={pipeline_a_id}")
something like that?
(obviously the question is how you would get the return value of the new pipeline ID, but I'm getting ahead of myself)
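One hedged way to get that return value (assuming create_pipeline_a.py prints the new pipeline ID to stdout) is to capture the output with subprocess instead of os.system:

import subprocess

def pipeline_creator():
    # run the script and capture its stdout, assuming it prints only the new pipeline ID
    result = subprocess.run(
        ["python3", "create_pipeline_a.py"],
        capture_output=True, text=True, check=True,
    )
    pipeline_a_id = result.stdout.strip()
    print(f"pipeline_a_id={pipeline_a_id}")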
For now we've monkey-patched it to our usecase:
LOL, that's a cool hack
That gives us the benefit of creating "local datasets" (confined to the scope of the project; they do not appear in the Datasets tab, but appear as normal tasks within the project)
So what would be a "perfect" solution here?
I think I'm missing the point on why it became an issue in the first place.
Notice that in new versions Dataset will be registered on the Tasks that use them (they are already...
ReassuredTiger98 that is a good point, at the moment they are designed as "machine level" configs, but we do have built-in support to allow multiple configurations. The technical issue is we have to read the configuration file before we initialize the Task object, which means we are still not aware of the git root (which I assume is where we could put a configuration file).
BTW: regarding the detect_with_conda_freeze
we hope that this flag is rarely used, as ClearML should auto-detect t...
the optimizer such that the study object of the optimizer keeps track of the results and the next sample will be aware of all previous studies
This is done from the optimizer side, by sampling the scalars reported by any experiment the optimizer created.
I am looking for a way to manually sample and report from and to the optimizer...
.. I can avoid running unnecessary common heavy setup for a lightweight experiment
Maybe it makes sense to inherit from the Optimizer and add ...
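For context, a minimal sketch of the standard HyperParameterOptimizer setup this refers to (base task ID, metric names, parameter range, and queue are placeholders; assumes optuna is installed); the objective_metric settings are what the optimizer samples from every experiment it creates:

from clearml.automation import HyperParameterOptimizer, UniformParameterRange
from clearml.automation.optuna import OptimizerOptuna

optimizer = HyperParameterOptimizer(
    base_task_id="<base task id>",
    hyper_parameters=[UniformParameterRange("General/lr", min_value=1e-4, max_value=1e-1)],
    # the scalar (title/series) sampled from each created experiment
    objective_metric_title="validation",
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=OptimizerOptuna,
    execution_queue="default",
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop()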
instead of the one that I want or the one of the env which it is started from.
The default is the python that is used to run the agent.
agent.ignore_requested_python_version = true
agent.python_binary = /my/selected/python3.8
(This code sample should work on your setup with your installed packages without a problem)
- Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase
No, you can specify a different code base, see here:
None
- The component code still needs to be self-composed (or, function component can also be quite complex)
Well, it can address the additional repo (it will be automatically added to the PYTHONPATH), and you c...
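A hedged sketch (repo URL, module, and queue are placeholders) of pointing a function component at a different code base:

from clearml.automation.controller import PipelineDecorator

# the referenced repo is cloned for the component and added to the PYTHONPATH
@PipelineDecorator.component(
    return_values=["score"],
    repo="https://github.com/example/other-repo.git",
    repo_branch="main",
    execution_queue="default",
)
def evaluate(model_path):
    from other_repo.eval import run_eval  # hypothetical module from the other repo
    return run_eval(model_path)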
Hi UnevenDolphin73
In theory it "might" work, I have to admit that personally I'm not a fan of what Amazon did to Mongo, i.e. forking their code base and selling it as a service, just bad open-source practice
(The main issue might be API calls that might not fully match)
wdyt?
Seems like everything is in order. Can you curl to the API/web/files server?
Queues can have multiple workers, and that implies multiple instances of a task can run concurrently.
@<1533619716533260288:profile|SmallPigeon24> as long as these are the exact same instances you can have them running simultaneously (think multi-node training), that said each one should "know" not to report over the others, because of course it will overwrite the reports.
Back to your point on multiple agents:
You cannot have two Tasks in the same queue, that means that a single agen...
ScaryKoala63
When it fails, what's the number of files you have in /home/developer/.clearml/cache/storage_manager/global/ ?
Is this consistent on the same file? can you provide a code snippet to reproduce (or understand the flow) ?
Could it be two machines are accessing the same cache folder ?
Finally managed; you keep saying "all projects" but you meant the "All Experiments" project instead. That's a good start. Thanks!
Yes, my apologies you are correct: "all experiments"
Thanks TroubledHedgehog16 for the context.
sdk.development.worker.report_period_sec
Yes please update to the latest version 1.8.0 for full support (to be released today, I think)
https://github.com/allegroai/clearml/blob/f6238b8a0fb662540bca9095cc0c22bd7af483c1/docs/clearml.conf#L196
https://github.com/allegroai/clearml/blob/f6238b8a0fb662540bca9095cc0c22bd7af483c1/docs/clearml.conf#L199
we have been running agents on 3 on-premise systems.
Do notice that by default an...
. Looking at this example here, it looks like it only works with tasks:
Aha! Pipeline is a Task (a specific type of Task, nonetheless a Task)
Just use the pipeline ID, and make sure you push it into the services queue, voila
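A minimal sketch (the pipeline controller ID is a placeholder), assuming Task.enqueue is used to push it into the services queue:

from clearml import Task

# a pipeline controller is just a Task, so it can be enqueued by its ID
pipeline_task = Task.get_task(task_id="<pipeline controller task id>")
Task.enqueue(pipeline_task, queue_name="services")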