In terms of creating dynamic pipelines and cyclic graphs, the decorator approach seems the most powerful to me.
Yes, that's correct, the decorator approach is the most powerful one, I agree.
DilapidatedDucks58
all our workers went down after starting the Slack bot, is that expected?)
Oh dear... I can't see any connection... What is the last log you have there?
ReassuredTiger98 do you know if tensorboard (not tensorboardX) also supports gif there?
You can already sort and filter experiments based on any hyperparameter or metric that the experiment reports, so there is no need for a custom query language. Any filtered/sorted table can be shared exactly as it is, so you can create leaderboards and share specific filters. You can also use the search bar to filter based on experiment name / comment. Tags will be added soon as well 🙂
An example of custom columns is here (the screen grab is a bit old, now there is als...
StaleButterfly40 just making sure I understand, are we trying to solve the "import offline zip file/folder" issue, where we create multiple Tasks (i.e. Task per import)? Or are you suggesting the Actual task (the one running in offline mode) needs support for continue-previous execution?
If this is the case I would do:
`# Add the collector steps (i.e. the 10 Tasks)
pipe.add_task(
    ...,
    post_execute_callback=Collector.collect_me,
)
pipe.start()
pipe.wait()
Collector.process_results(pipe)`
wdyt?
I'm assuming these are the Only packages that are imported directly (i.e. pandas requires other packages, but the code imports pandas, so that is what's listed).
The way ClearML detects packages: it first tries to determine whether this is a "standalone" script; if it is, then only the imports in the main script are logged. If it "thinks" this is not a standalone script, it will analyze the entire repository.
Make sense?
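BTW, if the automatic analysis misses a package, you can always add it explicitly before Task.init. A minimal sketch (the package version and project/task names here are just placeholders):
`from clearml import Task

# explicitly add a requirement, merged with the auto-detected ones
Task.add_requirements("pandas", "1.3.0")
task = Task.init(project_name="examples", task_name="explicit requirements")`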
Hi CloudySwallow27
This error occurs randomly during training (in other words, training does successfully start).
What's the clearml-agent version you are using, and the clearml version?
Hi ScantChimpanzee51
How are you launching the code?
Basically the easiest way is to do so with the example you just mentioned,
Can this issue be reproduced?
Yep, found it, the `--name` argument is marked as required and the argparser throws an error ...
I'll make sure this is fixed as well 🙂
Hmm worked now...
When Task.init is called with output_uri='s3://my_bucket/sub_folder', the file ends up at:
s3://my_bucket/sub_folder/examples/upload issue.4c746400d4334ec7b389dd6232082313/artifacts/test/test.json
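For reference, a minimal sketch of the code that produces a path like that (names mirror the path above, the artifact object itself is just an example):
`from clearml import Task

# everything the task uploads (artifacts / models) goes under output_uri
task = Task.init(
    project_name="examples",
    task_name="upload issue",
    output_uri="s3://my_bucket/sub_folder",
)
# a dict artifact is stored as a json file
task.upload_artifact(name="test", artifact_object={"a": 1})`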
Sorry, I meant the point where you select the interpreter for PyCharm.
Oh I see...
I think they (DevOps) said something about next week, internal roll-out is this week (I think)
Thanks SolidSealion72 !
Also, I found out that adding `pool.join()` after `pool.close()` seems to solve the issue in the minimal example.
This is interesting, I'm pretty sure it has something to do with the subprocess not "closing" properly (or too fast or something)
Let me see if I can reproduce
The imports inside the functions are there because each function becomes a stand-alone job running on a remote machine, not the entire pipeline code. This is also how the packages to install on the remote machine are automatically picked up. Make sense?
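For example, with the decorator syntax a component looks something like this (a minimal sketch, function and argument names are placeholders):
`from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["df"])
def load_data(csv_path):
    # imports live inside the function, since it runs as a standalone job
    import pandas as pd
    return pd.read_csv(csv_path)`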
Hmm interesting, I guess once you are able to connect it with ClearML you can just clone / modify / enqueue and let users train models directly from the UI on any hardware, is that the plan ?
LazyTurkey38 notice the assumption is that the docker entry-point ends with bash, and only then does the agent take charge. I'm assuming this is not the case, hence the agent spins up the docker, then the docker just ends. Could that be?
SmallDeer34 the function Task.get_models() incorrectly returned the input model "name" instead of the object itself. I'll make sure we push a fix.
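In the meantime, something along these lines should get you the actual model object (a sketch, the task id is a placeholder):
`from clearml import Task

task = Task.get_task(task_id="<task_id_here>")
models = task.get_models()  # dict with "input" and "output" model lists
input_model = models["input"][0]
print(input_model.name, input_model.url)`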
I found a different solution (hardcoding the parent tasks by hand),
I have to wonder, how does that solve the issue?
Any recommended way to make a task/pipeline “pause” until some external condition is met?
RoughTiger69 I would setup a trigger on the Dataset (i.e. new version)
https://github.com/allegroai/clearml/blob/df3d3b269acd2df0f31bfe804eb54ddc84d807c0/examples/scheduler/trigger_example.py#L44
wdyt?
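Roughly along the lines of the linked example, a sketch (task id, queue and project names are placeholders):
`from clearml.automation import TriggerScheduler

trigger = TriggerScheduler(pooling_frequency_minutes=3)
# launch a copy of the given task whenever a new dataset version appears
trigger.add_dataset_trigger(
    schedule_task_id="<task_id_to_launch>",
    schedule_queue="default",
    trigger_project="datasets",
)
trigger.start()`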
Is there any way to make that increment from last run?
`pipeline_task = Task.clone("pipeline_id_here", name="new execution run here")
Task.enqueue(pipeline_task, queue_name="services")`
wdyt?
I'm getting: `hydra_core == 1.1.1`
What's the setup you have? Python version, OS, Conda yes/no?
Hi LazyLeopard18 ,
So long story short, yes it does.
Longer version: to really accomplish full federated learning with control over data at the "compute points" you need some data abstraction layer. Without a data abstraction layer, federated learning is just averaging derivatives from different locations, which can easily be done with any distributed learning framework, such as horovod, pytorch distributed, or TF distributed.
If what you are after is, can I launch multiple experiments with the sam...
Hi ProudMosquito87
so you mean to mount your data folder into the docker so that the code can access it, correct?
If that is the case, is there a specific reason not to use an absolute path (e.g. /mnt/data/mine)?
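If mounting is the way you want to go, a minimal sketch, assuming the agent runs in docker mode (image name and mount path are placeholders):
`from clearml import Task

task = Task.init(project_name="examples", task_name="docker mount")
# base docker image plus extra docker arguments the agent will use
task.set_base_docker("nvidia/cuda:11.0-base -v /mnt/data:/mnt/data")`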
so it would be better just to use the original code files and the same conda env, if possible…
Hmm you can actually run your code in "agent mode", assuming you have everything else set up.
This basically means you set a few environment variables prior to launching the code:
Basically:
`export CLEARML_TASK_ID=<The_task_id_to_run>
export CLEARML_LOG_TASK_TO_BACKEND=1
export CLEARML_SIMULATE_REMOTE_TASK=1
python my_script_here.py`
Hi JuicyFox94
you pointed to exactly the issue 🙂
In your trains.conf
https://github.com/allegroai/trains/blob/f27aed767cb3aa3ea83d8f273e48460dd79a90df/docs/trains.conf#L94
Yes this is Triton failing to load the actual model file