I see what you mean.
```
an_optimizer = HyperParameterOptimizer(
    base_task_id='39d2c27baa8145929b2e21f686a17046',
    hyper_parameters=[],
    objective_metric_title='epoch_accuracy',
    objective_metric_series='epoch_accuracy',
    objective_metric_sign='max',
    optimizer_class=aSearchStrategy,
    max_iteration_per_job=0,
    total_max_jobs=0,
    auto_connect_task=False,
)
print(an_optimizer.get_top_experiments(top_k=5))
```
Yes, this seems like it is stuck. Could you test with the demo server?
(basically remove the clearml.conf, it will connect automatically)
Hi LackadaisicalOtter14
However, whenever we spin up a session, ... always gets run and overwrites our configs
what do you mean by that?
Which config is being overwritten? (generally speaking, it just adds the OS environment variables it needs for the setup process)
GloriousPenguin2 could you open a GitHub issue on it? Just making sure this will actually get fixed
Yes, that makes sense. If the overhead of the additional packages is not huge, I do not think it is worth the maintenance
BTW clearml-agent has full venv caching that you can turn on, so when running remotely you are not "paying" for the additional packages being installed:
Un-comment this line:
https://github.com/allegroai/clearml-agent/blob/51eb0a713cc78bd35ca15ed9440ddc92ffe7f37c/docs/clearml.conf#L116
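For reference, that part of the agent's clearml.conf looks roughly like this (key names taken from the sample config; defaults may differ between agent versions):
```
agent {
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum free space (GB) required to keep a cache entry
        free_space_threshold_gb: 2.0
        # un-comment the line below to enable venv caching
        path: ~/.clearml/venvs-cache
    }
}
```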
BTW:
Error response from daemon: cannot set both Count and DeviceIDs on device request.
Googling it points to a docker issue (which makes sense considering):
https://github.com/NVIDIA/nvidia-docker/issues/1026
What is the host OS?
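To illustrate what the daemon is complaining about, here is a rough sketch using the docker Python SDK (image name and device IDs are just examples): a GPU device request can specify either a count or explicit device IDs, but not both.
```
import docker
from docker.types import DeviceRequest

client = docker.from_env()

# OK: request GPUs by count
by_count = DeviceRequest(count=1, capabilities=[["gpu"]])
# OK: request specific GPUs by ID
by_ids = DeviceRequest(device_ids=["0"], capabilities=[["gpu"]])
# Not OK: setting both triggers
# "cannot set both Count and DeviceIDs on device request"
# broken = DeviceRequest(count=1, device_ids=["0"], capabilities=[["gpu"]])

client.containers.run(
    "nvidia/cuda:12.2.0-base-ubuntu22.04",  # example image
    "nvidia-smi",
    device_requests=[by_ids],
    remove=True,
)
```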
The reason is that it is logged as an image, not a plot
One example is a node that resizes images: the node receives a Dataset as input, iterates over each image, resizes it, and outputs a new Dataset, which is used by the next node downstream in the Pipeline.
I agree, this sounds like a "function" rather than a job, so better suited for Kedro.
organization structure, and see for yourself (this pipeline has two nodes, `train_model` and `predict`)
Interesting! Let me dive into that and ...
So I might be a bit out of sync, but I think there should be Triton serving and OpenVino serving built into it (or at least in progress).
1724924574994 g-s:gpu1 DEBUG WARNING:root:Could not lock cache folder /root/.clearml/venvs-cache: [Errno 9] Bad file descriptor
You have an issue with your OS / mount. Specifically, "/mnt/clearml/" is the base folder for all the cached stuff, and it fails to create the lock files there. Either use a local folder, or try to figure out what the issue is with the host machine's /mnt/ mounts (it looks like a network mount).
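If you just want to move the caches off the network mount, something along these lines in clearml.conf should do it (the paths here are only examples; double-check the key names against your clearml.conf version):
```
sdk.storage.cache.default_base_dir: "/var/cache/clearml"
agent.venvs_dir: "/var/cache/clearml/venvs-builds"
agent.venvs_cache.path: "/var/cache/clearml/venvs-cache"
```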
That's the right place but
like you would use hydra --override, which in your case I think should be "accelerator.gpu".
You can also change `allow_omegaconf_edit` in the UI to True, and then you could just edit the OmegaConf in the UI (if you do not change `allow_omegaconf_edit`, the edit in the UI is ignored)
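For context, a minimal sketch of how this usually looks (the config key `accelerator.gpu` is taken from your example; exact behavior depends on your clearml / hydra versions):
```
import hydra
from omegaconf import DictConfig
from clearml import Task

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig):
    # ClearML's hydra binding logs the composed OmegaConf into the task's
    # CONFIGURATION section; with allow_omegaconf_edit enabled in the UI,
    # editing that OmegaConf overrides the values on remote execution.
    task = Task.init(project_name="examples", task_name="hydra demo")
    print(cfg.accelerator.gpu)

if __name__ == "__main__":
    main()
```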
Okay, so the idea behind the new decorator is not to group all the defined steps under the same script so that they share the same environment, but rather to simplify the process of creating scripts for each step and avoid manually calling Task.init on those scripts.
Correct, and allow users to more easily create Tasks from code.
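Roughly what that looks like in practice (a sketch based on the decorator interface; argument names may differ slightly between versions):
```
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["data"], cache=True)
def prepare_data(source_url):
    # runs as its own Task, no manual Task.init needed
    return {"url": source_url}

@PipelineDecorator.component(return_values=["accuracy"])
def train_model(data):
    return 0.9  # placeholder for real training code

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.1")
def run_pipeline(source_url="s3://bucket/dataset"):
    data = prepare_data(source_url)
    print("accuracy:", train_model(data))

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # debug locally; drop this to enqueue on agents
    run_pipeline()
```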
Regarding virtual environment creation from caching, I will keep running benchmarks (from what you say it might be due to high workload ...
Hi CluelessFlamingo93
What do you mean? What's the difference between the ClearML server and self-hosted? Both are self-hosted, no?
We should probably change it so it is more human readable
Hi MortifiedCrow63
I have to admit this is very strange, I think the fact it works for the artifacts and not for the model is kind of a fluke ...
If you use "wait_on_upload" argument in the upload_artifact you end up with the same behavior. Even if uploaded in the background, the issue is still there, for me it was revealed the minute I limited the upload bandwidth to under 300kbps.It seems the internal GS timeout assumes every chunk should be uploaded in under 60 seconds.
The default chunk...
I thought this is the issue on the thread you linked, did I miss something ?
Hi NonsensicalSparrow35
So sorry I missed this thread
Basically your issue is the load balancer that blocks the POST command. You can change that; just add the following line to any clearml.conf:
api.http.default_method: "put"
Hmm so I guess the actual code adds it into the reporting itself ...
How about we call: `task.set_initial_iteration(0)`
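i.e. something like this (sketch, project/task names are placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="continue training")
# reset the iteration offset so newly reported scalars start from 0
task.set_initial_iteration(0)
```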
Hmm, interesting, why would you want that? Is this because some of the packages will fail?
SteadyFox10 I suspect you are correct
CourageousLizard33 see also section (4) here:
https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md#launching-the-trains-server-docker-in-linux-or-macos
This task is picked up by the first agent; it runs the DDP launch script for itself and then creates clones of itself with task.create_function_task() and passes its address as an argument to the function
Hi UnevenHorse85
Interesting use case, just for my understanding, the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct ?
passes its address as an argument to the function
This seems like a great solution.
the queu...
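Just to make sure we are on the same page, a rough sketch of that flow (address, queue name and function body are placeholders):
```
from clearml import Task

def ddp_worker(master_addr, rank):
    # placeholder for the per-node DDP worker logic
    print(f"connecting to {master_addr} as rank {rank}")

task = Task.init(project_name="examples", task_name="ddp master")
# create a child Task that will execute ddp_worker remotely,
# passing the master's address as an argument
child = task.create_function_task(ddp_worker, master_addr="10.0.0.1:29500", rank=1)
# enqueue it so an agent picks it up
Task.enqueue(child, queue_name="default")
```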
LazyFish41 just making sure, you built a container from the docker file, and used it as the base docker image for the Task, is that correct?
Also notice the clearml-agent will not change the entry point of the docker image, meaning if the entry point does not end with plain bash, it will not actually run anything
Try to manually edit the "Installed Packages" (right click the Task, select "reset", now you can edit the section)
and change it to:
-e git+ssh://git@github.com/user/private_package.git@57f382f51d124299788544b3e7afa11c4cba2d1f#egg=private_package
(assuming "pip install -e git+ssh://git@github.com/user/..." works, this should solve the issue)
Hi WickedGoat98
Will I need to wrap their execution in python by system calls?
That would probably be the easiest solution
Then you can plug it into your pipeline as a preprocessing Task:
You can check this example:
https://github.com/allegroai/trains/tree/master/examples/pipeline
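The wrapper itself can be as simple as this (executable name and arguments are made up, just to show the idea):
```
import subprocess
from clearml import Task

task = Task.init(project_name="examples", task_name="preprocess step")
# run the external tool as a system call; a non-zero exit code raises and fails the Task
subprocess.run(
    ["./preprocess_images", "--input", "raw/", "--output", "resized/"],
    check=True,
)
```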
Hi MortifiedCrow63 , thank you for pinging! (seriously greatly appreciated!)
See here:
https://github.com/googleapis/python-storage/releases/tag/v1.36.0
https://github.com/googleapis/python-storage/pull/374
Can you test with the latest release, see if the issue was fixed?
https://github.com/googleapis/python-storage/releases/tag/v1.41.0
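If it still reproduces on the new version, you could also try passing a larger timeout / smaller chunk size explicitly, something like this (bucket and file names are placeholders):
```
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("artifacts/model.pkl")
# resumable uploads send the file in chunks; each chunk must finish within `timeout`
blob.chunk_size = 16 * 1024 * 1024  # 16 MB, must be a multiple of 256 KB
blob.upload_from_filename("model.pkl", timeout=300)
```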
GrievingTurkey78 sure, the AWS autoscaler can do that:
https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py