(Do notice that even though you can spin two agents on the same GPU, the nvidia drivers cannot share allocated GPU memory, so if one Task consumes too much memory the other will not have enough free GPU memory to run)
Basically the same restriction as manually launching two processes using the same GPU
By default the PyTorch Lightning Trainer will output everything to TensorBoard, which we automatically store. But verify that TensorBoard is installed
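For reference, a minimal sketch of the idea (project/task names are placeholders; assumes pytorch-lightning and tensorboard are installed). Once the Task is initialized, the TensorBoard events written by the Trainer's default logger are captured and stored automatically:
```
from clearml import Task
import pytorch_lightning as pl

# Initialize the Task before creating the Trainer so the TensorBoard
# output is picked up automatically (names below are placeholders)
task = Task.init(project_name="examples", task_name="lightning training")

# The default Lightning logger writes TensorBoard event files,
# which requires the `tensorboard` package to be installed
trainer = pl.Trainer(max_epochs=10)
# trainer.fit(model, datamodule=dm)  # your LightningModule / DataModule go here
```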
- Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase
No, you can specify a different code base, see here:
- The component code still needs to be self-contained (or, a function component can also be quite complex)
Well, it can use the additional repo (it will be automatically added to the PYTHONPATH), and you c...
I always have my notebooks in git repo but suddenly it's not running them correctly.
What do you mean?
Can I switch off git diff (change detection?)
Yes, Task.init(..., auto_connect_frameworks={"detect_repository": False})
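For context, a minimal sketch of that call (project/task names are placeholders):
```
from clearml import Task

# Disable automatic repository (and git diff) detection for this Task
task = Task.init(
    project_name="examples",
    task_name="no repo detection",
    auto_connect_frameworks={"detect_repository": False},
)
```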
Have a wrapper over Task to ensure S3 usage, tags, version number etc., and so the project name can be skipped and it picks it up from the env var
Cool. Notice that when you clone the Task and the agent executes it, the project is already defined, so this env variable is meaningless, no ?
Hi ResponsiveCamel97
Let me explain how it works, essentially it creates a new venv inside the docker, inheriting all the packages from the main system packages.
This allows it to use the installed packages if the versions match, and upgrade/change them if you need, all without the need to rebuild a new container. Make sense ?
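If you want to control that inheritance explicitly, the relevant setting (to the best of my knowledge) lives in the agent section of clearml.conf:
```
agent {
    package_manager {
        # when true, the venv the agent builds inherits the system site-packages,
        # so matching versions are reused instead of reinstalled
        system_site_packages: true
    }
}
```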
You can try just pulling the "metric" section of the Task, but I cannot imagine the network bandwidth is the issue?
Could it be load on the clearml-server (i.e. it needs to handle lots of requests ?)
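Roughly what I mean by pulling just the metric section, as a sketch (the task ID is a placeholder):
```
from clearml import Task

task = Task.get_task(task_id="<task-id>")  # placeholder id

# Last reported value per scalar (small request, no full history)
last_metrics = task.get_last_scalar_metrics()

# Full scalar history (heavier request, only if you really need it)
all_scalars = task.get_reported_scalars()
```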
this?
ids = [t.id for t in top_task]
Hmm check if this one works:
optimizer._get_child_tasks_ids(
    parent_task_id=optimizer._job_parent_id or optimizer._base_task_id,
    order_by=optimizer._objective_metric._get_last_metrics_encode_field(),
    additional_filters={'page_size': int(top_k), 'page': 0})
If it does, let's PR it as a dedicated function
Hi IrateBee40
What do you have in your ~/clearml.conf ?
Is it pointing to your clearml-server ?
No worries 🙂 glad it worked
or creating a dedicated function, I would suggest also including the actual sampled point in the HP space.
Could you expand ?
This would be the most common use case, and essentially the reason for running the HPO: understanding the sensitivity of metrics with respect to hyper-parameters
Does this relate to:
https://github.com/allegroai/clearml/issues/430
manually" filtering the keys I've put in for the HP space. I find it a bit strange that they are not saved as part of t...
Hi LovelyHamster1
That is a good point. Since the Pipeline kind of assumes the Tasks are already in the system, it clones them (leaving you with the original Draft Task).
I think we should add a flag to the pipeline so that if the Task is in draft mode it will use it (instead of cloning it). Since it seems your pipeline is quite straightforward, I'm not sure you actually need the pipeline controller class; you can perform the entire thing manually, see example here: https://github.com/allegroai/clea...
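Something along these lines, as a rough sketch (task/queue names are placeholders):
```
from clearml import Task

# Use the existing Task as a template, clone it, enqueue it, and wait for it
template = Task.get_task(project_name="examples", task_name="step1")
step = Task.clone(source_task=template)
Task.enqueue(step, queue_name="default")
step.wait_for_status()  # blocks until the step finishes

# ...then clone/enqueue the next step, passing artifacts/IDs between them as needed
```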
MysteriousBee56 I would do Task.create()
you can get the full Task internal representation with task.data
Then call task._edit(script={'repo': ...}) to edit/update all the Task entries.
You can check the full details of the task object here: https://github.com/allegroai/trains/blob/master/trains/backend_api/services/v2_8/tasks.py#L954
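Putting it together, a best-effort sketch (note that _edit is an internal API, the repo URL is a placeholder, and the exact script field names should be double-checked against the task object definition linked above):
```
from clearml import Task

# Create an empty Task shell
task = Task.create(project_name="examples", task_name="created manually")

# Full internal representation of the Task
print(task.data)

# Update execution entries directly (field names per the backend schema)
task._edit(script={"repository": "https://github.com/your/repo.git", "branch": "main"})
```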
BTW: when you have a sample script working, consider PR-ing it, I'm sure it will be useful for others 🙂 (also a great way to get us involved with debuggin...
Hi ConvolutedSealion94
You can archive / delete the SERVING-CONTROL-PLANE Task from the DevOps project in the UI.
Do notice you will need to make sure the clearml-serving is updated with a new session ID or remove it (i.e. take down the pods / docker-compose)
Make sense ?
Were you able to interact with the service that was spun up? (how was it spun up?)
If I were to push the private package to, say, Artifactory, is it possible to use that to do the install?
Yes that's the recommended way 🙂
You add the private repo here, for the agent to use:
https://github.com/allegroai/clearml-agent/blob/e93384b99bdfd72a54cf2b68b3991b145b504b79/docs/clearml.conf#L65
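For example, a sketch of the relevant section in the agent's clearml.conf (the Artifactory URL is a placeholder):
```
agent {
    package_manager {
        # extra pip index the agent will use when installing the Task requirements
        extra_index_url: ["https://artifactory.example.com/artifactory/api/pypi/my-pypi/simple"]
    }
}
```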
are you referring to extra_docker_shell_script ?
Correct
the thing is that this runs before you create the virtual environment, so then in the new environment those settings are no longer there
Actually that is better, because this is what we need to set up pip before it is used. So instead of passing --trusted-host
just do:
extra_docker_shell_script: ["echo "[global] \n trusted-host = pypi.python.org pypi.org files.pythonhosted.org YOUR_S...
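In other words, a sketch of the idea, writing a pip.conf inside the container before the venv is created (add your own host to the list as needed; the truncated value above presumably ends with your own server):
```
agent {
    extra_docker_shell_script: [
        "echo '[global]' > /etc/pip.conf",
        "echo 'trusted-host = pypi.python.org pypi.org files.pythonhosted.org' >> /etc/pip.conf"
    ]
}
```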
Hi LovelyHamster1
As you noted, passing overrides in Args/overrides, for example ['training.max_epochs=1000'], should work when running with the agent.
Could you verify with the latest RC, there was a fix to support the latest hydra version:
pip install clearml==0.17.5rc5
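To illustrate what running it through the agent looks like, a rough sketch (task ID and queue name are placeholders, and the exact value formatting of Args/overrides may differ):
```
from clearml import Task

# Clone the base Task, set the hydra overrides the agent should apply, and enqueue it
cloned = Task.clone(source_task="<base-task-id>")
cloned.set_parameters({"Args/overrides": "['training.max_epochs=1000']"})
Task.enqueue(cloned, queue_name="default")
```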
However, are you thinking of including these callback features in the new pipelines as well?
Can you see a good use case ? (I mean the infrastructure supports it, but sometimes too many arguments is just confusing, no?!)
Hi WackyRabbit7 ,
Running in Docker mode provides you greater flexibility in terms of environment control, from switching cuda versions, to pre-compiled packages that are needed (think apt-get) etc. Specifically for DL if you are using multiple tensorflow versions, they are notorious for compiling against a specific CUDA version, and the only easy way to be able to switch between them would be different dockers. If you are a PyTorch user, then you are in luck, they have all the pytorch ver...
WackyRabbit7 if this is a single script running without a git repo, you will actually get the entire code in the uncommitted changes section.
Do you mean get the code from the git repo itself ?
So when the agent fires up it gets the hostname, which you can then get from the API,
I think it does something like "getlocalhost", a python function that is OS agnostic
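Presumably something along these lines:
```
import socket

# OS-agnostic hostname lookup
print(socket.gethostname())
```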
I think I found something, let me test my theory
The --template-yaml option allows you to use a full k8s YAML template (the overrides option is just overrides, which do not include most of the configuration options). We should probably deprecate it.
Hmm, I think you should use --template-yaml
ExcitedFish86 this is a general "dummy agent" that takes Tasks and executes them (no env created, no code cloned, as you suggested)
how does this work with HPO?
The HPO clones Tasks, changes arguments, pushes them into a queue, and monitors the metrics in real time. The missing part (from my understanding) was that the execution of the Tasks themselves required setup, and that you wanted multiple-machine support; in order to overcome it, I posted a dummy agent that just runs the Tasks.
(Notice...
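For context, a rough sketch of the HPO side pushing clones into the queue the dummy agent listens on (task ID, metric names and ranges are placeholders):
```
from clearml import Task
from clearml.automation import (
    HyperParameterOptimizer,
    UniformIntegerParameterRange,
    RandomSearch,
)

optimizer = HyperParameterOptimizer(
    base_task_id="<template-task-id>",
    hyper_parameters=[
        UniformIntegerParameterRange("Args/epochs", min_value=5, max_value=50),
    ],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=RandomSearch,
    execution_queue="dummy-agent-queue",   # the queue the dummy agent pulls from
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```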
Hi @<1689808977149300736:profile|CharmingKoala14> , let me double check that
Assuming the git repo looks something like:
.git
readme.txt
module
 |
 +---- script.py
The working directory should be "."
The script path should be: "-m module.script"
And under the Configuration/Args, you should have:
args1 = value
args2 = another_value
Make sense?
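If you'd rather set this up from code instead of the UI, a sketch with Task.create (repo URL and names are placeholders; whether the script field accepts the -m form exactly as the UI does is an assumption worth verifying):
```
from clearml import Task

task = Task.create(
    project_name="examples",
    task_name="module entry point",
    repo="https://github.com/your/repo.git",
    branch="main",
    script="-m module.script",          # module-style entry point (assumption, see note above)
    working_directory=".",
    argparse_args=[("args1", "value"), ("args2", "another_value")],
)
```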