Then we can figure out what can be changed so CML correctly registers process failures with Hydra
JumpyPig73 quick question: does the state of the Task change immediately when it crashes? Are you running it with an agent (that Hydra triggers)?
If this is vanilla clearml with Hydra runners, what I suspect happens is that Hydra overrides the signal callback clearml adds (like Hydra, clearml needs to figure out if the process crashed), and then clearml's callback is never cal...
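To make the suspected mechanism concrete, here is a minimal sketch using only Python's standard `signal` module. It is not ClearML's or Hydra's actual code: `tracker_handler` and `launcher_handler` are hypothetical stand-ins for the two callbacks. It shows that a second `signal.signal` registration silently replaces the first, and one possible way to chain them so both run.

```python
import signal

calls = []


def tracker_handler(signum, frame):
    # Hypothetical stand-in for the experiment tracker's "process crashed" callback.
    calls.append("tracker")


def launcher_handler(signum, frame):
    # Hypothetical stand-in for a callback that a launcher installs later.
    calls.append("launcher")


def install_chained(signum, new_handler):
    """Install new_handler, but keep calling the handler that was there before."""
    previous = signal.getsignal(signum)

    def chained(sig, frame):
        new_handler(sig, frame)
        if callable(previous):
            previous(sig, frame)

    signal.signal(signum, chained)
    return previous


def demo():
    """Return (calls after a plain override, calls after chaining)."""
    original = signal.getsignal(signal.SIGTERM)
    try:
        # Plain override: the second registration replaces the first one.
        signal.signal(signal.SIGTERM, tracker_handler)
        signal.signal(signal.SIGTERM, launcher_handler)
        # Invoke the installed handler directly instead of sending a real signal.
        signal.getsignal(signal.SIGTERM)(signal.SIGTERM, None)
        overridden = list(calls)
        calls.clear()

        # Chained registration: both callbacks run.
        signal.signal(signal.SIGTERM, tracker_handler)
        install_chained(signal.SIGTERM, launcher_handler)
        signal.getsignal(signal.SIGTERM)(signal.SIGTERM, None)
        chained = list(calls)
        calls.clear()
    finally:
        # Restore whatever was installed before the demo ran.
        signal.signal(signal.SIGTERM, original if original is not None else signal.SIG_DFL)
    return overridden, chained
```

With the plain override only the later handler is called, which would match a task that never gets marked as failed.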
MoodyCentipede68 can you post the full docker-compose log (from spinning it until you get the error?)
You can just pipe the output to a file with: `docker-compose ... up > log.txt`
That said, you might have accessed the artifacts before any of them were registered
Well, from the error it seems there is no layer called "dense", hence Triton fails to find the layer that returns the result. Does that make sense?
Hi CheerfulGorilla72
The "installed packages" section is used as the "requirements.txt" for the agent.
Are you saying the autodetection fails to detect all packages? In "manual execution" (i.e. not when the agent is running the code), you can tell it to just take the local requirements.txt:
`Task.force_requirements_env_freeze(requirements_file="./requirements.txt")`
Notice the above call should be executed before Task.init:
`task = Task.init(...)`
3. If you clear all the "installed packages" se...
SarcasticSquirrel56
if I configure manually the pods for the different nodes, how do I make clearml server aware that those agents exist?
Basically the agents register themselves on your clearml-server, and they register which Queue(s) they listen to. In other words, the interface for choosing between the different types of machines/GPUs is enqueuing the Task to different queues.
For example: Queue(1): "CUDA11_GPUx1" , Queue(2): "CUDA10_GPUx1"
Make sense ?
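As a rough sketch of that idea, the snippet below picks a queue by CUDA version and enqueues an existing Task into it. The queue names are the ones from the example above; the mapping and the helper names (`QUEUE_BY_CUDA`, `pick_queue`, `enqueue_for_cuda`) are hypothetical, and it assumes agents are already listening on those queues.

```python
# Hypothetical mapping from a required CUDA major version to a queue name.
QUEUE_BY_CUDA = {11: "CUDA11_GPUx1", 10: "CUDA10_GPUx1"}


def pick_queue(cuda_major):
    """Return the queue whose agents run the requested CUDA major version."""
    try:
        return QUEUE_BY_CUDA[cuda_major]
    except KeyError:
        raise ValueError(f"no queue configured for CUDA {cuda_major}") from None


def enqueue_for_cuda(task_id, cuda_major):
    """Enqueue an existing Task into the queue matching the CUDA version."""
    # Imported lazily so pick_queue() can be used without clearml installed.
    from clearml import Task

    task = Task.get_task(task_id=task_id)
    Task.enqueue(task, queue_name=pick_queue(cuda_major))
```

Whichever agent listens on the chosen queue picks the Task up, so the queue name is effectively the machine selector.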
EDIT:
I guess to achieve what I w...
So that agent on different nodes will probably require different cuda-version images.
That makes sense SarcasticSquirrel56
I would edit the helm chart (or deploy manually) based on a selector that will select the different nodes/gpus and assign the correct containers (i.e. matching CUDA versions to the diff GPUs / drivers)
BTW: you can also play around with the k8s glue, which dynamically spins up pods based on clearml Tasks.
wdyt?
Correct (if this is running on k8s, it is most likely passed via env variables, CLEARML_WEB_HOST etc.)
Wait, that makes no sense to me. The API from python and the API from the UI are getting the same data from the backend ...
What are you getting with:
`from clearml import Task`
`task = Task.get_task(task_id=<put task id here>)`
`print(task.models)`
CooperativeFox72 please see if you can send a code snippet to reproduce the issue. I'd be happy to solve it ...
Hi AstonishingRabbit13
now I'm training yolov5 and i want to save all the info (model and metrics) with clearml to my bucket..
The easiest thing (assuming you are running YOLOv5 with `python train.py`) is to add the following env variable:
`CLEARML_DEFAULT_OUTPUT_URI="..." python train.py`
Notice that you need to pass your GS credentials here:
https://github.com/allegroai/clearml/blob/d45ec5d3e2caf1af477b37fcb36a81595fb9759f/docs/clearml.conf#L113
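If you launch training from a Python wrapper rather than the shell, the same environment variable can be set for the child process. This is only a sketch: the helper names are hypothetical, the bucket URI is whatever you pass in, and the GS credentials still have to be configured as described above.

```python
import os
import subprocess
import sys


def env_with_output_uri(output_uri, base_env=None):
    """Return a copy of the environment with CLEARML_DEFAULT_OUTPUT_URI set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CLEARML_DEFAULT_OUTPUT_URI"] = output_uri
    return env


def run_training(output_uri, script="train.py"):
    """Run the training script with the output URI set only for that process."""
    return subprocess.run(
        [sys.executable, script],
        env=env_with_output_uri(output_uri),
        check=True,
    )
```

Copying the environment instead of mutating `os.environ` keeps the setting scoped to the training run.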
the use case i have is to allow people from my team to run their workloads on set of servers without stepping over each other..
So does that mean CPU only workloads?
Also are we afraid of fairness? (i.e. someone "taking" all the CPU for themselves)
Okay, the odd thing is that `git ls-remote --get-url origin` should have returned the same...
what's your git version? (git --version)
Oh I get it now, can you test: `git ls-remote --get-url github`
and then `git ls-remote --get-url`
Can you verify it fixes the timeout issue as well? (or some insight on how to reproduce the issue?)
Hi DepressedFox45
Basically move the import into the function, it will automatically detect the package.
@PipelineDecorator.component(...)
def step_one(...):
    import sklearn
    import pandas as pd
    # stuff
Make sense ?
You can also specify additional packages on the decorator:
@PipelineDecorator.component(..., packages=["tqdm>=2.1", "scikit-learn"])
def step_one(...):
    # code here
So does that mean "origin" solves the issue ?
LOL, thanks!
What sort of data would be stored in the venvs-build folder?
ClumsyElephant70 temporary (lifetime of the task execution) virtual environment, including the code etc. It is deleted and recreated for every new task launched (or restored from cache, if venvs_cache is enabled)
HollowPeacock58 seems like an internal issue copying this object: config.model
This is a complex object, and it seems that for some reason you cannot pickle it / copy it (see GH issue).
As a workaround, just do not connect this object.
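One way to apply that workaround defensively is to test whether an object survives a deep copy and a pickle round before connecting it. The sketch below uses only the standard library for the check; `can_copy_and_pickle` and `connect_if_safe` are hypothetical helper names, and `task` is assumed to be a ClearML Task (anything with a `connect(obj, name=...)` method works).

```python
import copy
import pickle
import threading


def can_copy_and_pickle(obj):
    """Return True if obj can be both deep-copied and pickled."""
    try:
        copy.deepcopy(obj)
        pickle.dumps(obj)
    except Exception:
        return False
    return True


def connect_if_safe(task, name, obj):
    """Connect obj to the task only if it passes the check; return whether it did."""
    if can_copy_and_pickle(obj):
        task.connect(obj, name=name)
        return True
    return False


# A plain dict passes; an object holding a lock does not.
plain_ok = can_copy_and_pickle({"lr": 0.1})
lock_ok = can_copy_and_pickle({"lock": threading.Lock()})
```

Objects that fail the check can be logged another way, for example by connecting a plain dict of the few fields you care about.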
Hi GiddyTurkey39 ,
When you say trains agent, are you referring to the trains agent command ...
I mean running the trains-agent daemon on a machine. This means you have a daemon pulling jobs from the execution queue and executing them (either in a virtual environment, or inside a docker).
You can read more at https://github.com/allegroai/trains-agent and https://allegro.ai/docs/concepts_arch/concepts_arch/
Is it sufficient to queue the experiments
Yes there is no ne...
model upload and registration i should pass something like 'xgboost': False or 'xgboost': False, 'scikit': False ?
Exactly! which framework are you using ?
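For reference, a small sketch of how such a dict could be built and passed. It assumes the dict in question is the `auto_connect_frameworks` argument of `Task.init`, which the thread does not name explicitly; `framework_flags` and `init_task` are hypothetical helper names.

```python
def framework_flags(*disabled):
    """Build a dict that turns off automatic logging for the named frameworks."""
    return {name: False for name in disabled}


def init_task(project_name, task_name, *disabled):
    """Create a Task with automatic logging disabled for the given frameworks."""
    # Imported lazily so framework_flags() can be used without clearml installed.
    from clearml import Task

    return Task.init(
        project_name=project_name,
        task_name=task_name,
        # Assumption: this is the argument the question refers to.
        auto_connect_frameworks=framework_flags(*disabled),
    )
```

For example, `framework_flags("xgboost", "scikit")` produces the second dict from the question.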
about 2, I refer to the names of the models.
Hmm, that is a good point to test. Usually this is based on the Task name (I think), so if the Task name contains the HPO params, the model name should contain them as well. Do you see the HPO params in the Task name? Should we open a Gi...
Before this line, call Task.init
The second problem that I am running into now, is that one of the dependencies in the package is actually hosted in a private repo.
Add your private repo to the extra index section in the clearml.conf (agent.package_manager.extra_index_url).
Which version? is this reproducible in this example?
(can you try with the latest clearml version 1.13.2?)
I saw documentation, but I can't make the proper dict object for hyperparams
I see, this is what you are after (I think)
https://github.com/allegroai/clearml/blob/fb644fe9ec6be36b8f2f70a34256fbdc593d663a/clearml/backend_api/services/v2_20/tasks.py#L3138