TenseOstrich47 it's based on a free "index", so the first index not in use will be captured. But if you remove agents, the order will change, e.g. if you take down worker #1, the next worker you spin up will be #1 because that index is not taken.
In a pipeline, can I control the tags of the tasks that the pipeline creates?
add_pipeline_tags adds tags from the pipeline to the tasks, I suppose? But I also need to clear the existing tags in those created tasks.
add_pipeline_tags will add the unique ID of the pipeline execution. If you want to add specific tags you can use task_overrides and provide:
pipe.add_step(..., task_overrides={'tags': ['my', 'tags']})
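For context, a minimal end-to-end sketch of the above (the project, task names, and step layout are placeholders, and it assumes a pre-existing template task for the step):
```python
from clearml import PipelineController

# add_pipeline_tags=True tags every task created by this pipeline
# with the unique ID of the pipeline execution.
pipe = PipelineController(
    name="my pipeline",   # placeholder
    project="examples",   # placeholder
    version="1.0.0",
    add_pipeline_tags=True,
)

# task_overrides lets you override fields on the cloned step task,
# e.g. replacing its tags entirely.
pipe.add_step(
    name="step_one",
    base_task_project="examples",        # placeholder: project of the template task
    base_task_name="step one template",  # placeholder: name of the template task
    task_overrides={"tags": ["my", "tags"]},
)

pipe.start()
```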
Sorry ScaryLeopard77 I missed the reply,
the tutorial in the readme of clearml-serving repo doesn't mention it though. Where should I set it?
oh dear ... you are right (I think it was there in previous versions)
clearml-serving --help
https://github.com/allegroai/clearml-serving/blob/ce6ec847b1e01c6f5bf35d638e6ceb8148db8a7a/clearml_serving/main.py#L142
This is the equivalent of what is created here in the example:
https://github.com/allegroai/clearml-serving/blob/ce6ec847b...
TrickySheep9
Is there a way to see a roadmap on such things?
Hmm I think we have some internal one, I have to admit these things change priority all the time (so it is hard to put an actual date on them).
Generally speaking, pipelines with functions should be out in a week or so, TaskScheduler + Task Triggers should be out at about the same time.
UI for creating pipelines directly from the web app is in the works, but I do not have a specific ETA on that
the other repos I have are constantly worked on and changing too
Not only will it be cloned automatically, the git diff of the sub-modules is stored as well 🙂
That wasn't scheduled by ClearML.
This means that from the ClearML perspective they are "manual", i.e. the job itself (by calling Task.init) creates the experiment in the system and fills in all the fields.
But for a k8s job, I'm still unsuccessful.
HelpfulDeer76 When you say "unsuccessful", what exactly do you mean?
Could it be they are reported to the clearml demo server (the default server if no configuration is found) ?
Clearml 1.13.1
Could you try the latest (1.16.2)? I remember there was a fix specific to Datasets
YummyWhale40 no idea what the pytorch-lightning guys did there. Let me check the actual code.
Hi ReassuredOwl55
How would I find Tasks that have the same code with different inputs/parameters?
Assuming you have the git repo
you can do:
Task.query_tasks(..., task_filter={'_all_': dict(fields=['script.repository'], pattern='github.com/user/repo')})
wdyt?
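A slightly fuller sketch of the same idea (the repo URL is a placeholder; this assumes Task.query_tasks returns the matching task IDs):
```python
from clearml import Task

# Find all tasks whose recorded script repository matches the pattern.
# 'github.com/user/repo' is a placeholder - use your own repository URL.
task_ids = Task.query_tasks(
    task_filter={
        '_all_': dict(fields=['script.repository'], pattern='github.com/user/repo'),
    },
)

# Pull each task and compare its inputs / hyperparameters.
for task_id in task_ids:
    task = Task.get_task(task_id=task_id)
    print(task.id, task.get_parameters())
```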
Was wondering how it can handle 10s, 100s of models.
Yes, it supports dynamically loading/unloading models based on requests
(load balancing multiple nodes is disconnected from it, but assuming they are under diff endpoints, the load balancer can be configured to route accordingly)
okay, this seems like a broken pip install (python 3.6)
Can you verify it fails on another folder (maybe it's a permissions thing, for example if you run in docker mode, then the permissions will be root, as the docker is creating those folders)
ClumsyElephant70 the odd thing is the error here:
docker: Error response from daemon: manifest for nvidia/cuda:latest not found: manifest unknown: manifest unknown.
I would imagine it will be with "nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04" but the error is saying "nvidia/cuda:latest"
How could that be ?
Also, can you manually run the same command? i.e.
docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash
GleamingGrasshopper63 can you ping your api server?
ping api.server.here
Also, what's the api server you configured? (ip:8008?)
Any chance this is a local machine, i.e. the colab machine cannot get back into the clearml server running locally?
Hi JoyousElephant80
Another possibility would be to run a process somewhere that periodically polls ClearML Server for tasks that have recently finished
this is the easiest way to implement what you are after, and have full control over the logic itself.
Basically you inherit from the Monitor class and implement the callback function:
https://github.com/allegroa...
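A minimal sketch of that pattern (hedged: the Monitor base class and the process_task / monitor method names are assumed from the slack_alerts example in the clearml repo; check the linked example for the exact interface):
```python
from clearml.automation.monitor import Monitor


class FinishedTaskMonitor(Monitor):
    # Periodically polls the ClearML Server and reacts to newly detected tasks.

    def process_task(self, task):
        # Called for every newly detected task - put your own logic here,
        # e.g. send a notification or kick off a follow-up job.
        print("Task finished: {} ({})".format(task.id, task.name))


# Blocking loop; pool_period is assumed to be in seconds here.
FinishedTaskMonitor().monitor(pool_period=60.0)
```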
AdventurousRabbit79 you mean like minio / ceph ?
But it's running in docker mode and it is trying to ssh into the host machine and failing
It is not sshing into the machine, it is sshing directly into the container.
Notice the port it is sshing to is 10022, which is mapped into the container
(as I see the services worker is only in the services-queue, and not in my default queue, where my other servers/workers are)
So basically the service-mode is just a flag passed to the agent, and the services queue is the name of the queue it will pull from.
If I want a normal worker also
You can just add another section to the docker-compose, or run it manually after you spin the docker-compose.
LazyFox65 wdyt ?
Thanks!
In the conf file, I guess this will be where ppl will look for it.
the question remains though: why won't docker containers launch on services?
Maybe something with the way it was launched in the docker-compose?
(I'm assuming it will fail on any docker container regardless, right?!)
Hi @<1619505588100665344:profile|GrievingHare27>
My understanding is that initiating a task with Task.init() captures the code for the entire notebook. I'm facing difficulties when attempting to build a final training pipeline (in a separate notebook) that uses only certain functions from the other notebooks/tasks as pipeline steps.
Well, this is kind of the limit of working with jupyter notebooks; referencing code from one to another is not really feasible (of co...
If Task.init() is called in an already running task, don't reset auto_connect_frameworks? (if I am understanding the behaviour right)
Hmm we might need to somehow store the state of it ...
Option to disable these in the clearml.conf
I think this will be too general, as this is code specific, no?
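For reference, a minimal sketch of the code-side alternative (the project and task names are placeholders; it assumes you only want to switch off the PyTorch integration for this one script):
```python
from clearml import Task

# Disable only the PyTorch auto-logging (the torch.save/torch.load binding),
# keeping the rest of the automatic framework integrations enabled.
task = Task.init(
    project_name="examples",          # placeholder
    task_name="no pytorch autolog",   # placeholder
    auto_connect_frameworks={"pytorch": False},
)
```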
Do people use ClearML with huggingface transformers? The code is std transformers code.
I believe they do 🙂
There is no real way to differentiate between "storing a model" using torch.save and storing configuration ...
for example, one notebook will be dedicated to explore columns, spot outliers and create transformations for specific column values.
This actually implies each notebook is a standalone "process", which makes a ton of sense. But this is where notebooks and proper SW design break: in traditional SW, the notebooks are actually python files, and then of course you can import one from another; unfortunately this does not work in notebooks...
If you are really keen on using notebooks I wou...
I see...
Currently (and this will change soon) the entire delta is stored in a single file, so there is no real way to download a "subset" of the data, only a parent version 🙂
Let's say that this small dataset has an ID ....
Yes this would be exactly the way to do so:
```
from clearml import Dataset  # import added for completeness; `task` is the current Task

param = {'dataset': small_train_dataset_id_here}
task.connect(param)
dataset_folder = Dataset.get(param['dataset']).get_local_copy()
...
```
Locally it will use the small_train_dataset_id_here, then whe...
Hi @<1597762318140182528:profile|EnchantingPenguin77>
--ipc=host actually means that there is no need for the --shm-size argument; it means you have access to the entire shared memory on the host machine. I'm assuming that the GPU card just does not have enough VRAM ...
None
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the
@<1556450111259676672:profile|PlainSeaurchin97> did you check this example?
None
Thanks VivaciousPenguin66 !
BTW: if you are running the local code with conda, you can set the agent to use conda as well (notice that if you are running locally with pip, the agent's conda env will use pip to install the packages to avoid version mismatch)