MysteriousBee56 what do you mean "delete a worker"?
stop the agent running remotely ?
There seems to be a problem with multiprocessing: Although I stopped the task,
You mean you "aborted the task" from the UI?
- There is a memory leak somewhere, please see the screenshot of datadog memory consumption
I'm assuming from the leftover processes?
Python 3.8/Pytorch 1.11/clearml-sdk 1.9.0/clearml-agent 1.4.1
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
But I am starting to wonder whether it would be easier just changing sys.path on the scripts that use the sibling libs.
that depends, how would the sibling packages get to a remote machine ?
Right, so this "vault" design is built into the paid tiers of ClearML to achieve exactly that. Long story short, users can put their credentials/configs on the clearml-server and the agent (or the clients) will pull and merge them into the execution.
It's very cool and works really nice, but not part of the open source (or the SaaS tier).
What you could do is store these configurations on the Task itself (one way or another). Maybe for example have an empty definitions.py
file part of ...
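If that direction helps, here is a minimal sketch of the idea (the file name definitions.py, the project/task names and the configuration name are placeholders, not something ClearML mandates): the script registers a local file as a Task configuration, and when an agent re-executes the Task, connect_configuration() hands back the stored copy.
```python
from pathlib import Path
from clearml import Task

# placeholder project/task names
task = Task.init(project_name="examples", task_name="sibling-libs-config")

# Store the local definitions file on the Task itself.
# When running locally this returns the original path; when the Task is
# executed by an agent it returns a path to the copy stored on the server.
config_path = task.connect_configuration(
    configuration=Path("definitions.py"), name="definitions"
)
print("using configuration file:", config_path)
```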
Thanks StrongHorse8
Where do you think would be a good place to put a more advanced setup? Maybe we should add an entry for DevOps? Wdyt?
Oh!
I see this is using the colab as remote agent (i.e. to launch jobs on it),
[ColabKernelApp] CRITICAL | Bad config encountered during initialization: The 'kernel_class' trait of <__main__.ColabKernelApp object at 0x7fa41b29e5c0> instance must be a type, but 'google.colab._kernel.Kernel' could not be imported
Can you send the full log?
Hi @<1686547344096497664:profile|ContemplativeArcticwolf43>
In the 2nd 'Getting Started' tutorial,
Could you send a link to the specific notebook?
. But whenever a task is picked, it fails for the following
You mean after the Task.init call?
I think it is only in get_task
(and by default it is true)
I think query task does not filter the
Hi SarcasticSparrow10
I think the default search is any partial match, let me check if there is a way to do some regexp / wildcard
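For context, this is roughly what the name filter looks like from code (a sketch; the project/name values are made up, and task_name is matched as a pattern against the Task name):
```python
from clearml import Task

# task_name is matched as a partial / regexp-style pattern on the Task name
tasks = Task.get_tasks(project_name="examples", task_name="train")
for t in tasks:
    print(t.id, t.name)
```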
Actually what my service does is collect stdout/stderr from the Docker socket
That's exactly how the agent works, it cannot really filter it, it logs everything by default for full visibility ...
Hi ColossalAnt7
Try ctrl-F5 and refresh the page?!
It seems you are missing a few buttons 🙂
Are you inheriting from their docker file ?
LOL I keep typing clara without noticing (maybe it's the nvidia thing I keep thinking about)
Carla makes much more sense 🙂
ReassuredTiger98 quick update, the issue was located, next RC will already contain a fix.
In the meantime, you can avoid it by limiting the pip version:
https://github.com/allegroai/clearml-agent/blob/715f102f6d98a44131d5bee909ee779b456c6229/docs/clearml.conf#L67
pip_version: "<20.2"
SoreDragonfly16
btw: The difference between the two graphs is the ratio of the graph display, that's it 🙂
what format should I specify it
requirements.txt format e.g. ["package >= 1.2.3"]
Would this enforce that package on various components
This is a per component control, so you can have different packages / containers based on the component
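For illustration, a sketch with decorator-based pipelines (component names, packages and the "label" column are made up for this toy example), where each component declares its own packages / container:
```python
from clearml.automation.controller import PipelineDecorator

# each component gets its own package list (and optionally its own docker image)
@PipelineDecorator.component(packages=["pandas>=1.2"], docker="python:3.9-bullseye")
def preprocess(csv_path: str):
    import pandas as pd
    return pd.read_csv(csv_path).dropna()

@PipelineDecorator.component(packages=["scikit-learn"])
def train(df):
    from sklearn.linear_model import LogisticRegression
    # 'label' column is an assumption of this toy example
    model = LogisticRegression()
    model.fit(df.drop(columns=["label"]), df["label"])
    return model
```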
Would it then no longer capture import statements?
This is replacing the auto detected packages, but obviously this fails to detect your internal repo package, which is the main issue here.
How is "internal package" installed, in o...
available agent, i.e. not running anything else.
I mean how long would instance 1 wait until instance 2 of the experiment is up and running?
In other words, what happens if all the nodes/agents are working and we still "need" an additional instance.
This is basically like "pre-allocating" the nodes, only they wait in real-time until the additional node joins them.
Agent A pulls the 3-node Task, the Task clones itself (Task B) and enqueues it on the "very high priority queue", Task A waits until Task B is ru...
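A rough sketch of that clone-and-enqueue part from the SDK side (the queue name is just an example):
```python
from clearml import Task

# Task A (the currently running instance) clones itself and pushes the clone
# onto a high-priority queue, then waits for it to come up
task_a = Task.current_task()
task_b = Task.clone(source_task=task_a, name=task_a.name + " (node 2)")
Task.enqueue(task_b, queue_name="very_high_priority")  # example queue name
```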
OHH nice, I thought it was just some kind of job queue on up-and-running machines
It's much more than that, it's a way of life 🙂
But seriously now, it allows you to use any machine as part of your cluster, and send jobs for execution from the web UI (any machine, even just a standalone GPU machine under your desk, or any cloud GPU instance, or a mix of the two :)
Maybe I need to change something here:
apiserver.conf
Not sure, I'm still waiting on an answer... It...
(It would be nice to have all the Pypi releases tagged in github btw)
I wanted to say, we listen ... and point to the tag, but for some reason it was not pushed LOL.
'config.pbtxt' could not be inferred. please provide specific config.pbtxt definition.
This basically means there is no configuration on how to serve the model, i.e. the size/type of the lower (input) layer and the output layer.
You can either store the configuration on the creating Task, like is done here:
https://github.com/allegroai/clearml-serving/blob/b5f5d72046f878bd09505606ca1147d93a5df069/examples/keras/keras_mnist.py#L51
Or you can provide it as standalone file when registering the mo...
One option is definitely having a base image that has the things needed. Anything else? Thanks!
This is a bit complicated: to get the cache to kick in you have to mount an NFS volume into the pod as the cache (to create a persistent cache)
Basically, spin up an NFS pod to store the cache, and change the glue job template yaml to mount it into the pod (see the default cache folders:
/root/.cache/pip and /root/.clearml/pip-download-cache)
Make sense ?
Hi ItchyJellyfish73
You can always archive a Task/Model even when published
In the UI you can right-click and choose archive.
From code you need to add a system tag "archived":
from clearml import Task
t = Task.get_task(task_id='aabb')
t.set_system_tags(t.get_system_tags() + ['archived'])
And similarly for Model(model_id='aabb')
So, what I am referring to is the ability of a system to allow some rigor and robustness of tracking of experiments, and also enforcing some thoughts on how things might be deployed, early on in the development process, whilst not being overly prescriptive and cumbersome
I cannot agree more!!
VivaciousPenguin66 We are working on trying to better understand how to solve this very delicate balancing act and offer some sort of "JIRA" for ML.
If this is okay with you, once product pe...
Hi, I would like to understand how I can set the pip cache location for my agent,
ClumsyElephant70 by default the pip cache (and all other cache folders) are mounted back into the host itself ~/.clearml/
I'm assuming the idea is a shared cache, if this is the case, do:
docker_pip_cache = ~/my_shared_nfs/pip-cache
https://github.com/allegroai/clearml-agent/blob/e3e6a1dda81bee2dd20a64d09746568e415f1823/docs/clearml.conf#L139
In theory, one could go over previously executed tasks, and create a copy of a specific scalar metric.
ShallowCat10 does that make sense in your scenario ?
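Something along these lines, as a sketch (the task id and the "loss"/"train" title/series are placeholders; get_reported_scalars() returns a {title: {series: {"x": [...], "y": [...]}}} structure):
```python
from clearml import Task

# pull a scalar series from a previously executed task ...
source = Task.get_task(task_id="aabb")  # placeholder task id
scalars = source.get_reported_scalars()

# ... and re-report it on a new task
target = Task.init(project_name="examples", task_name="copied-scalars")
logger = target.get_logger()
series = scalars["loss"]["train"]  # placeholder title / series names
for x, y in zip(series["x"], series["y"]):
    logger.report_scalar(title="loss", series="train", value=y, iteration=int(x))
```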
but never executes/enqueues them (they are all in Draft mode).
All pipeline steps are not enqueued ?
Is the pipeline controller itself running?
Is this like a local minio?
What do you have under the sdk/aws/s3 section?
From the docs I think what's going on is that https://opennmt.net/OpenNMT-tf/package/opennmt.Runner.html#opennmt.Runner.train spins up a new subprocess, and the training itself happens in the subprocess.
If this is the case this will explain the lack of automagic, as the subprocess is lacking the "Task.init" call
wdyt, could that be the case ?
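If that is what happens, one hedged workaround is to report metrics explicitly from the process that does hold the Task (the project/task names and values below are placeholders), e.g. from a training callback or by parsing the subprocess output:
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="opennmt-train")  # placeholder names
logger = task.get_logger()

# manual reporting, since the automagic hooks are not active in the subprocess
logger.report_scalar(title="loss", series="train", value=0.42, iteration=100)
```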