Oh, that makes sense. This depends on how you set up the ClearML k8s glue (because the resource allocation is done by k8s). A good hack to limit the number of containers per GPU is to set a RAM limit per pod; then k8s will know to limit the number of pods on the same GPU machine.
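As a rough sketch of the idea (the exact pod template depends on how you deployed the glue; the sizes below are placeholders), it is just a standard k8s resource limit on the pod spec:

```yaml
# Hypothetical pod-template fragment: with a 16Gi memory limit per pod,
# k8s can schedule at most floor(node_RAM / 16Gi) such pods on one node,
# which indirectly caps the number of containers per GPU machine.
resources:
  requests:
    memory: "16Gi"
  limits:
    memory: "16Gi"
    nvidia.com/gpu: 1
```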
wdyt?
So are you saying why do we need to install a specific pip version ?
You can "disable it" by specifying a very loose version range: pip_version: "<40"
https://github.com/allegroai/clearml-agent/blob/077148be00ead21084d63a14bf89d13d049cf7db/docs/clearml.conf#L67
This is an odd error, could it be conda is not installed in the container (or in the Path) ?
Are you trying with the latest RC?
Hi GreasyPenguin14
Could you tell me what the differences are and why we should use ClearML data?
The first difference is in the approach itself: DVC ties the data to the code (i.e. the git repo), whereas we (ClearML, but not just us) think data should be abstracted from the code base and become a standalone argument, allowing users to build/execute against different datasets/versions. ClearML Data becomes part of the workflow as it is visible from the UI including the abili...
So I have to upload and run a script with its default value first (since I don't have an initial task id), then clone it, edit the configuration inside that newly cloned one, get the id of the clone, and pass this into my script as the task_id and run it from my machine?
Correct. You can also create it (from code), "Reset" it (right click in the UI), and then edit it.
Is there a way do this without running it on my machine?
check clearml-task
it is a CLI that will create ...
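As a hedged sketch (the project, script, and queue names below are placeholders; see clearml-task --help for your exact flags):

```shell
# Create a Task from a local script and enqueue it for an agent,
# without ever running it on your own machine.
clearml-task --project my_project --name remote_run \
             --script train.py --queue default
```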
So you want to have two Tasks and connect the two ?
Maybe the best approach is to make the current_task the parent of the Dataset Task? dataset._task.set_parent(Task.current_task())
I tried specifying helpers functions but it still gives the same error.
What's the error you are getting ?
And when exactly are you getting the "user aborted" message?
How do you start the process (are you manually running it, or is it an agent, or maybe pycharm?)
Can you provide the full log ?
WackyRabbit7 This is a json representation of the entire plot (basically how plotly sees it).
What you are after is: full_json[0]['cells']['values']
Which is a list of lists (row order) in the table
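A minimal sketch of pulling the values out, assuming full_json is the plot JSON already parsed from the UI/API (the sample structure below is a hypothetical stand-in mirroring how plotly serializes a table):

```python
import json

# Hypothetical plot JSON as plotly would serialize a table;
# in practice you would get this string from the ClearML UI/API.
raw = json.dumps([{
    "type": "table",
    "header": {"values": ["epoch", "loss"]},
    "cells": {"values": [[0, 1, 2], [0.9, 0.5, 0.3]]},
}])

full_json = json.loads(raw)
# The list of lists holding the table contents
values = full_json[0]["cells"]["values"]
print(values)  # → [[0, 1, 2], [0.9, 0.5, 0.3]]
```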
SmallDeer34
I think this is somehow related to the JIT compiler torch is using.
My suspicion is that JIT cannot be initialized after something happened (like a subprocess, or a thread).
I think we managed to get around it with 1.0.3rc1.
Can you verify ?
Can clearml-agent currently detect this?
Hmm, you mean will the agent clean itself up?
from time import sleep
import tqdm
from clearml import Task

task = Task.init(project_name='debug', task_name='test tqdm cr cl')
print('start')
for i in tqdm.tqdm(range(100), dynamic_ncols=True):
    sleep(1)
print('done')
This code snippet works as expected (console will show the progress at the flush interval without values in between). What's the difference ?!
Yes, just set system_site_packages: true
in your clearml.conf
https://github.com/allegroai/clearml-agent/blob/d9b9b4984bb8a83914d0ec6d53c86c68bb847ef8/docs/clearml.conf#L57
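In context, the relevant fragment of clearml.conf (a sketch; see the linked default config for the full agent section) looks like:

```
# clearml.conf (agent section) - let the created venv
# inherit the system site-packages
agent {
    package_manager {
        system_site_packages: true
    }
}
```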
TrickyRaccoon92
I guess elegant is the challenge 🙂
What exactly is the use case ?
The reason is that it is logged as an image, not a plot 🙂
SubstantialElk6 on the client side?
Is ClearML combined with DataParallel
or DistributedDataParallel
officially supported / should that work without many adjustments? Yes, it is supported and should work.
If so, would it be started via python ...
or via torchrun ...
? Yes it should, hence the request for a code snippet to reproduce the issue you are experiencing.
What about remote runs, how will they support the parallel execution? Supported. You should see in the "script entry" something like "-m torch.di...
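For reference, a typical local multi-GPU launch (a sketch; the script name and GPU count are placeholders) looks like:

```shell
# Launch train.py across 4 local GPUs with DistributedDataParallel
torchrun --nproc_per_node=4 train.py
```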
Interesting, do you think you could PR a "fixed" version ?
https://github.com/allegroai/clearml-web/blob/2b6aa6043c3f36e3349c6fe7235b77a3fddd[β¦]app/webapp-common/shared/single-graph/single-graph.component.ts
Hi DefeatedCrab47
You should be able to change the Web server port, but the API port (8008) cannot be changed. If you can log in to the web app and create a project, it means everything is okay. Notice that when you configure trains (trains-init) the port numbers are correct 🙂
Hi @<1523703472304689152:profile|UpsetTurkey67>
I circumvented the problem by putting timestamp in task name, but I don't think this is necessary.
Just pass reuse_last_task_id=False
to Task.init, it will never try to reuse them 🙂
None
Although I didn't understand why you mentioned
torch
in my case?
Just a guess 🙂 other frameworks do multi-process as well,
I would guess it relates to parallelization of Tasks execution of the
HyperParameterOptimizer
class?
Yes, that might be it. It's basically a by-product of using python's "Process" class for multiprocessing. We are working on a fix; not trivial, unfortunately.
however, this will also turn off metrics
For the sake of future readers, let me clarify on this one: turning it off with auto_connect_frameworks={'pytorch': False}
only affects the auto logging of torch.save/load
(side note: the reason is pytorch does not have built-in metric reporting, i.e. it is usually done manually, and these days most probably with tensorboard; for example lightning / ignite use tensorboard as the default metric reporting),
Hi WittyOwl57
I think what happens is it auto-logs the joblib load/save calls; these calls track models used/created by the code and attach them to the model repository representing these models.
I'm assuming there are multiple load/save calls, and there are multiple model instances pointing to the same local file "file:///tmp/..." . The warning basically says it is re-registering existing models.
Make sense ?
So, good news: (1) the dashboard is being worked on as we speak. (2) We released clearml-serving doing exactly that; the next release of clearml-serving will include integration with kfserving (under the hood), essentially managing the serving endpoints on top of the k8s cluster. wdyt?
BTW: if you feel like pushing forward with integration I'll be more than happy to help PRing new capabilities, even before the "official" release
Sadly no 🙁
(I mean you could quickly write a reader for TB and report it, but it is not built into the SDK)