hi @<1727497172041076736:profile|TightSheep99>
yes, but also set the dataset chunk file size to something smaller than the default 500GB (--chunk-size)
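A minimal SDK sketch of the same thing (the dataset name/project are placeholders; chunk_size is in MB):
```python
from clearml import Dataset

# create a new dataset version (name/project here are placeholders)
ds = Dataset.create(dataset_name="my_dataset", dataset_project="datasets")
ds.add_files(path="./data")

# upload with a smaller chunk size (value is in MB)
ds.upload(chunk_size=100)
ds.finalize()
```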
Thanks @doru! BTW, if you are running code from outside the trains repo, do you still get the double package?
I think it fails because it tries to install trains twice. Could you remove the trains package and test? I'm also curious how you ended up with both installed?!
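i.e. something along these lines (assuming it was pip-installed):
```bash
pip uninstall -y trains
```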
HealthyStarfish45 If I understand correctly, the trains-agent is running as a daemon (i.e. automatically pulling jobs and executing them). The only caveat is that cancelling a daemon will also cancel the Task that daemon is currently executing.
Other than that, sounds great!
You mean to spin up a pod with the agent inside it (daemon in services mode)?
Or connect the services queue to the k8s cluster (i.e. define a pod template that uses CPU and not a lot of RAM)?
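For the first option, something along these lines should do it (a sketch; the queue name is just an example):
```bash
# run the agent as a long-lived services-mode daemon inside the pod,
# pulling jobs from the "services" queue without allocating a GPU
clearml-agent daemon --services-mode --queue services --cpu-only --docker
```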
https://hub.docker.com/layers/nvidia/cuda/10.1-cudnn7-runtime-ubuntu18.04/images/sha256-963696628c9a0d27e9e5c11c5a588698ea22eeaf138cc9bff5368c189ff79968?context=explore
the docker image is missing cuDNN, which is a must for TF to work 🙂
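If you are setting the image from code, something like this (a sketch, assuming the Task is set up before being enqueued):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="tf training")
# use a cudnn-enabled runtime image so TF can load the cuDNN libraries
task.set_base_docker("nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04")
```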
TrickyRaccoon92
I guess elegant is the challenge 🙂
What exactly is the use case?
Right, I think the naming is a by-product of Hydra / TB
I guess it's on me to check whether this slowdown is negligible or not
Usually the performance impact is negligible, especially with GPU workloads
But if you really want the best:
Add --security-opt seccomp=unconfined to the extra_docker_arguments
See details:
https://betterprogramming.pub/faster-python-in-docker-d1a71a9b9917
Probably less secure though :)
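In clearml.conf that would look roughly like this (a sketch of the relevant agent section):
```
agent {
    # arguments appended to the "docker run" command line of every task container
    extra_docker_arguments: ["--security-opt", "seccomp=unconfined"]
}
```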
Where are you seeing this message?
I still think the issue is getting boto3 credentials
It might be the case
Are you using clearml-agent or are you running it manually?
Hi @<1720249416255803392:profile|IdealMole15>
I'm assuming you mean on a remote machine with clearml-agent running?
If you do, then you either use clearml-task to create a Task (Job) and specify the container and script, or click on "Create New Experiment" in the UI, fill out the git repo / script, and specify the docker image.
Make sense?
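For the clearml-task route, roughly like this (a sketch; repo, script, image, and queue are placeholders):
```bash
clearml-task --project examples --name remote-job \
  --repo https://github.com/user/repo.git --branch main \
  --script train.py \
  --docker nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 \
  --queue default
```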
HelplessCrocodile8 I just tried it, everything seems to work (ubuntu 20.04) 🙂
What OS are you using? Python version? Is it conda?
Hi UpsetBlackbird87
I might be wrong, but it seems like ClearML does not monitor GPU pressure when deploying a task to a worker, but rather relies only on its configured queues.
This is mostly accurate. The way the agent works is that you allocate a resource for the agent (specifically a GPU), then set the queues (plural) it listens to (by default in priority order). Each agent then individually pulls jobs and runs them on its allocated GPU.
If I understand you correctly, you want multiple ...
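For example, a single agent bound to GPU 0 and listening on two queues in priority order (queue names are placeholders):
```bash
# jobs are pulled from "high_priority" first, then from "default"
clearml-agent daemon --gpus 0 --queue high_priority default
```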
Also, btw, is this supposed to be a screenshot from the community version?
Hmm, seems like a screenshot from an enterprise version, I'll ask them to update 🙂
I am also not understanding how clearml-serving does the versioning for models in Triton.
Basically you have two Tasks: one is the "controller", which checks for model changes and updates itself.
The other is the engine, which checks on the "controller" Task to see which models it needs to download/configure, and replaces them.
This way you can ha...
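Registering a new model version with the serving service looks roughly like this (a sketch based on the clearml-serving examples; the service ID, endpoint, and tensor shapes are placeholders):
```bash
clearml-serving --id <serving_service_id> model add \
  --engine triton --endpoint "my_model" \
  --name "training experiment" --project "examples" \
  --input-name "INPUT__0" --input-type float32 --input-size 1 28 28 \
  --output-name "OUTPUT__0" --output-type float32 --output-size -1 10
```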
IrritableJellyfish76 point taken, suggestions on improving the interface?
CooperativeFox72 yes, 20 experiments in parallel means you always have at least 20 connections coming from different machines, and then you have the UI adding on top of that. I'm assuming the sluggishness you feel is the requests being delayed.
You can configure the API server to have more process workers; you just need to make sure the machine has enough memory to support them.
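If you are on the docker-compose deployment, the knobs are environment variables on the apiserver service. A sketch (the variable names here are an assumption, double-check them against your server version):
```yaml
# docker-compose override for the apiserver service (a sketch;
# CLEARML_USE_GUNICORN / CLEARML_GUNICORN_WORKERS are assumed names)
apiserver:
  environment:
    CLEARML_USE_GUNICORN: "1"
    CLEARML_GUNICORN_WORKERS: "8"  # more workers -> more memory used
```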
RobustSnake79 I have not tested, but I suspect that currently all the reports will stay in TB and will not be passed automagically into ClearML.
It seems like something you would actually want to do with TB (i.e. drill into the graphs etc.), no?
(the payload is not in the correct form, can that be a problem?)
It might, but I assume you will get a different error
RobustGoldfish9 I see.
So in theory, spinning up an experiment on an agent would be: clone code -> build docker -> mount code -> execute code inside docker?
(no need for requirements etc.?)
What do you mean by a custom queue ?
In the Queues page you have a plus button; this will just create a new queue.
Can you verify this example is not working for you?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/hydra/hydra_example.py
Thanks SmallDeer34 !
Would you like us to? How about a footnote/acknowledgement?
How about a reference / footnote?
@misc{clearml,
  title = {ClearML - Your entire MLOps stack in one open-source tool},
  year = {2019},
  note = {Software available from },
  url = { },
  author = {allegro.ai},
}
Yes, I think you are correct, verified on Firefox & Chrome. I'll make sure to pass it along.
Thanks SteadyFox10 !