
Hi TrickyRaccoon92
Tkinter is suddenly used as the backend, and instead of writing to the dashboard I get popups per figure.
Are you running with an agent or manually executing the code?
Correct (basically pip freeze results)
Oh, then no, you should probably do the opposite 🙂
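(Side note on the popup windows, a minimal sketch that is not from this thread: when the script is executed manually you can force a non-interactive matplotlib backend before importing pyplot; the popups disappear, and ClearML's automatic matplotlib logging should still pick the figures up, though that part is worth verifying on your setup.)
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no Tkinter popup windows
import matplotlib.pyplot as plt

plt.plot([1, 2, 3])
plt.show()  # no window opens; the figure can still be captured by the experiment logger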
What is the flow like now? (meaning what are you using Kubeflow for and how)
Could you clarify the question for me, please?
...
Could you please point me to the piece of ClearML code related to the downloading process?
I think I mean this part:
https://github.com/allegroai/clearml/blob/e3547cd89770c6d73f92d9a05696018957c3fd62/clearml/datasets/dataset.py#L2134
Okay, this seems to be the problem
Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Actually I am as well; this is Kubernetes doing the resource scheduling, and Kubernetes decided it is okay to run two pods on the same GPU, which is cool, but I was not aware NVIDIA had already added this feature (I know it was in beta for a long time).
https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
I also see they added dynamic slicing and Memory Protection:
Notice you can control ...
I think so (you can also comment out the Task.init() just to verify this is not a clearml issue)
GiddyTurkey39 do you have an experiment with the jupyter notebook ?
Hi UpsetBlackbird87
I might be wrong, but it seems like ClearML does not monitor GPU pressure when deploying a task to a worker, but rather relies only on its configured queues.
This is kind of accurate. The way the agent works is that you allocate a resource to the agent (specifically a GPU), then set the queues (plural) it listens to (by default priority ordered). Then each agent individually pulls jobs and runs them on its allocated GPU.
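For reference, a hedged example (queue names are placeholders) of binding an agent to GPU 0 with two priority-ordered queues:
clearml-agent daemon --gpus 0 --queue high_priority default
The agent pulls from high_priority first and falls back to default when it is empty.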
If I understand you correctly, you want multiple ...
Using the Dataset.create command and the subsequent add_files and upload commands, I can see the upload action as an experiment, but the data is not seen in the Datasets webpage.
ScantCrab97 it might be that you need the latest clearml package installed on the client end (as well as the new server with the UI).
What is your clearml package version ?
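For reference, the flow I would expect looks roughly like this, a sketch with placeholder names (the finalize() call at the end is my assumption as another common reason a dataset never shows up as completed):
from clearml import Dataset

ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")  # placeholder names
ds.add_files(path="./data")   # stage local files
ds.upload()                   # upload the staged files
ds.finalize()                 # commit the dataset so it is marked completed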
ImmensePenguin78 this is probably for a different python version ...
ShortElephant92 yep, this is definitely an enterprise feature 🙂
But you can configure user/pass on the open source version, and even store the passwords hashed if you need.
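Roughly, the relevant section in the server's apiserver.conf looks like this, a sketch from memory of the docs (the key names, especially pass_hashed, are assumptions to verify against your server version):
auth {
    fixed_users {
        enabled: true
        # pass_hashed: true   # assumption: store hashed passwords instead of plain text
        users: [
            { username: "jane", password: "12345678", name: "Jane Doe" }
        ]
    }
}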
ThickDove42 you need the latest clearml-agent RC for the docker setup script (next version due next week):
pip install clearml-agent==0.17.3rc0
Hi TrickyRaccoon92
Are you sure plotly (the front-end module displaying the plots in the UI) supports it ?
Ohh yes, if you deleted the token then you have to recreate the clearml.conf
BTW: no need to generate a token, it will last 🙂
Hi @<1657918706052763648:profile|SillyRobin38>
You mean remove the entire serving session? is it still running somewhere ?
(for example, if you take the docker-compose stack down, it will be marked aborted automatically after 2 hours)
JitteryCoyote63 any chance you have a log of the failed torch 1.7.0 ?
You might need to play around a bit; it might be that StorageHelper.get('gs://bucket') and then helper.list('folder/*') is what you need.
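Something along these lines, a sketch (StorageHelper is an internal helper, so double-check the exact import path and signatures):
from clearml.storage.helper import StorageHelper

helper = StorageHelper.get("gs://bucket")   # helper bound to the GCS bucket
entries = helper.list("folder/*")           # list the objects under the folder
print(entries)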
Let me know what worked 🙂
Hi IntriguedRat44
You can make it log offline (i.e. into a local folder/zip) by calling:
Task.set_offline(True)
You can also set the environment variable:
TRAINS_OFFLINE_MODE=1
You could also just skip the Trains.init call 🙂
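Putting it together, a minimal sketch with the current clearml package (the trains calls are analogous; project/task names are placeholders, and Task.import_offline_session is worth verifying in your version):
from clearml import Task

Task.set_offline(True)   # must be called before Task.init()
task = Task.init(project_name="examples", task_name="offline run")
# ... training code: everything is written to a local zip instead of the server ...
# later, from a connected machine:
# Task.import_offline_session("/path/to/offline_session.zip")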
Does that help?
I should mention this is run within a TF v1 session context
This should not be connected.
everything gets stored as intended (to the ClearML dashboard)
So in Jupyter it works, but from the command line it does not? What's the difference?
IrritableJellyfish76 if this is the case, my question is what is the reason to use Kubeflow? (spinning up JupyterLab servers is a good answer, for example; pipelines, in my opinion, a lot less so)
DefeatedMoth52 how many agents do you have running on the same GPU ?
another option is the download fails (i.e. missing credentials on the client side, i.e. clearml.conf)
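For example, the S3 credentials section in clearml.conf would look roughly like this, a sketch with placeholder values (adjust for your storage provider, e.g. google.storage for GS):
sdk {
    aws {
        s3 {
            credentials: [
                {
                    bucket: "my-bucket"          # placeholder
                    key: "ACCESS_KEY_ID"         # placeholder
                    secret: "SECRET_ACCESS_KEY"  # placeholder
                }
            ]
        }
    }
}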
The class documentation itself is also there under "References" -> "Trains Python Package"
Notice that due to a bug in the documentation (we are working on a fix) the reference part is not searchable in the main search bar
I was unable to reproduce, but I added a few safety checks. I'll make sure they are available on master in a few minutes; could you maybe rerun after?
SarcasticSparrow10 LOL there is a hack around it 🙂
Run your code with python -O, which basically skips over all assertion checks.
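A tiny illustration of what -O does:
# assert_demo.py
assert 1 == 2, "normally this raises AssertionError"
print("__debug__ is", __debug__)

# python assert_demo.py     -> AssertionError
# python -O assert_demo.py  -> prints "__debug__ is False" (the assert above was skipped)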
Check the log, the container has torch 1.13.0 but the task requires torch==1.13.1
Now, the torch package inside those NVIDIA prepackaged containers is compiled a bit differently. What I suspect happens is that the torch wheel from PyTorch is not compatible with this container. Easiest fix: change the task requirements to torch 1.13.0 (the version already inside the container).
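If you prefer doing it from code rather than editing the installed packages of the task in the UI, a hedged sketch (project/task names are placeholders; Task.add_requirements has to run before Task.init):
from clearml import Task

Task.add_requirements("torch", "1.13.0")  # pin the version the container already ships
task = Task.init(project_name="examples", task_name="train")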
Wdyt ?