I always have my notebooks in a git repo, but suddenly it's not running them correctly.
What do you mean?
Can I switch off git diff (change detection)?
Yes, Task.init(..., auto_connect_frameworks={"detect_repository": False})
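For reference, a minimal sketch of that call (project/task names are placeholders, and the key name is the one quoted above, so double-check it against your SDK version):
from clearml import Task

# Disable repository / uncommitted-changes detection for this task
task = Task.init(
    project_name="examples",        # placeholder
    task_name="no repo detection",  # placeholder
    auto_connect_frameworks={"detect_repository": False},
)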
hmm DeliciousKoala34
what are you getting if you put this at the top of your code (the one you are running in the remote docker)?
import os
print([(k, os.environ[k]) for k in os.environ if k.startswith("CLEARML_")])
Hi JitteryCoyote63
Show running experiments
It doesn't?
Have the legend clickable, to hide/show experiments based on their status
:+1:
Have a line connecting points that are SOTA (example in https://paperswithcode.com/sota/image-generation-on-cifar-10 )
I like that, how is that selected? (I know FE are thinking of replacing this entire graph library, so maybe good timing in terms of what to look at)
Hi ShallowArcticwolf27
First of all:
If the answer to number 2 is no, I'd loveee to write a plugin.
Always appreciated ❤
Now actually answering the Q:
Any torch.save (or any other framework save) will either register or automatically upload the file (or folder) in the system. If this is a folder it will be zipped and uploaded; if a file, it is just uploaded to the assigned storage output (the clearml-server, any object storage service, or a shared folder). I'm not actually sure I...
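To make that concrete, a minimal sketch (project/task names are placeholders; set output_uri if you want the file uploaded to remote storage rather than only registered):
from clearml import Task
import torch

# With a Task initialized, framework save calls are captured automatically
task = Task.init(project_name="examples", task_name="model auto-upload")  # placeholder names

model = torch.nn.Linear(4, 2)
# This save is picked up by ClearML and the file is registered
# (or uploaded to the configured storage output)
torch.save(model.state_dict(), "model.pt")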
can i run it on an agent that doesn't have gpu?
Sure this is fully supported
when I run clearml-serving it throws me an error "please provide specific config.pbtxt definition"
Yes, this is a small file that tells the Triton server how to load the model:
Here is an example:
https://github.com/triton-inference-server/server/blob/main/docs/examples/model_repository/inception_graphdef/config.pbtxt
How come the second one is one line?
Hi GracefulDog98
As UnevenDolphin73 pointed out, you might be looking for https://clear.ml/docs/latest/docs/references/sdk/task#execute_remotely
Which will stop the current local process, and enqueue the task on the "default" queue, for the agent to execute.
Is this what you are looking for?
The idea is you can run your code once in "development" mode, so you know everything is working, then from the UI (or programmatically) you can clone the experiment, edit the configuration (or anythin...
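A minimal sketch of that flow (queue and names are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="remote execution")  # placeholder names

# Stops the local process and enqueues this task on the "default" queue,
# for a clearml-agent listening on that queue to pick up and execute
task.execute_remotely(queue_name="default", exit_process=True)

# Anything below this line only runs when the agent executes the task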
Hi ZippyAlligator65
You can configure it in the clearml.conf, see here:
https://github.com/allegroai/clearml-agent/blob/ebb955187dea384f574a52d059c02e16a49aeead/clearml_agent/backend_api/config/default/agent.conf#L202
(torchvision vs. cuda compatibility, will work on that),
The agent will pull the correct torch based on the cuda version that is available at runtime (or configured via the clearml.conf)
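For example, a rough sketch of the clearml.conf override (key names from memory, so verify them against the agent.conf reference linked above):
agent {
    # force the CUDA / cuDNN versions the agent resolves torch against,
    # instead of auto-detecting them at runtime
    cuda_version: 11.2
    cudnn_version: 8.0
}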
What are you seeing in the Task that was cloned (i.e. the one the HPO created, not the original training task)?
by that I mean, in the configuration section, do you have the Args there? (seems like the pic you attached, but I just want to make sure)
Also in the train.py file, do you also have Task.init ?
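For reference, a minimal sketch of the train.py side (placeholder names); with Task.init in place, the argparse arguments are what end up under the Args section:
import argparse
from clearml import Task

# Task.init is what enables the auto-logging discussed above
task = Task.init(project_name="examples", task_name="train")  # placeholder names

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--epochs", type=int, default=10)
args = parser.parse_args()  # picked up and shown under the "Args" section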
Interesting...
We could follow up on the .env configuration, and allow clearml-task to add configuration files from the cmd line. This will be relatively easy to add. We could expand the Environment support (that somewhat exists), and add the ability to read variables from .env and add them to a "hyperparameter" section, named Environment. wdyt?
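Roughly what that could look like if done by hand today (just a sketch; the .env file name and the "Environment" section name are the ones from the suggestion above):
from clearml import Task

task = Task.init(project_name="examples", task_name="env as hyperparameters")  # placeholder names

# Read a .env file and log its variables under an "Environment" section
env_vars = {}
with open(".env") as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            env_vars[key] = value

task.connect(env_vars, name="Environment")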
No they are not; they take the vscode backend and put it behind a webserver-ish layer
- I'm happy to hear you found a workaround
- Seems like there is something wrong with the way the pbtxt is being merged, but I need some more information
{'detail': "Error processing request: object of type 'NoneType' has no len()"}
Where are you seeing this error?
What are you seeing in the docker-compose log?
SoreDragonfly16 In the Hyperparameters tab, you have "parallel coordinates" (next to the "add experiment" button there is a button saying "values"; press it and there should be "parallel coordinates")
Is that it?
Thanks!
Hmm, from here: None
Could it be you do not have privileges to the resource, or that you did not provide credentials?
Did that autoscaler work before?
Seems like a Task contained an invalid artifact link.
I wouldn't sweat over it, it's basically a warning that it could not locate the actual file to delete (albeit an ugly warning 🙂 )
I think AnxiousSeal95 would know when will the new version be ready.
Regardless, is it actually deleting old Tasks?
Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Actually I am as well. This is Kubernetes doing the resource scheduling, and Kubernetes decided it is okay to run two pods on the same GPU, which is cool, but I was not aware Nvidia already added this feature (I know it was in beta for a long time)
https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
I also see they added dynamic slicing and Memory Protection:
Notice you can control ...
Could it be pandas was not installed on the local machine?
actually no
hmm, are those packages correct?
I think the main issue is running with python -m module.name --args
Which is a bit different when trying to "understand" what the actual repository is.
Can you try to run it from the repository folder (same command), just to see if it will have any effect on the detected packages?
BTW: how is it missing torch in the listing? Do you have "import torch" in the code?
BTW:
Error response from daemon: cannot set both Count and DeviceIDs on device request.
Googling it points to a docker issue (which makes sense considering):
https://github.com/NVIDIA/nvidia-docker/issues/1026
What is the host OS?
Okay, I'll make sure we always quote ", since it seems to work either way.
We will release an RC soon, with this fix.
Sounds good?