Interesting idea! (I assume for reporting only, not configuration)
Yes, for reporting only - also to understand which version is used by the agent to decide which torch wheel to download.
Regarding the CUDA check with nvcc: I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from the nvidia-smi interface, worth checking though ...
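For reference, a rough sketch of how the two checks could look (the subprocess calls and regexes are my own guesses, not how clearml-agent actually implements it):

```python
import re
import subprocess

def cuda_version_from_nvcc():
    """Parse the toolkit version out of `nvcc --version` output."""
    try:
        out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        return None  # nvcc not installed / not on PATH
    m = re.search(r"release (\d+\.\d+)", out)
    return m.group(1) if m else None

def cuda_version_from_nvidia_smi():
    """Recent drivers print 'CUDA Version: X.Y' in the nvidia-smi header.
    Note this is the highest CUDA version the driver supports, not necessarily
    the toolkit that is actually installed."""
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        return None
    m = re.search(r"CUDA Version:\s*([\d.]+)", out)
    return m.group(1) if m else None

print(cuda_version_from_nvcc() or cuda_version_from_nvidia_smi())
```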
Ok, but when nvcc is not ava...
Here is the data disk (/opt/clearml) on the left and the OS disk on the right
I've reindexed the data for the logs; the mappings are now correct, but I am missing one month of data. I have literally no idea where this data is or how it disappeared.
When can we expect the next self-hosted release, btw?
Yes, perfect!!
And after the update, the loss graph appears
So it could be that when restarting the docker-compose, it used another volume, hence the loss of data.
AgitatedDove14 https://clear.ml/docs/latest/docs/apps/clearml_session/#running-in-docker in the docs there is a --docker option; that's what confuses me, since the agent should always run in docker mode.
I am using 0.17.5; it could be either a bug in ignite or indeed a delay in the send. I will try to build a simple reproducible example to understand the cause.
My use case is: on a spot instance marked by AWS for termination in 2 minutes, I want to close the task and prevent the clearml-agent from picking up a new task afterwards.
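Roughly what I have in mind for the termination handler (a minimal sketch; it assumes the task runs under a local clearml-agent daemon and that mark_stopped() plus `clearml-agent daemon --stop` are enough for this):

```python
import subprocess
from clearml import Task

def on_spot_termination_notice():
    # Mark the currently running task as stopped so the server knows it did
    # not just die silently with the instance.
    task = Task.current_task()
    if task is not None:
        task.mark_stopped()

    # Ask the locally running agent daemon to stop, so it does not pull
    # another task in the ~2 minutes before AWS reclaims the instance.
    subprocess.run(["clearml-agent", "daemon", "--stop"], check=False)
```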
Hi AgitatedDove14 , Here is the full log.
Both python versions (local and remote) are python 3.6. Locally (macOS), I get pytorch3d== (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0, 0.4.0, 0.5.0). Remotely (Ubuntu), I get (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0). So I guess it's not related to clearml-agent really, rather pip that cannot find the proper wheel for Ubuntu for the latest versions of pytorch3d, right? If yes, is there a way to build the wheel on the remote machine...
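One thing I might try (just a guess, assuming the remote Ubuntu machine has the CUDA/C++ toolchain needed to compile it): have pip build pytorch3d straight from the GitHub sources instead of looking for a pre-built wheel, e.g.:

```python
import subprocess
import sys

# Build pytorch3d from the GitHub sources on the remote machine instead of
# looking for a pre-built wheel on PyPI (the tag pin is just an example).
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "git+https://github.com/facebookresearch/pytorch3d.git@v0.5.0",
])
```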
Probably 6. I think that for some reason it did not go back to the main trains-agent. Nevertheless I am not sure, because a second task could start. It could also be that the second one was aborted for some reason while installing the task requirements (not the system requirements, i.e. executing the trains-agent setup within the docker container) and therefore again it couldn't go back to the main trains-agent. But ps -aux shows that the trains-agent is stuck running the first experiment, not the second...
I'll definitely check that out!
Hey FriendlySquid61,
I ended up asking for full control of EC2 so as not to be blocked, so unfortunately I cannot give you a more precise list.
Thanks! I would like to use this opportunity to split the indices into multiple shards, as explained here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html#indices-split-index
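For example, a sketch of the split call against one of the event indices (the endpoint, index names and shard count are placeholders for my setup):

```python
import requests

ES = "http://localhost:9200"                   # placeholder for the ES endpoint
SOURCE = "events-training_stats_scalar-old"    # placeholder index names
TARGET = "events-training_stats_scalar-new"

# 1. The split API requires the source index to be write-blocked first
requests.put(
    f"{ES}/{SOURCE}/_settings",
    json={"settings": {"index.blocks.write": True}},
).raise_for_status()

# 2. Split into a new index; the target shard count must be a multiple of the source's
requests.post(
    f"{ES}/{SOURCE}/_split/{TARGET}",
    json={"settings": {"index.number_of_shards": 4}},
).raise_for_status()
```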
ProxyDictPostWrite._to_dict() will recursively convert it to a plain dict, and OmegaConf will not complain then.
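i.e. something along these lines (a minimal sketch; the project/task names and the config dict are made up):

```python
from clearml import Task
from omegaconf import OmegaConf

task = Task.init(project_name="demo", task_name="omegaconf-roundtrip")
connected = task.connect({"lr": 0.001, "batch_size": 32})  # returns a ProxyDictPostWrite

# OmegaConf.create() rejects the proxy itself, but accepts the plain dict
cfg = OmegaConf.create(connected._to_dict())
print(cfg.lr)
```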
btw CostlyOstrich36, I can see in Profile > Version: 1.1.1-135 • 1.1.1 • 2.14. What do these numbers correspond to?
I am using pip as a package manager, but I start the trains-agent inside a conda env.
Very good job! One note: in this version of the web-server, the experiment type logos are all blank; what was the reason to change them? Having a color code in the logos helps a lot to quickly check the nature of the different experiment tasks, doesn't it?
From my experience, I only installed CUDA drivers on my machines. I didn't use conda to install torch nor cudatoolkit, I just let clearml-agent download the torch wheel file and install it.
Hey @<1523701205467926528:profile|AgitatedDove14>, actually I just realised that I was confused by the fact that when the task is reset, because of the sorting it disappears, making it seem like it was deleted. I think it's a UX issue. When I click on reset:
- The popup shows "Deleting 100%"
- The task disappears from the list of tasks because of the sorting
This led me to think that there was a bug and the task was deleted.
What is the latest RC of clearml-agent? 1.5.2rc0?
SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but only deleted ~20M documents out of 320M documents in the events-training_debug_image-xyz index. I would now like to understand which experiments contain most of the documents, so I can delete them. I would like to aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?
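Something like a terms aggregation is what I had in mind (a sketch using requests; the endpoint and the "task" field name are assumptions about the mapping):

```python
import requests

ES = "http://localhost:9200"                  # assumption: default ES endpoint
INDEX = "events-training_debug_image-*"       # all debug-image event indices

query = {
    "size": 0,
    "aggs": {
        "docs_per_task": {
            # assumption: the event documents carry the experiment id in a "task" field
            "terms": {"field": "task", "size": 50, "order": {"_count": "desc"}}
        }
    },
}
resp = requests.post(f"{ES}/{INDEX}/_search", json=query)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["docs_per_task"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```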
Same, it also returns a ProxyDictPostWrite, which is not supported by OmegaConf.create
The part where I'm lost is: why would you need the path to the temp venv the agent creates/uses?
Let's say my task is calling a bash script, and that bash script is calling another python program; I want that last python program to be executed with the environment that was created by the agent for this specific task.
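For example, something along these lines in the task's entry script (the script names are made up; the point is just passing sys.executable down to the child processes):

```python
import os
import subprocess
import sys

# sys.executable is the interpreter of the venv the agent built for this task,
# so expose it to child processes instead of relying on whatever "python" is on PATH.
env = dict(os.environ, TASK_PYTHON=sys.executable)

# run_pipeline.sh is a made-up name; inside it, call: "$TASK_PYTHON" other_program.py
subprocess.run(["bash", "run_pipeline.sh"], env=env, check=True)
```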
And if you need a very small change, you can also simply monkey-patch it: https://www.geeksforgeeks.org/monkey-patching-in-python-dynamic-behavior/
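A toy example of the idea (patching time.time here just to illustrate; nothing ClearML-specific):

```python
import time

_original_time = time.time

def fake_time():
    # always return a fixed timestamp, e.g. to make a test deterministic
    return 1_000_000.0

time.time = fake_time       # the "patch": swap the function at runtime
print(time.time())          # 1000000.0
time.time = _original_time  # restore the original when done
```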