So the controller task finished and now only the second trains-agent services mode process is showing up as registered. So this is definitely something linked to the switching back to the main process.
btw CostlyOstrich36 , in Profile I can see Version: 1.1.1-135 • 1.1.1 • 2.14 . What do these numbers correspond to?
I am using pip as a package manager, but I start the trains-agent inside a conda env 😄
Very good job! One note: in this version of the web-server, the experiment type logos are all blank. What was the reason for changing them? Having a color code in the logos helps a lot to quickly check the nature of the different experiment tasks, doesn't it?
From my experience, I only installed the CUDA drivers on my machines. I didn't use conda to install torch or cudatoolkit; I just let clearml-agent download the torch wheel file and install it.
Hey @<1523701205467926528:profile|AgitatedDove14> , actually I just realised that I was confused by the fact that when the task is reset, it disappears because of the sorting, making it seem like it was deleted. I think it's a UX issue: when I click on reset,
- The popup shows "Deleting 100%"
- The task disappears from the list of tasks because of the sorting
This led me to think that there was a bug and the task was deleted.
What is the latest RC of clearml-agent? 1.5.2rc0?
SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but that only removed ~20M documents out of 320M in the events-training_debug_image-xyz index. I would now like to understand which experiments contain most of the documents, in order to delete them. I would like to aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?
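For context, here is a rough sketch of the kind of aggregation I have in mind: a terms aggregation over the task ID field of the event documents. The assumptions that each document carries a "task" keyword field and that ES is reachable on localhost:9200 are mine, not something I verified.
```python
# Hypothetical sketch: count event documents per task (experiment) with an
# Elasticsearch terms aggregation, assuming the documents have a "task" field.
import requests

ES_URL = "http://localhost:9200"                  # assumed ES endpoint
INDEX = "events-training_debug_image-xyz"         # index name mentioned above

query = {
    "size": 0,  # we only want the aggregation buckets, not the documents
    "aggs": {
        "docs_per_task": {
            "terms": {"field": "task", "size": 50}  # top 50 tasks by doc count
        }
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query)
resp.raise_for_status()

for bucket in resp.json()["aggregations"]["docs_per_task"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```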
Same, it also returns a ProxyDictPostWrite, which is not supported by OmegaConf.create
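A minimal workaround sketch, assuming the proxy object behaves like a regular mapping: recursively copy it into plain Python containers before handing it to OmegaConf.create. The project/task names and the config values below are placeholders.
```python
from clearml import Task
from omegaconf import OmegaConf

def to_plain(obj):
    """Recursively convert proxy dicts/lists into plain dicts/lists."""
    if isinstance(obj, dict):
        return {k: to_plain(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(v) for v in obj]
    return obj

task = Task.init(project_name="examples", task_name="omegaconf demo")  # placeholder names
params = task.connect({"lr": 0.001, "batch_size": 32})  # may come back as a ProxyDictPostWrite
cfg = OmegaConf.create(to_plain(params))                # a plain dict is accepted by OmegaConf
```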
The part where I'm lost is why would you need the path to the temp venv the agent creates/uses ?
Let's say my task is calling a bash script, and that bash script is calling another python program; I want that last python program to be executed with the environment that was created by the agent for this specific task (see the sketch below).
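A minimal sketch of what I mean, assuming the task's entry script itself already runs inside the venv the agent created: pass that interpreter down to the bash script through an environment variable, so the nested python program reuses it. The variable name and script name are placeholders.
```python
import os
import subprocess
import sys

env = os.environ.copy()
env["TASK_PYTHON"] = sys.executable  # interpreter of the agent-created venv (hypothetical variable name)

# run_step.sh is a placeholder for the bash script; inside it, the nested
# program would be launched with:  "$TASK_PYTHON" other_program.py
subprocess.run(["bash", "run_step.sh"], env=env, check=True)
```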
And if you need a very small change, you can also simply monkey-patch it ( https://www.geeksforgeeks.org/monkey-patching-in-python-dynamic-behavior/ )
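As a toy illustration of the idea (not ClearML-specific): replace a function on an already-imported module at runtime, keeping a reference to the original so the patch only adds behaviour.
```python
import json

_original_dumps = json.dumps

def patched_dumps(obj, **kwargs):
    kwargs.setdefault("indent", 2)       # small behaviour change on top of the original
    return _original_dumps(obj, **kwargs)

json.dumps = patched_dumps               # every caller now goes through the patched version
print(json.dumps({"a": 1}))
```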
region is empty, I never entered it and it worked
SuccessfulKoala55 I am using ES 7.6.2
Sure! Here are the relevant parts:
```
...
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
```
I will probably just use an absolute path everywhere, to be robust against different machine user accounts: /home/user/trains.conf
AgitatedDove14 If I explicitly call task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0) , this logs one value per process as expected, so reporting works.
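For reference, a minimal sketch of that explicit per-process reporting, assuming a torch.distributed-style launcher that passes --local_rank to each process; the project and task names are placeholders, and in the real run the task would already have been initialized by the training script.
```python
import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
parse_args = parser.parse_args()

task = Task.init(project_name="examples", task_name="distributed reporting")  # placeholder names

# title="test", series=<rank>, value=1.0, iteration=0:
# each process reports under its own series, so one value shows up per process
task.get_logger().report_scalar("test", str(parse_args.local_rank), 1.0, 0)
```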
Thanks SuccessfulKoala55 ! So CLEARML_NO_DEFAULT_SERVER=1 is set by default, right?
From the answers I saw on the internet, it is most likely related to a mismatch between the CUDA and cuDNN versions.
in the UI the value is the correct one (not empty, a string)
Selecting multiple lines still works; you need to shift + click on the checkbox.
DeterminedCrab71 Please check this screen recording
I just checked if something changed in https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_server_config.html#web-login-authentication
I am also interested in the clearml-serving part 😄
The weird thing is that the second experiment started immediately, correctly in a docker container, but failed with User aborted: stopping task (3) at some point (while installing the packages). The error message is surprising since I did not do anything. And then all following experiments are queued to the services queue and stuck there.
mmmh there is no closing of the task happening at that point. Note that just before task.upload_artifact, I call task.logger.report_table("Metric summary", "Metric summary", 0, df_scores) , if that matters.
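For context, a minimal sketch of that call order, where df_scores is assumed to be a pandas DataFrame; the project/task names, artifact name, and values are placeholders.
```python
import pandas as pd
from clearml import Task

task = Task.init(project_name="examples", task_name="metric summary")  # placeholder names
df_scores = pd.DataFrame({"metric": ["accuracy", "f1"], "value": [0.91, 0.88]})  # placeholder values

# report the table just before uploading the artifact; neither call closes the task
task.logger.report_table("Metric summary", "Metric summary", 0, df_scores)
task.upload_artifact("scores", artifact_object=df_scores)
```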
It broke holding shift to select multiple experiments, btw.