SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but that only deleted ~20M documents out of 320M in the events-training_debug_image-xyz index. I would now like to understand which experiments contain most of the documents, so I can delete them. I would like to aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?
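A minimal sketch of such an aggregation, assuming ES is reachable on localhost:9200 and the experiment id is stored in a keyword field named "task" (both the host and the field name are assumptions to adjust to your deployment):

```python
import requests

# Count documents per experiment (task) in the debug-image events index
# using a terms aggregation. "task" as the field name is an assumption.
query = {
    "size": 0,
    "aggs": {
        "docs_per_experiment": {
            "terms": {"field": "task", "size": 100}  # top 100 experiments by doc count
        }
    },
}

resp = requests.post(
    "http://localhost:9200/events-training_debug_image-xyz/_search",
    json=query,
)
for bucket in resp.json()["aggregations"]["docs_per_experiment"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```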
Same, it also returns a ProxyDictPostWrite, which is not supported by OmegaConf.create
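One possible workaround, just a sketch and not an official API: convert the proxied mapping back into plain dicts before handing it to OmegaConf (this assumes the proxy iterates like a regular mapping):

```python
from omegaconf import OmegaConf

def to_plain_dict(obj):
    """Recursively convert a (possibly proxied) mapping/sequence into plain
    Python dicts/lists so OmegaConf.create() accepts it."""
    if isinstance(obj, dict):
        return {k: to_plain_dict(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain_dict(v) for v in obj]
    return obj

# config = task.connect(my_config_dict)   # returns a ProxyDictPostWrite
# cfg = OmegaConf.create(to_plain_dict(config))
cfg = OmegaConf.create(to_plain_dict({"lr": 0.1, "model": {"depth": 3}}))
print(cfg.model.depth)
```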
The part where I'm lost is why would you need the path to the temp venv the agent creates/uses ?
let's say my task is calling a bash script, and that bash script is calling another python program. I want that last python program to be executed with the environment that was created by the agent for this specific task
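A small sketch of one way to do this, under the assumption that the task script itself already runs inside the agent-created venv, so sys.executable points at that venv's interpreter; "run_downstream.sh" is a hypothetical script name:

```python
import subprocess
import sys

# The task is executed by the agent inside the venv it created, so
# sys.executable is that venv's Python. Pass it to the bash script so it
# can launch the nested Python program with the same environment, e.g. the
# script would call:  "$1" nested_program.py
subprocess.check_call(["bash", "run_downstream.sh", sys.executable])
```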
And if you need a very small change, you can also simply monkey-patch it ( https://www.geeksforgeeks.org/monkey-patching-in-python-dynamic-behavior/ )
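For context, a tiny generic monkey-patching example (not ClearML-specific), just to illustrate the technique of swapping an attribute at runtime:

```python
import math

# Keep a reference to the original function, then replace it with a wrapper.
_original_sqrt = math.sqrt

def logged_sqrt(x):
    print(f"sqrt called with {x}")
    return _original_sqrt(x)

math.sqrt = logged_sqrt

print(math.sqrt(9.0))  # prints the log line, then 3.0
```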
region is empty, I never entered it and it worked
SuccessfulKoala55 I am using ES 7.6.2
Sure! Here are the relevant parts:
```
...
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
```
I will probably just use an absolute path everywhere to be robust against different machine user accounts: /home/user/trains.conf
AgitatedDove14 If I explicitly call task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0), this logs one value per process as expected, so reporting works
From the answers I saw on the internet, it is most likely related to a mismatch of the CUDA/cuDNN versions
in the UI the value is the correct one (not empty, a string)
selecting multiple lines still works, you need to shift + click on the checkbox
DeterminedCrab71 Please check this screen recording
I just checked if something changed in https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_server_config.html#web-login-authentication
I am also interested in the clearml-serving part
The weird thing is that the second experiment started immediately, correctly in a docker container, but failed with User aborted: stopping task (3) at some point (while installing the packages). The error message is surprising since I did not do anything. And then all the following experiments are queued to the services queue and stuck there.
mmmh there is no closing of the task happening at that point. Note that just before the task.upload_artifact, I call task.logger.report_table("Metric summary", "Metric summary", 0, df_scores), if that matters
It broke holding shift to select multiple experiments, btw
If I remove security_group_ids and just leave subnet_id in the configuration, it is not taken into account (the instances are created in the default subnet)
(Even if I explicitly do my_task.close())
So in my use case each step would create a folder (potentially big) and store it as an artifact. The last step should "merge" all the previous folders. The idea is to split the work among multiple machines (in parallel). I would like to avoid these potentially big folder artifacts also being stored in the pipeline task, because that one will be running on the services queue in the clearml-server instance, which will definitely not have enough space to handle all of them
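A rough sketch of that idea, assuming each worker step uploads its output folder as an artifact on its own task and the merge step receives the step task ids as a parameter (paths, artifact names and the id list are illustrative placeholders):

```python
from clearml import Task

# --- in each worker step ---
step_task = Task.current_task()
# Passing a directory path stores it as a (zipped) folder artifact on the
# step task itself, so the heavy data does not live on the pipeline task.
step_task.upload_artifact(name="out_folder", artifact_object="/data/step_output")

# --- in the final "merge" step ---
step_task_ids = ["<task_id_1>", "<task_id_2>"]  # placeholders, passed in as a parameter
local_copies = []
for tid in step_task_ids:
    t = Task.get_task(task_id=tid)
    # Download each step's folder artifact locally before merging.
    local_copies.append(t.artifacts["out_folder"].get_local_copy())
```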
I will try with that and keep you updated
Sure yes! As you can see I just added the block
```
logging:
  driver: "json-file"
  options:
    max-size: "200k"
    max-file: "10"
```
to all services. Also in this docker-compose I removed the external binding of the ports for mongo/redis/es
I have the same problem, but not only with subprojects: for all the projects I get this blank overview tab, as shown in the screenshot. It only worked for one project, which I created one or two weeks ago under 0.17
Hi AgitatedDove14 , thanks for the answer! I will try adding multiprocessing_context='forkserver' to the DataLoader. In the issue you linked, nirraviv mentioned that forkserver was slower and shared a link to another issue https://github.com/pytorch/pytorch/issues/15849#issuecomment-573921048 where someone implemented a fast variant of the DataLoader to overcome the speed problem.
Did you experience any drop in performance using forkserver? If yes, did you test the variant suggested i...
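For reference, a minimal sketch of passing that context to a PyTorch DataLoader (toy dataset, parameters are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.arange(100, dtype=torch.float32))

    # 'forkserver' avoids fork()-ing a process that already has background
    # threads running, at the cost of slower worker startup.
    loader = DataLoader(
        dataset,
        batch_size=16,
        num_workers=4,
        multiprocessing_context="forkserver",
    )

    for batch in loader:
        pass  # training step would go here
```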
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached --gpus 1 > ~/trains-agent.startup.log 2>&1