automatically promote models to be served from within clearml
Yes!
I hit F12 to check projects.get_all_ex but nothing is fired, I guess the web ui is just frozen in some weird state
Nice, thanks!
I will try with that and keep you updated
SuccessfulKoala55 I found the issue thanks to you: I had changed the domain slightly but didn't update the apiserver.auth.cookies.domain setting. I updated it, restarted, and now it works. Thanks!
That would be amazing!
Oof now I cannot start the second controller in the services queue on the same second machine, it fails with
` Processing /tmp/build/80754af9/cffi_1605538068321/work
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1605538068321/work'
clearml_agent: ERROR: Could not install task requirements!
Command '['/home/machine/.clearml/venvs-builds.1.3/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r'...
AgitatedDove14 I see https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21= that a key pair is hardcoded in the repo. Is it being used to ssh to the instance?
I have a mental model of the clearml-agent as a module that spins up my code somewhere, and the python version running my code should not depend on the python version running the clearml-agent (especially for experiments running in containers)
Yes, actually that's what I am doing, because I have a task C depending on tasks A and B. Since a Task cannot have two parents, I set one task ID (task A) as the parent ID and pass the other one (the ID of task B) as a hyper-parameter, as you described
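A minimal sketch of that workaround, assuming the `task` object behaves like a clearml `Task` (it only needs `set_parent(...)` and `connect(...)`). The helper name `link_upstream_tasks`, the `"upstream"` section name, and the `task_b_id` key are made up for illustration:

```python
def link_upstream_tasks(task, parent_id, extra_ids):
    """Record one upstream task as the parent and the others as hyper-parameters.

    `task` is assumed to expose set_parent() and connect() like a clearml Task.
    `extra_ids` is a dict such as {"task_b_id": "<ID of task B>"} (hypothetical key).
    """
    # A task has a single parent slot, so only one upstream ID can go here.
    task.set_parent(parent_id)
    # The remaining upstream IDs are stored as a hyper-parameter section,
    # so they stay visible and editable alongside the other parameters.
    return task.connect(dict(extra_ids), name="upstream")
```

With a real task this would be called as `link_upstream_tasks(Task.init(...), task_a_id, {"task_b_id": task_b_id})`, and task C would read the second ID back from the returned dict.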
but post_packages does not reinstall version 1.7.1
In the execution tab, I see the old commit; in the logs, I see an empty branch and the old commit
And since I ran the task locally with python3.9, it used that version in the docker container
Hi AgitatedDove14 , so I ran 3 experiments:
- One with my current implementation (using "fork")
- One using "forkserver"
- One using "forkserver" + the DataLoader optimization
I sent you the results via PM; here are the outcomes:
- fork: 101 mins, low RAM usage (constant 5 GB), almost no IO
- forkserver: 123 mins, high RAM usage (16 GB, with fluctuations), high IO
- forkserver + DataLoader optimization: 105 mins, high RAM usage (from 28 GB to 16 GB), high IO
CPU/GPU curves are the same for the 3 experiments...
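For anyone reproducing the comparison, here is a rough sketch of how the start method can be switched between runs using only the standard library. The helper name `pick_context` is made up, and the DataLoader hint in the comment assumes PyTorch's `multiprocessing_context` argument; this is not the exact code used for the three experiments:

```python
import multiprocessing as mp


def pick_context(preferred="forkserver"):
    """Return a multiprocessing context for `preferred`.

    Falls back to the platform default when the method is not supported
    (for example "fork" and "forkserver" are not available on Windows).
    """
    available = mp.get_all_start_methods()  # the first entry is the default
    method = preferred if preferred in available else available[0]
    return mp.get_context(method)


ctx = pick_context("forkserver")
print("using start method:", ctx.get_start_method())
# With PyTorch, the context could then be handed to the loader, e.g.
# DataLoader(dataset, num_workers=4, multiprocessing_context=ctx)
```

Using a context object instead of `mp.set_start_method(...)` keeps the choice local to the data loading code, so the rest of the process is unaffected.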
I opened an issue ( https://github.com/pytorch/ignite/issues/2343 ) in ignite's repo and a PR ( https://github.com/pytorch/ignite/pull/2344 ), could you please have a look? There might be a bug in clearml Task.init in distributed envs
I managed to do it by using logger.report_scalar, thanks!
/data/shared/miniconda3/bin/python /data/shared/miniconda3/bin/clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
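To try the rank-0-only variant, a small guard like the following could be used. It is a sketch that assumes the launcher exports a `RANK` environment variable (as torchrun does); the helper name `is_main_process` is made up, and this does not settle which of the two options is the right one:

```python
import os


def is_main_process(env=None):
    """Return True when this process is global rank 0.

    Assumes the launcher exports RANK; when the variable is missing
    (a plain single-process run), the process is treated as the main one.
    """
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


if is_main_process():
    # Only this process would create and attach the ClearMLLogger here.
    print("attaching logger on rank 0")
```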
Now I am trying to restart the cluster with docker-compose and specifying the last volume, how can I do that?
I mean, inside a parent, do not show the project [parent] if there is nothing inside
line 13 is empty
I now have a different question: when installing torch from wheel files, am I guaranteed to get the matching CUDA library and cuDNN bundled together?
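One way to see what a given install reports is sketched below. It only prints the versions torch itself was built against and does not prove anything about the driver on the machine; the helper name `describe_torch_build` is made up:

```python
def describe_torch_build():
    """Return the versions reported by the installed torch build, or None.

    "cuda" and "cudnn" are None for CPU-only builds.
    """
    try:
        import torch
    except ImportError:
        return None  # torch is not installed in this environment
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
    }


print(describe_torch_build())
```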
AgitatedDove14 So in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping class I see that some information is logged (in the __call__ function), and I would like to have it logged by clearml as well
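A possible approach, sketched with the standard `logging` module only: attach a handler that forwards each record to an arbitrary callable. The class and function names are made up, and the logger name `"ignite.handlers.early_stopping.EarlyStopping"` in the usage note is an assumption about how ignite names that logger:

```python
import logging


class ForwardingHandler(logging.Handler):
    """Logging handler that passes each formatted message to a callable sink."""

    def __init__(self, sink, level=logging.INFO):
        super().__init__(level)
        self._sink = sink

    def emit(self, record):
        try:
            self._sink(self.format(record))
        except Exception:
            self.handleError(record)


def forward_logger(logger_name, sink, level=logging.INFO):
    """Attach a ForwardingHandler to the named logger and return the handler."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)
    handler = ForwardingHandler(sink, level)
    logger.addHandler(handler)
    return handler
```

With clearml, the sink could be `task.get_logger().report_text`, for example `forward_logger("ignite.handlers.early_stopping.EarlyStopping", task.get_logger().report_text)`. It is probably safer to call this after the EarlyStopping object is created, since ignite may configure that logger itself.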
Thanks SuccessfulKoala55
AgitatedDove14 If I explicitly call task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0) , it logs one value per process as expected, so reporting works
AgitatedDove14 After investigation, another program on the machine consumed all the available memory, most likely causing the OS to kill the agent/task
my docker-compose for the master node of the ES cluster is the following:
` version: "3.6"
services:
  elasticsearch:
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: clearml-es
      cluster.initial_master_nodes: clearml-es-n1, clearml-es-n2, clearml-es-n3
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      clust...
Very good job! One note: in this version of the web-server, the experiment type logos are all blank. What was the reason for changing them? Having a color code in the logos helps a lot to quickly check the nature of the different experiment tasks, doesn't it?
Yes, thanks!