Oof, now I cannot start the second controller in the services queue on that same second machine, it fails with:
```
Processing /tmp/build/80754af9/cffi_1605538068321/work
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1605538068321/work'
clearml_agent: ERROR: Could not install task requirements!
Command '['/home/machine/.clearml/venvs-builds.1.3/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r'...
```
AgitatedDove14 I see at https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21 that a key pair is hardcoded in the repo. Is it being used to SSH into the instance?
I have a mental model of the clearml-agent as a module that spins up my code somewhere, and the Python version running my code should not depend on the Python version running the clearml-agent (especially for experiments running in containers)
Yes, actually that's what I am doing, because I have a task C depending on tasks A and B. Since a Task cannot have two parents, I set one task ID (task A) as the parent ID and pass the other one (the ID of task B) as a hyper-parameter, as you described 👍
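For reference, something along these lines (an untested sketch; project/task names and the placeholder IDs are illustrative, not my actual code):

```python
from clearml import Task

# Task C: formally parented to task A, with task B referenced via a connected parameter
task_c = Task.init(project_name="my-project", task_name="task C")
task_c.set_parent("<task_A_id>")            # the single allowed parent link

extra = {"task_b_id": "<task_B_id>"}        # second dependency stored as a hyper-parameter
task_c.connect(extra)

# later, task B can be retrieved through the connected ID
task_b = Task.get_task(task_id=extra["task_b_id"])
```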
but post_packages does not reinstall version 1.7.1
In the Execution tab I see the old commit; in the logs I see an empty branch and the old commit
And since I ran the task locally with Python 3.9, it used that version in the Docker container
Hi AgitatedDove14, so I ran 3 experiments:
- One with my current implementation (using "fork")
- One using "forkserver"
- One using "forkserver" + the DataLoader optimization

I sent you the results via private message; here are the outcomes:
- fork -> 101 mins, low RAM usage (5 GB, constant), almost no IO
- forkserver -> 123 mins, high RAM usage (16 GB, with fluctuations), high IO
- forkserver + DataLoader optimization -> 105 mins, high RAM usage (from 28 GB down to 16 GB), high IO

(a rough sketch of the two setups follows below)
CPU/GPU curves are the same for the 3 experiments...
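Roughly what the two configurations look like (a sketch, not my exact training code; the persistent_workers flag is just my guess at what the "DataLoader optimization" could be):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# "forkserver" variant: worker processes are spawned from a clean server process
torch.multiprocessing.set_start_method("forkserver", force=True)

# dummy dataset just to make the snippet self-contained
dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))

# one possible DataLoader tweak: keep workers alive between epochs
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    persistent_workers=True,  # requires num_workers > 0
)
```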
I managed to do it by using logger.report_scalar, thanks!
And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
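Something like this is what I had in mind (untested; assumes the DDP launcher exports LOCAL_RANK, and that the Ignite ClearMLLogger accepts project_name/task_name):

```python
import os

# only the main process creates the task and attaches the Ignite logger
local_rank = int(os.environ.get("LOCAL_RANK", 0))

if local_rank == 0:
    from ignite.contrib.handlers.clearml_logger import ClearMLLogger

    clearml_logger = ClearMLLogger(project_name="my-project", task_name="ddp-training")
    # ... attach OutputHandler / OptimizerParamsHandler to the trainer here
```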
Now I am trying to restart the cluster with docker-compose while specifying the last volume; how can I do that?
I mean: inside a parent project, do not show the [parent] project if there is nothing inside it
line 13 is empty 🤔
I now have a different question: when installing torch from wheel files, am I guaranteed to get the corresponding CUDA libraries and cuDNN bundled with it?
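This is the quick check I use to see what the installed wheel ships (the version numbers in the comments are just examples):

```python
import torch

print(torch.__version__)               # e.g. 1.8.1+cu111 — the +cuXXX suffix names the bundled CUDA
print(torch.version.cuda)              # CUDA runtime version the wheel was built against
print(torch.backends.cudnn.version())  # bundled cuDNN version (None if unavailable)
print(torch.cuda.is_available())
```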
AgitatedDove14 So in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping class I see that some info is logged (in the `__call__` function), and I would like to have that info logged by clearml
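Since EarlyStopping writes through a standard Python logger, one option I was considering is forwarding those records to the task log. A hedged sketch (the logger name is my assumption about Ignite's internal naming, and it assumes Task.init was already called):

```python
import logging
from clearml import Task

task = Task.current_task()  # assumes Task.init() was already called in this process

class ClearMLTextHandler(logging.Handler):
    """Forward standard-logging records to the ClearML task console log."""
    def emit(self, record):
        task.get_logger().report_text(self.format(record))

es_logger = logging.getLogger("ignite.handlers.early_stopping")  # logger name is an assumption
es_logger.setLevel(logging.INFO)   # use DEBUG to also catch the patience-counter messages
es_logger.addHandler(ClearMLTextHandler())
```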
AgitatedDove14 If I explicitly call task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0), this logs one value per process as expected, so reporting works
AgitatedDove14 After investigation, another program on the machine consumed all the available memory, most likely causing the OS to kill the agent/task
my docker-compose for the master node of the ES cluster is the following:
```yaml
version: "3.6"
services:
  elasticsearch:
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: clearml-es
      cluster.initial_master_nodes: clearml-es-n1, clearml-es-n2, clearml-es-n3
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      clust...
```
Very good job! One note: in this version of the web server, the experiment type icons are all blank. What was the reason for changing them? Having a color code in the icons helps a lot to quickly check the nature of the different experiment tasks, doesn't it?
Yes, that’s what I did initially, but eventually I decided it adds too much complexity for little gain; I’d rather drop omegaconf and, if one day clearml supports it out of the box, take advantage of it then
If I don’t start clearml-session, I can easily connect to the agent, so clearml-session is doing something that messes up the SSH config and prevents me from SSHing into the agent afterwards
So when I create a task locally using `task = Task.init(project_name=config.get("project_name"), task_name=config.get("task_name"), task_type=Task.TaskTypes.training, output_uri="s3://my-bucket")`, the artifact is correctly logged remotely, but when I create the task remotely (from an agent) the artifact is logged locally (on the agent machine, not on S3)
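For reference, this is the shape of the local run that does upload to S3 (bucket and names are placeholders, and the artifact is a dummy one):

```python
from clearml import Task

task = Task.init(
    project_name="my-project",
    task_name="my-task",
    task_type=Task.TaskTypes.training,
    output_uri="s3://my-bucket",   # destination for artifacts / models
)

# dummy artifact; locally this ends up on S3 as expected
task.upload_artifact(name="table", artifact_object={"a": [1, 2]})
```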
Oh, I wasn't aware of that new implementation, was it introduced silently? I don't remember reading about it in the release notes! To answer your question: no, for GCP I used the old version, but for Azure I will use this one, and maybe send a PR if the code is clean 👍
I don’t think it is; I was rather wondering how you handled it, to understand potential sources of slowdown in the training code
Ok, I got the following error when uploading the table as an artifact: `ValueError('Task object can only be updated if created or in_progress')`
Is it safe to turn off replication while a reindex operation is happening? The reindexing is rather slow and I am wondering if turning off replication will speed up the process
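This is what I had in mind for toggling replicas around the reindex (a sketch only; the host and index name are placeholders, and I haven't run it yet):

```python
import requests

ES = "http://localhost:9200"
index = "my-reindex-target"   # index name illustrative

# drop replicas to 0 for the duration of the reindex
requests.put(f"{ES}/{index}/_settings",
             json={"index": {"number_of_replicas": 0}})

# ... wait for the reindex to finish ...

# restore the original replica count afterwards
requests.put(f"{ES}/{index}/_settings",
             json={"index": {"number_of_replicas": 1}})
```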