Same, it also returns a ProxyDictPostWrite, which is not supported by OmegaConf.create
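A possible workaround (just a sketch, assuming the proxy comes from task.connect(); not an official fix) is to cast it back to a plain dict before handing it to OmegaConf:
```python
from clearml import Task
from omegaconf import OmegaConf

task = Task.init(project_name="examples", task_name="omegaconf test")  # hypothetical names
params = task.connect({"lr": 0.001, "batch_size": 32})  # returns a ProxyDictPostWrite

# assumption: the proxy behaves like a dict, so a plain copy is enough for OmegaConf
cfg = OmegaConf.create(dict(params))
print(OmegaConf.to_yaml(cfg))
```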
Yes, actually that's what I am doing, because I have a task C depending on tasks A and B. Since a Task cannot have two parents, I retrieve one task id (task A) as the parent id and the other one (the ID of task B) as a hyper-parameter, as you described 👍
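For reference, a minimal sketch of that pattern (project/task names and IDs are placeholders, not the real pipeline):
```python
from clearml import Task

task_a_id = "<ID of task A>"  # placeholder
task_b_id = "<ID of task B>"  # placeholder

task_c = Task.init(project_name="examples", task_name="task C")
task_c.set_parent(task_a_id)                          # first dependency via the parent field
params = task_c.connect({"task_b_id": task_b_id})     # second dependency as a hyper-parameter

task_b = Task.get_task(task_id=params["task_b_id"])   # retrieve task B later on
```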
I can probably have a python script that checks if there are any tasks running/pending, and if not, runs docker-compose down to stop the clearml-server, then uses boto3 to trigger the creation of a snapshot of the EBS volume, waits until it is finished, and then restarts the clearml-server, wdyt?
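Roughly something like this (untested sketch; the volume ID, the compose path and the status filter are assumptions on my side):
```python
import subprocess
import boto3
from clearml import Task

# assumption: task_filter forwards the "status" field to the backend query
active = Task.query_tasks(task_filter={"status": ["in_progress", "queued"]})
if not active:
    # stop the clearml-server
    subprocess.run(["docker-compose", "down"], cwd="/opt/clearml", check=True)

    # snapshot the EBS volume and wait for it to complete
    ec2 = boto3.client("ec2")
    snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",
                               Description="clearml-server backup")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # restart the clearml-server
    subprocess.run(["docker-compose", "up", "-d"], cwd="/opt/clearml", check=True)
```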
I’ve reindexed the data for the logs; now the mappings are correct, but I am missing one month of data. I have literally no idea where this data is or how it disappeared.
Here are the logs of the agent :)
` (base) user@worker:~$ tail -f /tmp/.clearml_agent_daemon_outjdups8t2.txt
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
+----------------------------------+--------+-------+
| id | name | tags |
+----------------------------------+--------+-------+
| 54e4a62a402d5135612ba7b12cfe4e57 | docker | |
+----------------------------------+--------+-------+
Starting infinite tas...
btw task._get_task_property('hyperparams') also gives me ValueError: Task has no hyperparams section defined
(I didn't have this problem so far because I was using ssh keys globally, but I now want to switch to git auth using a Personal Access Token for security reasons)
so what worked for me was the following startup userscript:
` #!/bin/bash
sleep 120
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential...
btw, in the pytorch_distributed_example I see that you call average_gradients, but the pytorch docs ( https://pytorch.org/tutorials/beginner/dist_overview.html ) say: "DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training."
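In other words, with DDP the averaging is supposed to happen inside backward(), something like this (sketch only, assuming the usual torchrun/env setup; this is not the actual example code):
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")  # expects the env vars set by torchrun

model = DDP(torch.nn.Linear(10, 1))      # gradients are all-reduced automatically
out = model(torch.randn(4, 10))
loss = torch.nn.functional.mse_loss(out, torch.randn(4, 1))
loss.backward()                          # DDP overlaps gradient communication here
```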
Hi PompousParrot44 , you could have a Controller task running in the services queue that periodically schedules the task you want to run
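Something along these lines (just a sketch; project/queue names and the interval are placeholders):
```python
import time
from clearml import Task

controller = Task.init(project_name="examples", task_name="periodic scheduler",
                       task_type=Task.TaskTypes.controller)  # run this one in the services queue

while True:
    template = Task.get_task(project_name="examples", task_name="task to schedule")
    cloned = Task.clone(source_task=template)     # clone the template task
    Task.enqueue(cloned, queue_name="default")    # send the clone to an execution queue
    time.sleep(24 * 60 * 60)                      # e.g. once a day
```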
Stopping the server
Editing the docker-compose.yml file, adding the logging section to all services
Restarting the server
Docker-compose freed 10 GB of logs
The parent task is a data_processing task; I therefore retrieve it so that I can then do data_processed = parent_task.artifacts["data_processed"]
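i.e. something like this (sketch; project/task names are placeholders, and I'm assuming task.parent holds the parent task ID):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="training")
parent_task = Task.get_task(task_id=task.parent)                 # the data_processing task
data_processed = parent_task.artifacts["data_processed"].get()   # download + deserialize the artifact
```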
There is no way to filter on long types? I can’t believe it
Yes, it works now! Yay!
and saved locally, which is why the second task, not executed on the same machine, cannot access the file
That's how I would do it; maybe the guys from allegro-ai can come up with a better approach 👍
Ok thanks! And for this?
Would it be possible to support such a use case? (i.e. have the clearml-agent set up a different python version when a task needs it?)
As you can see, more hard waiting (the initial sleep), and then before each apt action it makes sure there is no lock
Sure, it’s because of a very annoying bug that I shared in this thread: https://clearml.slack.com/archives/CTK20V944/p1648647503942759 , which I couldn’t solve so far.
I’m not sure you can downgrade that easily ...
Yeah, that’s what I thought. That’s a bit of a pain for me now; I hope I can find a way to fix the bug somehow
Actually I think I am approaching the problem from the wrong angle
That’s why I said “not really” 😄
Thanks for your input TenseOstrich47 , I was considering using a secret manager now, I guess that's the best option. I can move the secrets wherever I need them to be to make it work 🙂
Hi SmugDolphin23, thanks for the input! Will try now, but that seems hacky: to get it working I have to specify python3.8 in two places:
once in the agent config file (agent.default_python is already python3.8, but it seems to be ignored)
and by making sure it is available (using the python:3.8 docker image)
Is there a way to prevent this redundancy? I.e. if I want to change the python version, can I control it from a single place?
SuccessfulKoala55 They do have the right filepath, e.g.: https://***.com:8081/my-project-name/experiment_name.b1fd9df5f4d7488f96d928e9a3ab7ad4/metrics/metric_name/predictions/sample_00000001.png
This works well when I run the agent in virtualenv mode (i.e. removing --docker )
Hi AgitatedDove14 , so I ran 3 experiments:
One with my current implementation (using "fork")
One using "forkserver"
One using "forkserver" + the DataLoader optimization
I sent you the results by private message; here are the outcomes (a sketch of these two settings is below):
fork -> 101 mins, low RAM usage (constant 5 GB), almost no IO
forkserver -> 123 mins, high RAM usage (16 GB, with fluctuations), high IO
forkserver + DataLoader optimization -> 105 mins, high RAM usage (from 28 GB down to 16 GB), high IO
CPU/GPU curves are the same for the 3 experiments...
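For completeness, the two settings compared above boil down to something like this (sketch only; the dataset and worker count are stand-ins, and I'm assuming persistent workers as one plausible example of the "DataLoader optimization"):
```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    mp.set_start_method("forkserver")  # instead of the default "fork"

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))  # stand-in dataset
    loader = DataLoader(
        dataset,
        batch_size=32,
        num_workers=4,
        persistent_workers=True,  # keep workers alive between epochs
    )
    for x, y in loader:
        pass
```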
Awesome! (Broken link in migration guide, step 3: https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/ )