My bad, Alpine is so light it doesn't have bash
But I see in the agent logs: Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...
You already fixed the problem with pyjwt in the newest version of clearml/clearml-agent, so all good
CostlyOstrich36 How is clearml-session setting the ssh config?
this is the last line, same as before
Yes, but a minor one. I would need to do more experiments to understand what is going on with pip skipping some packages but reinstalling others.
Sorry, I refreshed the page and it's gone
Sure, just sent you a screenshot in PM
` ssh my-instance
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:O2++ST5lAGVoredT1hqlAyTowgNwlnNRJrwE8cbM...
Both ^^, I already adapted the code for GCP and I was planning to adapt it to Azure now
so what worked for me was the following startup userscript:
` #!/bin/bash
sleep 120
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get update
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done
sudo apt-get install -y python3-dev python3-pip gcc git build-essential...
Although task.data.last_iteration is correct when resuming, there is still this doubling effect when logging metrics after resuming
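For reference, a minimal sketch of the resume path being discussed; `continue_last_task=True` and `Task.set_initial_iteration()` are assumptions about how the iteration offset could be applied (not a confirmed fix for the doubling), and the project/task names are placeholders:
```python
from clearml import Task

# Reopen the previous task instead of creating a new one
task = Task.init(
    project_name="my_project",      # placeholder names
    task_name="my_experiment",
    continue_last_task=True,
)

# task.data.last_iteration is correct at this point; passing it as the
# initial-iteration offset is an assumption on how to keep new reports
# from overlapping the old ones on the iteration axis
start = task.data.last_iteration
task.set_initial_iteration(start)

logger = task.get_logger()
for step in range(100):                      # local step count restarts at 0
    logger.report_scalar(title="loss", series="train",
                         value=1.0 / (step + 1), iteration=step)
```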
automatically promote models to be served from within clearml
Yes!
I hit F12 to check projects.get_all_ex, but nothing is fired; I guess the web UI is just frozen in some weird state
Nice, thanks!
I will try with that and keep you updated
That would be amazing!
AgitatedDove14 I see in https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21 that a key pair is hardcoded in the repo. Is it being used to ssh into the instance?
I have a mental model of the clearml-agent as a module to spin up my code somewhere, and the Python version running my code should not depend on the Python version running the clearml-agent (especially for experiments running in containers)
Yes, actually that's what I am doing, because I have a task C depending on tasks A and B. Since a Task cannot have two parents, I retrieve one task ID (task A) as the parent ID and the other one (the ID of task B) as a hyper-parameter, as you described
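For illustration, a minimal sketch of that wiring (project/task names and the IDs are placeholders):
```python
from clearml import Task

# Placeholder IDs of the two upstream tasks
task_a_id = "<task_a_id>"
task_b_id = "<task_b_id>"

# Task C: task A becomes the (single) parent...
task_c = Task.init(project_name="my_project", task_name="task_c")
task_c.set_parent(task_a_id)

# ...and task B is recorded as a plain hyper-parameter
params = {"upstream_task_b_id": task_b_id}
task_c.connect(params)

# Later, both upstream tasks can be fetched again
task_a = Task.get_task(task_id=task_a_id)
task_b = Task.get_task(task_id=params["upstream_task_b_id"])
```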
but post_packages does not reinstall version 1.7.1
In the Execution tab I see the old commit; in the logs, I see an empty branch and the old commit
And since I ran the task locally with Python 3.9, it used that version in the Docker container
Hi AgitatedDove14 , so I ran 3 experiments:
- One with my current implementation (using "fork")
- One using "forkserver"
- One using "forkserver" + the DataLoader optimization
I sent you the results via PM, here are the outcomes:
- fork -> 101 mins, low RAM usage (5 GB, constant), almost no IO
- forkserver -> 123 mins, high RAM usage (16 GB, fluctuating), high IO
- forkserver + DataLoader optimization -> 105 mins, high RAM usage (from 28 GB to 16 GB), high IO
(a sketch of how these variants could be set up follows below)
CPU/GPU curves are the same for the 3 experiments...
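For reference, a minimal sketch of how the three variants could be set up; mapping "the DataLoader optimization" to persistent_workers/pin_memory is an assumption, and the dataset here is just a stand-in:
```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Variants 2 and 3: switch the worker start method from the default "fork"
    mp.set_start_method("forkserver", force=True)

    # Stand-in dataset; the real one is whatever the experiments used
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

    # Variant 3: "DataLoader optimization" assumed here to mean keeping the
    # workers alive across epochs (persistent_workers) plus pinned memory
    loader = DataLoader(dataset, batch_size=32, num_workers=4,
                        persistent_workers=True, pin_memory=True)

    for epoch in range(2):
        for features, target in loader:
            pass  # training step would go here
```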
I managed to do it by using logger.report_scalar, thanks!
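For completeness, a minimal sketch of the report_scalar call; the project/task names, titles, series, and values are placeholders:
```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="manual_scalars")  # placeholder names
logger = task.get_logger()

# One point per call: the title groups a plot, the series names the curve
for step in range(10):
    logger.report_scalar(title="accuracy", series="validation",
                         value=0.1 * step, iteration=step)
```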
And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
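For reference, the rank-0-only variant in question would look roughly like this; this assumes PyTorch-Ignite's ClearMLLogger and ignite.distributed, the names are placeholders, and whether this is the recommended pattern is exactly the open question:
```python
import ignite.distributed as idist
from ignite.engine import Engine, Events
from ignite.contrib.handlers.clearml_logger import ClearMLLogger


def train_step(engine, batch):
    return {"loss": 0.0}  # placeholder training step


trainer = Engine(train_step)

# Rank-0-only variant: just the main process creates the ClearML task/logger
if idist.get_rank() == 0:
    clearml_logger = ClearMLLogger(project_name="my_project",   # placeholder names
                                   task_name="ddp_run")
    clearml_logger.attach_output_handler(
        trainer,
        event_name=Events.ITERATION_COMPLETED,
        tag="training",
        output_transform=lambda out: {"loss": out["loss"]},
    )
```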
Now I am trying to restart the cluster with docker-compose, specifying the last volume; how can I do that?
I mean, inside a parent, do not show the project [parent] if there is nothing inside