Yes, I am able to clone locally on the same server the agent is running on. However, I do it using SSH auth.
Locally I have a conda env with some packages and a basic requirements file.
I am running this thing:
` from clearml import Task, Dataset
task = Task.init(project_name='Adhoc', task_name='Dataset test')
task.execute_remotely(queue_name="gpu")
from config import DATASET_NAME, CLEARML_PROJECT
print('Getting dataset')
dataset_path = Dataset.get(
    dataset_name=DATASET_NAME,
    dataset_project=CLEARML_PROJECT,
).get_local_copy()  # .get_mutable_local_copy(DATASET_NAME)
print('Dataset path', d...
By freezing I mean that after the pip package installation (pictured in the screenshot) nothing happens. The screen hangs forever, and there is no other output anywhere, including the web UI.
Here’s the agent config. It’s basically default
https://justpaste.it/4ozm3
(agent) adamastor@adamastor:~/clearml_agent$ python -c "import torch; print(torch.__version__)"
1.12.1
(base) boris@adamastor:~/clearml_config$ clearml-agent --version
CLEARML-AGENT version 1.4.0
Despite having manually installed this torch version, during task execution agent still tries to install it somehow and fails:
INFO:clearml_agent.commands.worker:Downloading "
" to pip cache
Collecting torch==1.12.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
INFO:clearml_agent.commands.worker:Downloading "
` " to pip cache
Collecting torchvision==0.13.1+cu116
File was...
I mean, if I enter my host machine's SSH password, it works. But we will disable password auth in the future, so that's not an option.
I guess this pip package installation happens as part of docker build
Btw it seems the docker runs in network=host
CostlyOstrich36 CLEARML-AGENT version 1.3.0
Should "realmodelonly.pkl" be the full path, or just the file name?
On the agent side it’s trying to install different pytorch versions (even though the env already has it all configured), then fails with torch_<something>.whl is not a valid wheel for this system
I have no idea what it is doing
What I am seeing is that the agent always fails while trying to install some packages, even though I am not asking it to install anything.
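One way to narrow down a "not a valid wheel for this system" error is to compare the tags in the wheel filename with the interpreter that actually runs inside the agent's environment. The sketch below is only a rough check under assumptions: the helper names are hypothetical, it looks at the Python tag alone (pip also checks the ABI and platform tags), and nothing in this thread confirms that a tag mismatch is the real cause here.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated parts (PEP 427 layout)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform: 5 parts, or 6 with a build tag
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],
        "abi_tag": parts[-2],
        "platform_tag": parts[-1],
    }


def matches_interpreter(filename, version_info=sys.version_info):
    """Rough check: does the wheel's Python tag fit this interpreter version?"""
    major, minor = version_info[0], version_info[1]
    accepted = {f"cp{major}{minor}", f"py{major}{minor}", f"py{major}"}
    # Compressed tag sets such as "py2.py3" are separated by dots.
    wheel_tags = set(parse_wheel_filename(filename)["python_tag"].split("."))
    return bool(accepted & wheel_tags)
```

Running `matches_interpreter(...)` with the cached wheel's filename, using the same Python that the agent uses for the task, shows whether the `cp310` wheel is being offered to a different interpreter version.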
AgitatedDove14 With --debug
I see that after installing packages there is an endless stream of this:
` Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842e8be0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnec...
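The repeated "Connection refused" on `/auth.login` suggests the process cannot open a TCP connection to the API server address it was given. A minimal sketch for checking that from the same environment (for example, from inside the container) is below. The function name is hypothetical, and the host and port must be taken from the `api.api_server` value in the clearml.conf that the task actually sees.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Example call (placeholder values, replace with the configured API server):
# print(can_connect("my-clearml-server", 8008))
```

If this returns False inside the container but True on the host, the address in the container's configuration is probably not reachable from there (for example, a `localhost` address that points at the container itself).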
Also manually installing this torch version succeeds:
` (base) boris@adamastor:~$ python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Processing ./.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions in ./miniconda3/lib/python3.10/site-packages (from torch==1.12.1+cu116) (4.3.0)
Installing collected packages: torch
Attempting uninstall: torch
...
AgitatedDove14
made a new one:
https://pastebin.com/LxLFk7py
I don't have a short version.
I am using community clearml. How do I find out my version?
But what should I do? It does not work, it says incorrect password as you can see
Yes, I created a token and put it into agent.git_pass
The agent is started with this command:
clearml-agent --debug daemon --queue gpu --gpus 0 --foreground --docker <gitlab org registry>/project-precog/clearml_config
I start clearml-session on my mac this way:
clearml-session --queue gpu --docker registry.gitlab.com/periplo-innovation/project-precog/clearml_config
I resolved the issues by making my own docker image and making all envs the same:
- The env that runs clearml-agent
- The docker env for running tasks in
- The env that requests task execution (my client)
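To check that the three environments really match, one option is to dump the installed package versions in each and compare them. This is a minimal sketch using only the standard library; the function names are hypothetical and it compares version strings only, not build variants such as CUDA-specific wheels beyond what the version string shows.

```python
from importlib import metadata


def installed_packages():
    """Return {lowercased distribution name: version} for the current environment."""
    result = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            result[name.lower()] = dist.version
    return result


def diff_packages(a, b):
    """Return {name: (version_in_a, version_in_b)} for entries that differ.

    None means the package is missing from that environment.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

In practice, `installed_packages()` would be run once per environment (agent, docker image, client), saved to a file such as JSON, and the saved dictionaries passed to `diff_packages` in pairs.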
I can telnet the port from my mac:
(base) *[main][~/Documents/plant_age]$ telnet 192.168.1.55 10022
Trying 192.168.1.55...
Connected to 192.168.1.55.
Escape character is '^]'.
SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1
^C
Is there a way to debug what is happening?
The issue disappeared after I switched from docker mode to pip mode