The task log is here:
the log on my local machine is here:
Sure, will send in a few min when it executes
I can telnet the port from my mac:
(base) *[main][~/Documents/plant_age]$ telnet 192.168.1.55 10022
Trying 192.168.1.55...
Connected to 192.168.1.55.
Escape character is '^]'.
SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1
^C
Is there a way to check if the port is accessible from my local machine?
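For reference, a minimal sketch of a reachability check that does roughly what the telnet test does: it attempts a plain TCP connection and reports whether it succeeded. The function name `port_is_open` and the 3-second timeout are illustrative choices, not part of ClearML.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure and timeout.
        return False
```

Calling `port_is_open("192.168.1.55", 10022)` from the machine in question tells you only that the TCP handshake works from there; it says nothing about whether the service behind the port accepts your credentials.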
Btw, it seems the Docker container runs with network=host
Locally I have a conda env with some packages and a basic requirements file.
I am running this thing:
` from clearml import Task, Dataset
task = Task.init(project_name='Adhoc', task_name='Dataset test')
task.execute_remotely(queue_name="gpu")
from config import DATASET_NAME, CLEARML_PROJECT
print('Getting dataset')
dataset_path = Dataset.get(
dataset_name=DATASET_NAME,
dataset_project=CLEARML_PROJECT,
).get_local_copy()#.get_mutable_local_copy(DATASET_NAME)
print('Dataset path', d...
What I am seeing is that the agent always fails while trying to install some packages, even though I am not asking it to install anything
CostlyOstrich36 in installed packages it has:
` # Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
Pillow == 9.2.0
clearml == 1.7.1
minio == 7.1.12
numpy == 1.23.1
pandas == 1.5.0
scikit_learn == 1.1.2
tensorboard == 2.10.1
torch == 1.12.1
torchvision == 0.13.1
tqdm == 4.64.1 `
Which is the same as I have locally and on the server that runs clearml-agent
On the agent side it’s trying to install different pytorch versions (even though the env already has it all configured), then fails with torch_<something>.whl is not a valid wheel for this system
Here’s the agent config. It’s basically default
https://justpaste.it/4ozm3
Pytorch is configured on the machine that’s running the agent. It’s also in requirements
(agent) adamastor@adamastor:~/clearml_agent$ python -c "import torch; print(torch.__version__)"
1.12.1
Let me get the exact error for you
For a hacky way you can do docker ps
and see the docker run command. I believe it contains the task id, so you can grep by task id
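A rough sketch of that grep, assuming the task id really does appear somewhere in the untruncated `docker ps` output (this is the belief stated above, not something verified here). The helper names are hypothetical; `--no-trunc` and `--format` are standard `docker ps` options.

```python
import subprocess


def list_container_lines() -> list[str]:
    """Return one tab-separated line (id, name, command) per running container."""
    result = subprocess.run(
        ["docker", "ps", "--no-trunc",
         "--format", "{{.ID}}\t{{.Names}}\t{{.Command}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def containers_matching(ps_lines: list[str], task_id: str) -> list[str]:
    """Keep only the lines that mention the given task id (plain substring match)."""
    return [line for line in ps_lines if task_id in line]
```

Usage would be `containers_matching(list_container_lines(), "<task id>")`. If nothing matches, the id may simply not be part of the container command on your setup.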
CostlyOstrich36 CLEARML-AGENT version 1.3.0
Agent is running in docker mode. The host OS is ubuntu
Freezing means that after the pip package installation, shown in the screenshot, nothing happens. This screen hangs forever. No other output anywhere, including the web UI
I guess this pip package installation happens as part of docker build
The image I am using is pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
Definitely not, the machine has 5 TB and is a recent clean install
(But in venv mode it also hangs the same way)
Is there a way to debug what is happening?
I tried it.
This time the agent was run with the Docker image python ( https://hub.docker.com/_/python )
Gets stuck on `Installing collected packages: six, python-dateutil, pathlib2, psutil, attrs, pyrsistent, jsonschema, idna, chardet, certifi, urllib3, requests, PyYAML, pyparsing, pyjwt, pyhocon, orderedmultidict, furl, future, platformdirs, filelock, distlib, virtualenv, clearml-agent`
ps aux inside the container reads
` (base) boris@adamastor:~$ docker exec -it angry_edison bash
root@041c0736c...
So the only process is something called `/usr/local/bin/python3.10 -u -m clearml_agent execute`.
So I guess pip install finished working
But the task is evidently not being executed.
Good idea. I can just ssh into the container of task execution, right?
Is there some minimal example of a docker env agent I can run, just to see that it works?
AgitatedDove14 With --debug
I see that after installing packages there is an endless stream of this:
` Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842e8be0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnec...
This issue was resolved by setting the correct clearml.conf
(replacing localhost with a public hostname for the server) 🙂
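For anyone hitting the same endless `Connection refused` retries: `localhost` always resolves to the machine (or container) the agent process runs on, so a config pointing at `localhost` only works when the server is on that same host. Below is a small sketch that scans a config file's text for URLs pointing at a loopback address. The function name `find_local_urls` is made up for illustration, and it does a plain regex scan rather than parsing the config format properly.

```python
import re
from urllib.parse import urlparse

# Hostnames that point back at the machine the process itself runs on.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def find_local_urls(conf_text: str) -> list[str]:
    """Return every http(s) URL in conf_text whose host is a loopback name."""
    urls = re.findall(r"https?://[^\s\"']+", conf_text)
    return [url for url in urls if urlparse(url).hostname in LOCAL_HOSTS]
```

Usage, assuming the config lives at `~/clearml.conf` (adjust the path if yours is elsewhere): `find_local_urls(open(os.path.expanduser("~/clearml.conf")).read())`, after `import os`. Any URL it returns is a candidate for replacement with a hostname the agent machine can actually reach.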
AgitatedDove14 This example does not specify how to start a clearml-agent with docker such that it actually executes the task