Reputation
Badges 1
75 × Eureka!I can telnet the port from my mac:(base) *[main][~/Documents/plant_age]$ telnet 192.168.1.55 10022 Trying 192.168.1.55... Connected to 192.168.1.55. Escape character is '^]'. SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1 ^C
The issue was that nvidia-docker2 was not installed on the machine where I was trying to run the agent. Following this guide fixed it:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
AgitatedDove14
made a new one:
https://pastebin.com/LxLFk7py
Freezing means that after the pip packages installation, pictured on screenshot, nothing happens. This screen hangs forever. No other output anywhere, including the web UI
Sure, will so tomorrow
I tried it.
This time agent was run with docker image python ( https://hub.docker.com/_/python )
Gets stuck onInstalling collected packages: six, python-dateutil, pathlib2, psutil, attrs, pyrsistent, jsonschema, idna, chardet, certifi, urllib3, requests, PyYAML, pyparsing, pyjwt, pyhocon, orderedmultidict, furl, future, platformdirs, filelock, distlib, virtualenv, clearml-agent
ps aux inside the container reads
` (base) boris@adamastor:~$ docker exec -it angry_edison bash
root@041c0736c...
I also use TB.
I solved the issue by implementing my own ClearML logger
Locally I have a conda env with some packages and a basic requirements file.
I am running this thing:
` from clearml import Task, Dataset
task = Task.init(project_name='Adhoc', task_name='Dataset test')
task.execute_remotely(queue_name="gpu")
from config import DATASET_NAME, CLEARML_PROJECT
print('Getting dataset')
dataset_path = Dataset.get(
dataset_name=DATASET_NAME,
dataset_project=CLEARML_PROJECT,
).get_local_copy()#.get_mutable_local_copy(DATASET_NAME)
print('Dataset path', d...
On the agent side it’s trying to install different pytorch versions (even though the env already has it all configured), then fails with torch_<something>.whl is not a valid wheel for this system
I guess I am out of ideas. The config is wrong somewhere. Maybe double check all the configs? It’s taking the value from somewhere!
Here’s the agent config. It’s basically default
https://justpaste.it/4ozm3
The agent is started with this command:clearml-agent --debug daemon --queue gpu --gpus 0 --foreground --docker <gitlab org registry>/project-precog/clearml_config
Good idea. I can just ssh into the container of task execution, right?
When trying it I realized that my local clearml.conf had the old hostnames still ( adamastor.gaiavf.local ). Now your script returns the proper value of http://adamastor-office.periploinnovation.com:8081 . I will see if it works now!