The issue was that nvidia-docker2
was not installed on the machine where I was trying to run the agent. Following this guide fixed it:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
Is there some minimal example of a docker env agent I can run, just to see that it works?
AgitatedDove14
made a new one:
https://pastebin.com/LxLFk7py
Sure, will send in a few min when it executes
Ok, it makes sense. But it’s running in docker mode and it is trying to ssh into the host machine and failing
"realmodelonly.pkl"
should be the full path, or just the file name?
AgitatedDove14 thanks!
(base) boris@adamastor:~/clearml_config$ clearml-agent --version
CLEARML-AGENT version 1.4.0
I also use TB.
I solved the issue by implementing my own ClearML logger
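Not the exact logger from this thread, but a minimal sketch of what such a wrapper can look like, using the standard ClearML Logger API (the project/task names and values below are just placeholders):
` from clearml import Task

class SimpleClearMLLogger:
    """Tiny wrapper that forwards scalar metrics to the ClearML Logger."""

    def __init__(self, project: str, task_name: str):
        self.task = Task.init(project_name=project, task_name=task_name)
        self.logger = self.task.get_logger()

    def log_scalar(self, name: str, value: float, step: int):
        # the title/series split is a convention choice, not required by ClearML
        self.logger.report_scalar(title=name, series="train", value=value, iteration=step)

# illustrative usage
log = SimpleClearMLLogger("Adhoc", "Custom logger test")
log.log_scalar("loss", 0.42, step=1) `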
CostlyOstrich36 in installed packages it has:
` # Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
Pillow == 9.2.0
clearml == 1.7.1
minio == 7.1.12
numpy == 1.23.1
pandas == 1.5.0
scikit_learn == 1.1.2
tensorboard == 2.10.1
torch == 1.12.1
torchvision == 0.13.1
tqdm == 4.64.1 `
Which is the same as I have locally and on the server that runs clearml-agent.
Locally I have a conda env with some packages and a basic requirements file.
I am running this thing:
` from clearml import Task, Dataset
task = Task.init(project_name='Adhoc', task_name='Dataset test')
task.execute_remotely(queue_name="gpu")
from config import DATASET_NAME, CLEARML_PROJECT
print('Getting dataset')
dataset_path = Dataset.get(
    dataset_name=DATASET_NAME,
    dataset_project=CLEARML_PROJECT,
).get_local_copy()  # .get_mutable_local_copy(DATASET_NAME)
print('Dataset path', d...
Yeah, pytorch is a must. This script is just a test, but after this I need to train models on GPUs
The image I am using is pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
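In case it helps anyone later: a minimal sketch of pinning that image on the task itself, so a docker-mode agent runs the task inside it (the project, task and queue names just mirror the script above):
` from clearml import Task

task = Task.init(project_name='Adhoc', task_name='Dataset test')
# ask a docker-mode agent to run this task inside the given image
task.set_base_docker("pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel")
task.execute_remotely(queue_name="gpu") `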
AgitatedDove14 With --debug
I see that after installing packages there is an endless stream of this:
` Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842e8be0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnec...
On the agent side it’s trying to install different pytorch versions (even though the env already has it all configured), then it fails with "torch_<something>.whl is not a valid wheel for this system"
Agent is running in docker mode. The host OS is ubuntu
Freezing means that after the pip package installation, pictured in the screenshot, nothing happens. The screen hangs forever, and there is no other output anywhere, including in the web UI
This issue was resolved by setting the correct clearml.conf
(replacing localhost with a public hostname for the server) 🙂
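For anyone hitting the same thing, the relevant part of clearml.conf looks roughly like this (the hostname below is a placeholder, the ports are the self-hosted server defaults):
` api {
    web_server: http://my-clearml-server.example.com:8080
    api_server: http://my-clearml-server.example.com:8008
    files_server: http://my-clearml-server.example.com:8081
} `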
Yes, I am able to clone locally on the same server the agent is running on. However I do it using ssh auth
Yes, the git user is correct. It does not display the password of course. I tested and the config is definitely coming from clearml.conf
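To be clear, the part of clearml.conf I mean is roughly this (values are placeholders):
` agent {
    git_user: "my-git-username"
    git_pass: "my-personal-access-token"
    # force_git_ssh_protocol: true  # alternative: have the agent clone over SSH instead
} `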
Still, the error persists
The failure is that it does not even run
I don't have a short version.
I am using community clearml. How do I find out my version?
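(A minimal sketch of one way to check from Python; as far as I know the SDK exposes its version as a module attribute, and the agent version is what clearml-agent --version prints above:)
` import clearml

print(clearml.__version__) `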
Is there a way to debug what is happening?
The task log is here:
The log on my local machine is here:
Is there a way to check if the port is accessible from my local machine?
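(A quick sketch of how that can be tested from the local machine; the host and port below are placeholders for the server address and the port in question:)
` import socket

host, port = "my-clearml-server.example.com", 8008  # placeholders
try:
    with socket.create_connection((host, port), timeout=5):
        print(f"{host}:{port} is reachable")
except OSError as err:
    print(f"{host}:{port} is not reachable: {err}") `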
(But in venv mode it also hangs the same way)