Here’s the error I get:
https://justpaste.it/7aom5
It’s trying to downgrade pytorch to 1.12.1 for some reason (why?), using a build for an outdated CUDA (I have 11.7; it tries to use pytorch for CUDA 11.6). Finally, it crashes.
Let me get the exact error for you
This issue was resolved by setting the correct clearml.conf
(replacing localhost with a public hostname for the server) 🙂
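For anyone hitting the same thing: the relevant bit is the `api` section of `clearml.conf`. A minimal sketch, assuming the default ClearML ports (the hostname is the one from this thread):
```
api {
    web_server: http://adamastor-office.periploinnovation.com:8080
    api_server: http://adamastor-office.periploinnovation.com:8008
    files_server: http://adamastor-office.periploinnovation.com:8081
}
```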
The failure is that it does not even run
Yeah, pytorch is a must. This script is just a test, but after this I need to train stuff on GPUs
I am doing `clearml-agent --docker … --foreground --gpus 1`
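In full, the command is roughly of this shape (the `…` above elides the rest; the `daemon` subcommand and the queue name here are my reconstruction):
```bash
# sketch of the invocation -- the "gpu" queue name is an assumption
clearml-agent daemon --queue gpu --docker --foreground --gpus 1
```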
I guess I am out of ideas. The config is wrong somewhere. Maybe double check all the configs? It’s taking the value from somewhere!
Ok, it makes sense. But it’s running in docker mode and it is trying to ssh into the host machine and failing
I don’t understand. The current CUDA version is 11.7, the installed pytorch version is 1.12.1, and torch can access the GPUs; all is fine.
Why does it try to install a different torch version?
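For reference, this is the sanity check I mean (plain PyTorch API, nothing ClearML-specific):
```python
import torch

print(torch.__version__)          # 1.12.1 here
print(torch.version.cuda)         # CUDA version this torch build targets
print(torch.cuda.is_available())  # True -- the driver and GPU are visible
```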
```
(base) boris@adamastor:~$ nvidia-smi
Fri Oct  7 14:16:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name ...
```
The agent is started from a non-root user, if that matters
Sure, will do so tomorrow
When trying it I realized that my local clearml.conf still had the old hostnames (adamastor.gaiavf.local). Now your script returns the proper value of http://adamastor-office.periploinnovation.com:8081 . I will see if it works now!
@AgitatedDove14 thanks!
Pytorch is installed on the machine that’s running the agent. It’s also in the requirements file.
Looking through history I found this link: None
Tl;dr: ClearML doesn’t support `lightning`, but supports `pytorch_lightning`. Downgrading from the new interface to the old one fixed my issue.
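In code, the “downgrade” just means importing the old package (a minimal sketch; the model class is illustrative):
```python
import pytorch_lightning as pl   # works: ClearML hooks this package
# import lightning as L          # did not work for me at the time

class LitModel(pl.LightningModule):
    ...
```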
Yes, the git user is correct. It does not display the password, of course. I tested it, and the config is definitely coming from clearml.conf.
Still, the error persists
For a hacky way, you can do `docker ps` and see the `docker run` command. I believe it contains the task id, so you can grep by task id
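Something like this (the task id placeholder is obviously yours to fill in):
```bash
# show the full, untruncated container commands and filter by task id
docker ps --no-trunc | grep <task-id>
```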
The task log is here:
The log on my local machine is here:
Upgraded, the issue persists
The issue was that `nvidia-docker2` was not installed on the machine where I was trying to run the agent. Following this guide fixed it:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
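The gist of the guide as I applied it (Ubuntu sketch; the repository setup steps are in the link, and the CUDA image tag below is illustrative):
```bash
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# verify that containers can see the GPU
sudo docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu22.04 nvidia-smi
```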
Is there a way to check if the port is accessible from my local machine?
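(E.g. something like this, using the files-server host/port from earlier in the thread, assuming `nc` and `curl` are available:)
```bash
nc -zv adamastor-office.periploinnovation.com 8081
# or over HTTP:
curl -sI http://adamastor-office.periploinnovation.com:8081
```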
Sure, will send in a few min when it executes
All ports are open (both the agent machine and the client machine are working within the same VPN)
I tried it.
This time the agent was run with the docker image python ( https://hub.docker.com/_/python ).
It gets stuck on `Installing collected packages: six, python-dateutil, pathlib2, psutil, attrs, pyrsistent, jsonschema, idna, chardet, certifi, urllib3, requests, PyYAML, pyparsing, pyjwt, pyhocon, orderedmultidict, furl, future, platformdirs, filelock, distlib, virtualenv, clearml-agent`
`ps aux` inside the container reads:
```
(base) boris@adamastor:~$ docker exec -it angry_edison bash
root@041c0736c...
```
Yes, I am able to clone locally on the same server the agent is running on. However, I do it using SSH auth.
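If the agent should also clone over SSH, clearml.conf has an agent option for that; a minimal sketch:
```
agent {
    # rewrite https git URLs to ssh so the agent uses the machine's SSH keys
    force_git_ssh_protocol: true
}
```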
Locally I have a conda env with some packages and a basic requirements file.
I am running this thing:
```python
from clearml import Task, Dataset

task = Task.init(project_name='Adhoc', task_name='Dataset test')
task.execute_remotely(queue_name="gpu")

from config import DATASET_NAME, CLEARML_PROJECT

print('Getting dataset')
dataset_path = Dataset.get(
    dataset_name=DATASET_NAME,
    dataset_project=CLEARML_PROJECT,
).get_local_copy()  # .get_mutable_local_copy(DATASET_NAME)
print('Dataset path', d...
```