Let me get the exact error for you
Here’s the error I get:
https://justpaste.it/7aom5
It’s trying to downgrade PyTorch to 1.12.1 for some reason (why?), using a build for an older CUDA (I have 11.7, but it tries to use PyTorch for CUDA 11.6). Then it crashes.
Btw, it seems the Docker container runs with network=host
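To make the CUDA mismatch concrete, here is a minimal sketch that reads the CUDA build out of a PyTorch wheel version string. It assumes the usual `+cuXYZ` local-version suffix on PyTorch wheels (for example `1.12.1+cu116`); the helper names `cuda_tag_to_version` and `matches_local_cuda` are made up for illustration and are not part of PyTorch or ClearML.

```python
import re
from typing import Optional


def cuda_tag_to_version(torch_version: str) -> Optional[str]:
    """Return '11.6' for '1.12.1+cu116', or None when there is no +cu suffix."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major.
    return f"{digits[:-1]}.{digits[-1]}"


def matches_local_cuda(torch_version: str, local_cuda: str) -> bool:
    """True when the wheel's CUDA build equals the local CUDA major.minor."""
    return cuda_tag_to_version(torch_version) == local_cuda


# The wheel the agent picked (CUDA 11.6) versus the local CUDA (11.7).
print(matches_local_cuda("1.12.1+cu116", "11.7"))  # False
```

With torch available, the same check could be fed `torch.__version__` from inside the container and on the local machine to see which build each side actually resolved.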
Is there some minimal example of a docker env agent I can run, just to see that it works?
It's too much of a hack :)
CostlyOstrich36, under installed packages it has:
` # Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
Pillow == 9.2.0
clearml == 1.7.1
minio == 7.1.12
numpy == 1.23.1
pandas == 1.5.0
scikit_learn == 1.1.2
tensorboard == 2.10.1
torch == 1.12.1
torchvision == 0.13.1
tqdm == 4.64.1 `
Which is the same as what I have locally and on the server that runs clearml-agent
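One way to verify that the list really matches a given environment is to diff it against what is installed there. The following is only a sketch using the standard library's `importlib.metadata`; the function names and the shortened `TASK_PACKAGES` sample are illustrative, and it assumes the list uses the `name == version` format shown above. Name lookups such as `scikit_learn` versus `scikit-learn` may behave differently across Python versions.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(text: str) -> dict:
    """Parse 'name == version' lines, skipping blanks and '#' comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, pinned = line.partition("==")
        pins[name.strip()] = pinned.strip()
    return pins


def diff_against_local(pins: dict) -> dict:
    """Map package name -> (pinned, local) for every package that differs."""
    diffs = {}
    for name, pinned in pins.items():
        try:
            local = version(name)
        except PackageNotFoundError:
            local = None  # not installed in this environment
        if local != pinned:
            diffs[name] = (pinned, local)
    return diffs


# Shortened sample of the task's installed-packages list.
TASK_PACKAGES = """
# Python 3.10.6
torch == 1.12.1
torchvision == 0.13.1
"""

print(diff_against_local(parse_pins(TASK_PACKAGES)))
```

An empty dict means the environment matches the pins exactly. A local torch reported as something like `1.12.1+cu116` would show up as a difference, which is useful here since the CUDA build is the part in question.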
I have no idea what it is doing
The agent is running in docker mode. The host OS is Ubuntu
Yeah, pytorch is a must. This script is a testing one, but after this I need to train stuff on GPUs
I am doing clearml-agent --docker … --foreground --gpus 1
Locally I have a conda env with some packages and a basic requirements file.
I am running this thing:
` from clearml import Task, Dataset
task = Task.init(project_name='Adhoc', task_name='Dataset test')
task.execute_remotely(queue_name="gpu")
from config import DATASET_NAME, CLEARML_PROJECT
print('Getting dataset')
dataset_path = Dataset.get(
dataset_name=DATASET_NAME,
dataset_project=CLEARML_PROJECT,
).get_local_copy()#.get_mutable_local_copy(DATASET_NAME)
print('Dataset path', d...
The failure is that it does not even run
All ports are open (both the agent machine and the client machine are on the same VPN)