I also use TB.
I solved the issue by implementing my own ClearML logger
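The thread doesn't show that logger, but as a rough sketch of the idea (the class and method names here are mine, not the author's), such a logger could simply wrap ClearML's Logger.report_scalar:
` from clearml import Task

# Hypothetical wrapper, just to illustrate the "own ClearML logger" idea
class MyClearMLLogger:
    def __init__(self, task: Task):
        self._logger = task.get_logger()

    def log_scalar(self, title: str, series: str, value: float, iteration: int):
        # report_scalar is the standard ClearML Logger call for scalar metrics
        self._logger.report_scalar(title=title, series=series, value=value, iteration=iteration)

task = Task.current_task()  # assumes a task has already been initialized
logger = MyClearMLLogger(task)
logger.log_scalar("loss", "train", 0.42, iteration=1) `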
What I am seeing is that the agent always fails trying to install some packages that I am not asking for at all
I have no idea what it is doing
CostlyOstrich36 in installed packages it has:
` # Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
Pillow == 9.2.0
clearml == 1.7.1
minio == 7.1.12
numpy == 1.23.1
pandas == 1.5.0
scikit_learn == 1.1.2
tensorboard == 2.10.1
torch == 1.12.1
torchvision == 0.13.1
tqdm == 4.64.1 `
Which is the same as I have locally and on the server that runs clearml-agent
Yeah, PyTorch is a must. This script is just for testing, but after this I need to train stuff on GPUs
I am doing `clearml-agent --docker … --foreground --gpus 1`
CostlyOstrich36 CLEARML-AGENT version 1.3.0
AgitatedDove14 thanks!
Locally I have a conda env with some packages and a basic requirements file.
I am running this thing:
` from clearml import Task, Dataset
task = Task.init(project_name='Adhoc', task_name='Dataset test')
task.execute_remotely(queue_name="gpu")
from config import DATASET_NAME, CLEARML_PROJECT
print('Getting dataset')
dataset_path = Dataset.get(
    dataset_name=DATASET_NAME,
    dataset_project=CLEARML_PROJECT,
).get_local_copy()  # .get_mutable_local_copy(DATASET_NAME)
print('Dataset path', d...
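For context on the commented-out call in that snippet, here is a hedged sketch of the two dataset copy options, reusing the same config names; using DATASET_NAME as the target folder is just an example:
` from clearml import Dataset
from config import DATASET_NAME, CLEARML_PROJECT

ds = Dataset.get(dataset_name=DATASET_NAME, dataset_project=CLEARML_PROJECT)

# Read-only copy served from ClearML's local cache (what the script above uses)
path = ds.get_local_copy()

# Writable copy extracted into a folder of your choice
# (the folder name here is only an example)
mutable_path = ds.get_mutable_local_copy(target_folder=DATASET_NAME) `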
"realmodelonly.pkl"
should be the full path, or just the file name?
PyTorch is configured on the machine that's running the agent. It's also in the requirements file
The issue was that nvidia-docker2 was not installed on the machine where I was trying to run the agent. Following this guide fixed it:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
Despite having manually installed this torch version, during task execution the agent still tries to install it somehow and fails:
` INFO:clearml_agent.commands.worker:Downloading " " to pip cache
Collecting torch==1.12.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
INFO:clearml_agent.commands.worker:Downloading " " to pip cache
Collecting torchvision==0.13.1+cu116
File was...
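One possible way to steer what the agent tries to install (not necessarily the fix used here; the thread's actual resolution was a custom docker image, below) is to pin the exact versions from the client-side script; Task.add_requirements has to be called before Task.init:
` from clearml import Task

# Pin the versions the agent should install; must run before Task.init
Task.add_requirements("torch", "==1.12.1")
Task.add_requirements("torchvision", "==0.13.1")

task = Task.init(project_name="Adhoc", task_name="Dataset test") `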
I resolved the issues by making my own docker image and making all envs the same:
- The env that runs clearml-agent
- The docker env for running tasks in
- The env that requests task execution (my client)
I mean, if I enter my host machine's SSH password it works. But we will disable password auth in the future, so it's not an option
The task log is here:
The log on my local machine is here:
All ports are open (both agent machine and client machine are working within same VPN)
I don't have a short version.
I am using community clearml. How do I find out my version?
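For the SDK side, the installed clearml package version can be checked directly from Python (this is the package version, not the agent's):
` import clearml
print(clearml.__version__)  # version of the clearml Python package (SDK) `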
For a hacky way, you can do `docker ps` and look at the docker run command. I believe it contains the task ID, so you can grep by the task ID
Ok, it makes sense. But it’s running in docker mode and it is trying to ssh into the host machine and failing
I can telnet the port from my mac:
` (base) *[main][~/Documents/plant_age]$ telnet 192.168.1.55 10022
Trying 192.168.1.55...
Connected to 192.168.1.55.
Escape character is '^]'.
SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1
^C `
Also manually installing this torch version succeeds:
` (base) boris@adamastor:~$ python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Processing ./.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions in ./miniconda3/lib/python3.10/site-packages (from torch==1.12.1+cu116) (4.3.0)
Installing collected packages: torch
Attempting uninstall: torch
...
Here’s the agent config. It’s basically default
https://justpaste.it/4ozm3