Locally I have a conda env with some packages and a basic requirements file.
I am running this thing:
` from clearml import Task, Dataset
task = Task.init(project_name='Adhoc', task_name='Dataset test')
task.execute_remotely(queue_name="gpu")
from config import DATASET_NAME, CLEARML_PROJECT
print('Getting dataset')
dataset_path = Dataset.get(
    dataset_name=DATASET_NAME,
    dataset_project=CLEARML_PROJECT,
).get_local_copy()  # .get_mutable_local_copy(DATASET_NAME)
print('Dataset path', d...
PyTorch is configured on the machine that's running the agent. It's also in the requirements file.
It's too much of a hack :)
I am doing ` clearml-agent --docker … --foreground --gpus 1 `
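Roughly, the full launch command looks like this (a sketch: the image name is just a placeholder, and the queue name matches the one in the script):
` # launch the agent in docker mode, pulling tasks from the "gpu" queue
clearml-agent daemon --queue gpu --docker nvidia/cuda:11.7.1-runtime-ubuntu22.04 --foreground --gpus 1 `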
` (agent) adamastor@adamastor:~/clearml_agent$ python -c "import torch; print(torch.__version__)"
1.12.1 `
Yeah, PyTorch is a must. This script is just a test one, but after this I need to train stuff on GPUs.
CostlyOstrich36 in installed packages it has:
` # Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
Pillow == 9.2.0
clearml == 1.7.1
minio == 7.1.12
numpy == 1.23.1
pandas == 1.5.0
scikit_learn == 1.1.2
tensorboard == 2.10.1
torch == 1.12.1
torchvision == 0.13.1
tqdm == 4.64.1 `
Which is the same as I have locally and on the server that runs clearml-agent.
Here’s the agent config. It’s basically default
https://justpaste.it/4ozm3
Here’s the error I get:
https://justpaste.it/7aom5
It's trying to downgrade PyTorch to 1.12.1 for some reason (why?), using a build for an outdated CUDA version (I have 11.7, it tries to use PyTorch for CUDA 11.6). Finally it crashes.
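If it's auto-detecting the wrong CUDA version, I guess it can be pinned in clearml.conf; a minimal sketch, assuming the agent.cuda_version / agent.cudnn_version keys from the default config:
` agent {
    # force the CUDA/cuDNN versions used when resolving torch wheels
    # (0 means auto-detect; 117 corresponds to CUDA 11.7)
    cuda_version: 117
    cudnn_version: 0
} `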
Let me get the exact error for you
Yes, I created a token and put it into agent.git_pass
Upgraded, the issue persists
Yes, the git user is correct. It does not display the password of course. I tested and the config is definitely coming from clearml.conf
Still, the error persists
Yes, I am able to clone locally on the same server the agent is running on. However, I do it using SSH auth.
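For reference, the relevant part of my clearml.conf looks roughly like this (the values are placeholders); there's also a force_git_ssh_protocol flag that I haven't tried:
` agent {
    # HTTPS token auth (what I'm using now); placeholder values
    git_user: "my-git-user"
    git_pass: "my-personal-access-token"
    # alternative: rewrite HTTPS repo links to SSH (untested on my side)
    force_git_ssh_protocol: false
} `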
The issue disappeared after I switched from docker mode to pip mode
Is there a way to debug what is happening?
@<1523701205467926528:profile|AgitatedDove14> thanks!
Good idea. I can just SSH into the container that's executing the task, right?
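Something like this, I assume (container ID taken from docker ps):
` # find the container the agent spawned, then open a shell in it
docker ps
docker exec -it <container_id> /bin/bash `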
This issue was resolved by setting the correct clearml.conf
(replacing localhost with a public hostname for the server) 🙂
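For the record, the api section ended up looking roughly like this (the hostname is a placeholder):
` api {
    # was pointing at localhost before -- replaced with the public hostname
    api_server: http://my-clearml-server:8008
    web_server: http://my-clearml-server:8080
    files_server: http://my-clearml-server:8081
} `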
AgitatedDove14 This example does not specify how to start a clearml-agent with docker such that it actually executes the task
Freezing means that after the pip package installation, pictured in the screenshot, nothing happens. The screen hangs forever. There is no other output anywhere, including in the web UI.
As a hacky way, you can do ` docker ps ` and look at the docker run command. I believe it contains the task ID, so you can grep by task ID.
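Something along these lines (the task ID is a placeholder):
` # the task id appears in the docker run command of the agent-spawned container
docker ps --no-trunc | grep <task_id> `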
Definitely not, the machine has 5 TB and is a recent clean install.
Looking through history I found this link: None
TL;DR: ClearML doesn't support ` lightning `, but supports ` pytorch_lightning `. Downgrading from the new interface to the old one fixed my issue.
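Concretely, the fix was just switching the imports back to the older package (a sketch, the trainer args are only illustrative):
` # new unified interface -- ClearML auto-logging did not pick this up for me
# import lightning as L
# trainer = L.Trainer(max_epochs=1)

# older interface -- ClearML hooks into this one
import pytorch_lightning as pl
trainer = pl.Trainer(max_epochs=1) `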
(But in venv mode it also hangs the same way.)