GentleSwallow91 how come it does not already find the correct pytorch version inside the docker ? whats the clearml-agent version you are using ?
hmm that is odd, it should have detected it, can you verify the issue still exists with the latest RC?pip3 install clearml-agent==1.2.4rc3
and this is inside a container to check that package is installed:docker run -it --rm torch2022 pip show torch
Name: torch Version: 1.11.0 Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration Home-page:
Author: PyTorch Team Author-email: packages@pytorch.org License: BSD-3 Location: /opt/conda/lib/python3.8/site-packages Requires: typing_extensions Required-by: torchmetrics, pytorch-lightning, torchvision, torchtext, torchelastic
I build my own image on top of pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel
This time it runs smoothly - here's the output:
` Local file not found [torch @ file:///home/testuser/.clearml/pip-download-cache/cu113/torch-1.11.0%2Bcu113-cp39-cp39-linux_x86_64.whl], references removed
Local file not found [torchvision @ file:///home/testuser/.clearml/pip-download-cache/cu113/torchvision-0.12.0%2Bcu113-cp39-cp39-linux_x86_64.whl], references removed
Adding venv into cache: /home/nino/.clearml/venvs-builds/3.9
Running task id [b15553c045ab4c3283bbdb040ec19f1f]:
[src/models]$ /home/testuser/.clearml/venvs-builds/3.9/bin/python -u train.py
Summary - installed python packages:
pip:
...
- torch @ file:///home/testuser/.clearml/pip-download-cache/cu113/torch-1.11.0%2Bcu113-cp39-cp39-linux_x86_64.whl
- torchmetrics==0.8.2
- torchvision @ file:///home/testuser/.clearml/pip-download-cache/cu113/torchvision-0.12.0%2Bcu113-cp39-cp39-linux_x86_64.whl
...
Environment setup completed successfully
Starting Task Execution: `
Hi AgitatedDove14
Thanks for the update.
Well, it's a pain... I use specifically pytorch docker image and still agent will download it?
My image is build based on FROM pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel
And a portion of agent log on top of that image:Package(s) not found: torch Torch CUDA 113 download page found Found PyTorch version torch==1.11.0 matching CUDA version 113 Package(s) not found: torchvision Found PyTorch version torchvision==0.12.0 matching CUDA version 113 Collecting torch==1.11.0+cu113 Downloading
(1637.0 MB)
So there is no way to change that?
Woot woot!
awesome, this RC is stable you can feel free to use it, the official release is probably due to be out next week :)
Hi GentleSwallow91
I am very much concerned with docker container spin up time.
To accelerate spin up time (mostly pip install) use the venv cahing (basically it will store a cache of the entire installed venv so it oes not need to reinstall it)
Unmark this line:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L116
The problem above could be that I used a non-root user to train a model and all packages are installed for non-root user but clearML agent runs container as a root user.
You can specify a user access folder instead of the "/root/" home folder here:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L241
clearml-agent --version CLEARML-AGENT version 1.2.3