Hi GentleSwallow91
I am very much concerned with docker container spin up time.
To accelerate spin-up time (mostly pip install), use venv caching: it stores a cache of the entire installed venv, so it does not need to be reinstalled.
Uncomment this line:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L116
The problem above could be that I used a non-root user to train the model, so all packages are installed for the non-root user, but the ClearML agent runs the container as root.
You can specify a user access folder instead of the "/root/" home folder here:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L241
clearml-agent --version
CLEARML-AGENT version 1.2.3
And this is run inside a container to check that the package is installed:
docker run -it --rm torch2022 pip show torch
Name: torch
Version: 1.11.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page:
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /opt/conda/lib/python3.8/site-packages
Requires: typing_extensions
Required-by: torchmetrics, pytorch-lightning, torchvision, torchtext, torchelastic
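`pip show` only reports what the image's default pip can see. As a rough sketch (not part of clearml-agent; the function name `describe_package` is made up for illustration), the same check can be run with whichever Python interpreter you want to inspect, which makes it easier to see whether that interpreter can actually find torch:

```python
import sys
from importlib import metadata


def describe_package(name):
    """Return (version, location) for an installed distribution, or None if absent."""
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    # locate_file("") resolves to the directory the distribution is installed in
    return dist.version, str(dist.locate_file(""))


if __name__ == "__main__":
    # Show which interpreter is doing the lookup, then check each package
    print("interpreter:", sys.executable)
    for pkg in ("torch", "torchvision"):
        print(pkg, "->", describe_package(pkg) or "not found")
```

Save it under any name (for example a hypothetical `check_pkgs.py`) and run it once with the image's default `python` and once with the interpreter the agent ends up using, then compare the two outputs.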
I build my own image on top of pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel
GentleSwallow91 how come it does not already find the correct pytorch version inside the docker? What's the clearml-agent version you are using?
Hi AgitatedDove14
Thanks for the update.
Well, it's a pain... I specifically use a pytorch docker image and the agent will still download it?
My image is built with FROM pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel
And a portion of the agent log on top of that image:
Package(s) not found: torch
Torch CUDA 113 download page found
Found PyTorch version torch==1.11.0 matching CUDA version 113
Package(s) not found: torchvision
Found PyTorch version torchvision==0.12.0 matching CUDA version 113
Collecting torch==1.11.0+cu113
Downloading (1637.0 MB)
So there is no way to change that?
This time it runs smoothly - here's the output:
` Local file not found [torch @ file:///home/testuser/.clearml/pip-download-cache/cu113/torch-1.11.0%2Bcu113-cp39-cp39-linux_x86_64.whl], references removed
Local file not found [torchvision @ file:///home/testuser/.clearml/pip-download-cache/cu113/torchvision-0.12.0%2Bcu113-cp39-cp39-linux_x86_64.whl], references removed
Adding venv into cache: /home/nino/.clearml/venvs-builds/3.9
Running task id [b15553c045ab4c3283bbdb040ec19f1f]:
[src/models]$ /home/testuser/.clearml/venvs-builds/3.9/bin/python -u train.py
Summary - installed python packages:
pip:
...
- torch @ file:///home/testuser/.clearml/pip-download-cache/cu113/torch-1.11.0%2Bcu113-cp39-cp39-linux_x86_64.whl
- torchmetrics==0.8.2
- torchvision @ file:///home/testuser/.clearml/pip-download-cache/cu113/torchvision-0.12.0%2Bcu113-cp39-cp39-linux_x86_64.whl
...
Environment setup completed successfully
Starting Task Execution: `
Woot woot!
Awesome, this RC is stable, so feel free to use it. The official release is probably due out next week :)
Hmm, that is odd, it should have detected it. Can you verify the issue still exists with the latest RC?
pip3 install clearml-agent==1.2.4rc3