Hi @<1523701205467926528:profile|AgitatedDove14> thanks!
I talked with my boss and I was able to install clearml-agent directly on the training machine.
But now, when I try to run an experiment using clearml-agent daemon --gpus 0 --queue default --foreground --docker, it stalls at this point:
Installing collected packages: attrs, rpds-py, zipp, importlib-resources, referencing, jsonschema-specifications, pkgutil-resolve-name, jsonschema, psutil, six, filelock, distlib, platformdirs, virtualenv, python-dateutil, pyjwt, orderedmultidict, furl, urllib3, charset-normalizer, certifi, idna, requests, PyYAML, pyparsing, pathlib2, clearml-agent
Successfully installed PyYAML-6.0.2 attrs-23.2.0 certifi-2024.8.30 charset-normalizer-3.3.2 clearml-agent-1.9.1 distlib-0.3.8 filelock-3.16.0 furl-2.1.3 idna-3.8 importlib-resources-6.4.4 jsonschema-4.23.0 jsonschema-specifications-2023.12.1 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 pkgutil-resolve-name-1.3.10 platformdirs-4.3.2 psutil-5.9.8 pyjwt-2.8.0 pyparsing-3.1.4 python-dateutil-2.8.2 referencing-0.35.1 requests-2.31.0 rpds-py-0.20.0 six-1.16.0 urllib3-1.26.20 virtualenv-20.26.4 zipp-3.20.1
WARNING: You are using pip version 20.1.1; however, version 24.2 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
And it doesn't continue from there.
This is the Dockerfile I created, and it is not working:
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
libgl1-mesa-glx \
libglib2.0-0 \
git \
&& rm -rf /var/lib/apt/lists/*
# Install dependencies
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
# Update CA certificates
COPY hme_root_CA.crt /usr/local/share/ca-certificates/company_root_CA.crt
ENV REQUESTS_CA_BUNDLE=/usr/local/share/ca-certificates/company_root_CA.crt
RUN update-ca-certificates
RUN pip config set global.cert /usr/local/share/ca-certificates/company_root_CA.crt
RUN pip install --trusted-host pypi.python.org --trusted-host files.pythonhosted.org --trusted-host pypi.org --upgrade pip
# Configure ClearML
COPY clearml.conf /root/clearml.conf
RUN clearml-init
# Start ClearML worker
CMD [ "clearml-agent", "daemon", "--queue", "default", "--foreground" ]
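A possible revision of the Dockerfile above, offered as a sketch rather than a verified fix. It trusts the company root CA before any pip call (the original runs pip install -r /tmp/requirements.txt before the certificate is installed). It also points requests and pip (through the PIP_CERT environment variable) at the system bundle that update-ca-certificates rebuilds, so both public and company-signed certificates verify. Finally, it drops RUN clearml-init, which is an interactive setup wizard that would wait for input during the build and is not needed once clearml.conf is copied into /root. The sketch assumes the build runs as root on the Ubuntu-based pytorch image and that requirements.txt includes clearml-agent.

```dockerfile
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime
WORKDIR /app

# System dependencies; ca-certificates provides update-ca-certificates
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx \
    libglib2.0-0 \
    git \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Trust the company root CA before any pip call needs it
COPY hme_root_CA.crt /usr/local/share/ca-certificates/company_root_CA.crt
RUN update-ca-certificates

# Point requests and pip at the rebuilt system bundle (public CAs + company CA)
ENV REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt \
    PIP_CERT=/etc/ssl/certs/ca-certificates.crt

# Python dependencies (assumption: requirements.txt lists clearml-agent)
RUN pip install --no-cache-dir --upgrade pip
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# ClearML reads ~/clearml.conf; the interactive clearml-init is not run at build time
COPY clearml.conf /root/clearml.conf

# Start the ClearML worker
CMD [ "clearml-agent", "daemon", "--queue", "default", "--foreground" ]
```

Running tasks in sibling containers from this worker would additionally require mounting the Docker socket when the container is started.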
I can install clearml and clearml-agent and run the worker inside a Docker container.
Oh I see. You should install it inside a Docker container, then mount the Docker socket so it can spin up sibling containers, and lastly make sure the mounts are correct with this env variable:
None
Hi @<1739818374189289472:profile|SourSpider22>
What are you trying to install, just the agent? If so, pip install clearml-agent is all you need.
Hi @<1523701205467926528:profile|AgitatedDove14> in the end we made it work.
It shows a lot of warnings, but it can run the experiments hahaha
Thanks for your help
sdk.apply_environment = false
sdk.apply_files = false
Executing task id [6e73e0db9cb14aa8b91e0a5439a5aac0]:
repository =
branch =
version_num =
tag =
docker_cmd =
entry_point = train_clearml.py
working_dir = .
created virtual environment CPython3.10.13.final.0-64 in 151ms
creator CPython3Posix(dest=/root/.clearml/venvs-builds/3.10, clear=False, no_vcs_ignore=False, global=True)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
added seed packages: pip==24.1, setuptools==70.1.0, wheel==0.43.0
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
Ignoring pip: markers 'python_version < "3.10"' don't match your environment
Ignoring pip: markers 'python_version >= "3.12"' don't match your environment
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))': /simple/pip/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))': /simple/pip/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))': /simple/pip/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))': /simple/pip/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))': /simple/pip/
Could not fetch URL None : There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)'))) - skipping
ERROR: Could not find a version that satisfies the requirement pip<22.3 (from versions: none)
ERROR: No matching distribution found for pip<22.3
clearml_agent: ERROR: Command '['/root/.clearml/venvs-builds/3.10/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', "pip<20.2 ; python_version < '3.10'", "pip<22.3 ; python_version >= '3.10' and python_version <= '3.11'", "pip>=23,<24.3 ; python_version >= '3.12'", '--upgrade']' returned non-zero exit status 1.
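The retries above come from pip inside the virtual environment that the agent builds in the task container (/root/.clearml/venvs-builds/3.10), which presumably does not trust the company root CA presented on the way to pypi.org. One possible fix, sketched here as an assumption rather than a verified setup, is a dedicated task image that trusts the CA and exports the bundle path, so any pip started in the container inherits it. The image tag company/pytorch-ca is hypothetical.

```dockerfile
# Hypothetical task image: the pytorch base plus the company root CA
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime

# ca-certificates provides update-ca-certificates
RUN apt-get update && apt-get install -y ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Add the company root CA to the system bundle
COPY hme_root_CA.crt /usr/local/share/ca-certificates/company_root_CA.crt
RUN update-ca-certificates

# ENV values are visible to every process in the container, including the
# pip that the agent runs inside its venv (assumption: nothing overrides them)
ENV PIP_CERT=/etc/ssl/certs/ca-certificates.crt \
    REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
```

The image would be built with docker build -t company/pytorch-ca . and passed as the agent's default image, e.g. clearml-agent daemon --gpus 0 --queue default --docker company/pytorch-ca. Since docker_cmd is empty in the log above, the task itself does not request a different image.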
Hi @<1739818374189289472:profile|SourSpider22>
Could you send the entire console log? Maybe there is a hint somewhere in there.
(Basically, what happens after that is that the agent is supposed to start running from inside the container, but maybe it cannot access the clearml-server for some reason.)