I'm having a hard time with git cloning + cache for a private repo accessed via personal access token. This happens 100% of the time, across both bitbucket + github.
I have a simple "hello world" task in a private repo.
The worker is running in a docker container called worker, built from this Dockerfile:
FROM python:3.10.10
RUN useradd -u 1000 -ms /bin/bash user
RUN apt-get update \
    && apt-get install -yqq \
        graphviz \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/
RUN pip install clearml-agent # optional
WORKDIR /home/user
ADD entrypoint.sh /home/user/entrypoint.sh
RUN chmod +x /home/user/entrypoint.sh
RUN chown user:user /home/user/entrypoint.sh
USER user
ENV PATH=/home/user/.local/bin:$PATH
CMD "./entrypoint.sh"
where entrypoint.sh is a modified version of the default one from agent-services:
#!/bin/sh +x
if [ -n "$SHUTDOWN_IF_NO_ACCESS_KEY" ] && [ -z "$CLEARML_API_ACCESS_KEY" ] && [ -z "$TRAINS_API_ACCESS_KEY" ]; then
    echo "CLEARML_API_ACCESS_KEY was not provided, service will not be started"
    exit 0
fi
export CLEARML_WORKER_ID=${CLEARML_WORKER_ID:-$HOSTNAME}
export CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-$TRAINS_FILES_HOST}
if [ -z "$CLEARML_FILES_HOST" ]; then
    CLEARML_HOST_IP=${CLEARML_HOST_IP:-${TRAINS_HOST_IP:-$(curl -s
fi
export CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-${TRAINS_FILES_HOST:-"http://$CLEARML_HOST_IP:8081"}}
export CLEARML_WEB_HOST=${CLEARML_WEB_HOST:-${TRAINS_WEB_HOST:-"http://$CLEARML_HOST_IP:8080"}}
export CLEARML_API_HOST=${CLEARML_API_HOST:-${TRAINS_API_HOST:-"http://$CLEARML_HOST_IP:8008"}}
echo $CLEARML_FILES_HOST $CLEARML_WEB_HOST $CLEARML_API_HOST 1>&2
# DAEMON_OPTIONS=${CLEARML_AGENT_DAEMON_OPTIONS:---services-mode --create-queue}
DAEMON_OPTIONS=""
QUEUES=${CLEARML_AGENT_QUEUES:-services}
if [ -z "$CLEARML_AGENT_NO_UPDATE" ]; then
    if [ -n "$CLEARML_AGENT_UPDATE_REPO" ]; then
        python3 -m pip install -q -U $CLEARML_AGENT_UPDATE_REPO
    else
        python3 -m pip install -q -U "clearml-agent${CLEARML_AGENT_UPDATE_VERSION:-$TRAINS_AGENT_UPDATE_VERSION}"
    fi
fi
clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}
docker-compose.yml
(note: no volume mounts, so a new container starts from a completely fresh state)
version: "3.6"
x-worker_template: &worker_defaults
  image: worker
  cpu_count: 2
  deploy:
    restart_policy:
      condition: on-failure
  privileged: true
  env_file: .env
services:
  worker_01:
    <<: *worker_defaults
    container_name: worker01
    environment:
      CLEARML_WORKER_ID: "01hn23k9rr7zysp3scjbwhrppg-worker-01"
In default worker mode this is what happens:
(first execution): clones the repo just fine and happily completes the task
(second execution): always throws the following error, because it's now trying to use the vcs-cache
repository = git@github.com:michael-build/nucleus-clearml.git
branch = main
version_num =
tag =
docker_cmd = python:3.10.10 --env-file=/root/.clearml/.env
entry_point = task_hello_world.py
working_dir = tasks
::: Using Cached environment /home/user/.clearml/venvs-cache/a61d870d71a2b3c4ca7f2a5a617a1242 :::
Using cached repository in "/home/user/.clearml/vcs-cache/nucleus-clearml.git.7a0bc5a5f52a1660a796b73c0d9ca015/nucleus-clearml.git"
fatal: could not read Username for '
': terminal prompts disabled
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository='git@github.com:michael-build/nucleus-clearml.git', branch='main', commit_id='', tag='', docker_cmd='python:3.10.10 --env-file=/root/.clearml/.env', entry_point='task_hello_world.py', working_dir='tasks')
2) Check if remote-worker has valid credentials [see worker configuration file]
The credentials are definitely valid, and the Task (in the web UI) points to "Latest commit in main branch". Again, this happens consistently with both bitbucket and github, so it appears to be related to git entirely rather than to either host.
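In case it helps narrow this down, here is a minimal sketch of how the cached clone could be inspected from inside the worker container. It is a diagnostic under assumptions, not a fix: the helper names (url_has_credentials, mask_credentials) are hypothetical, and the cache path is simply the one printed in the log above. It reads the origin URL of the cached repository, reports whether that URL carries inline credentials, and then re-runs the failing git fetch with terminal prompts disabled.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if an http(s) remote URL has a "user[:token]@" part.
url_has_credentials() {
    case "$1" in
        http://*|https://*) ;;
        *) return 1 ;;            # scp-style or ssh:// remotes are not checked here
    esac
    rest=${1#*://}                # drop the scheme
    authority=${rest%%/*}         # keep everything up to the first "/"
    case "$authority" in
        *@*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: prints the URL with any inline credentials masked.
mask_credentials() {
    printf '%s\n' "$1" | sed -E 's#^(https?://)[^/@]*@#\1***@#'
}

# Assumption: the cache path is the one shown in the agent log.
cache_repo="/home/user/.clearml/vcs-cache/nucleus-clearml.git.7a0bc5a5f52a1660a796b73c0d9ca015/nucleus-clearml.git"

if [ -d "$cache_repo" ]; then
    origin=$(git -C "$cache_repo" remote get-url origin)
    echo "cached origin: $(mask_credentials "$origin")"
    if url_has_credentials "$origin"; then
        echo "origin URL carries inline credentials"
    else
        echo "origin URL has no inline credentials, so git needs a credential helper or a prompt"
    fi
    # Same command as in the log, with prompts disabled so it fails fast instead of hanging.
    GIT_TERMINAL_PROMPT=0 git -C "$cache_repo" fetch --all --recurse-submodules \
        || echo "fetch from the cached clone failed" 1>&2
else
    echo "no cached repository at $cache_repo"
fi
```

If the cached origin turns out to be a plain https URL without credentials, that would at least be consistent with the "could not read Username" error on the second run, though I have not confirmed that this is what the agent does.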