I'M Having A Hard Time With Git Cloning + Cache For A Private Repo Accessed Via Personal Access Token. This Happens 100% Of The Time, Across Both Bitbucket + Github. I Have A Simple "Hello World" Task In A Private Repo. The Worker Is Running In A Docker

Answered

I'm having a hard time with git cloning + cache for a private repo accessed via a personal access token. This happens 100% of the time, across both Bitbucket and GitHub.

I have a simple "hello world" task in a private repo.
The worker is running in a Docker container called worker, built from this Dockerfile:

FROM python:3.10.10
RUN useradd -u 1000 -ms /bin/bash user
RUN apt-get update \
    && apt-get install -yqq \
	   graphviz \
	&& apt-get clean \
	&& rm -rf /var/lib/apt/lists/

RUN pip install clearml-agent  # optional
WORKDIR /home/user

ADD entrypoint.sh /home/user/entrypoint.sh
RUN chmod +x /home/user/entrypoint.sh
RUN chown user:user /home/user/entrypoint.sh
USER user
ENV PATH=/home/user/.local/bin:$PATH
CMD "./entrypoint.sh"

where entrypoint.sh is a modified version of the default one from agent-services:

#!/bin/sh +x

if [ -n "$SHUTDOWN_IF_NO_ACCESS_KEY" ] && [ -z "$CLEARML_API_ACCESS_KEY" ] && [ -z "$TRAINS_API_ACCESS_KEY" ]; then
  echo "CLEARML_API_ACCESS_KEY was not provided, service will not be started"
  exit 0
fi

export CLEARML_WORKER_ID=${CLEARML_WORKER_ID:-$HOSTNAME}
export CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-$TRAINS_FILES_HOST}

if [ -z "$CLEARML_FILES_HOST" ]; then
    CLEARML_HOST_IP=${CLEARML_HOST_IP:-${TRAINS_HOST_IP:-$(curl -s


fi

export CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-${TRAINS_FILES_HOST:-"http://$CLEARML_HOST_IP:8081"}}
export CLEARML_WEB_HOST=${CLEARML_WEB_HOST:-${TRAINS_WEB_HOST:-"http://$CLEARML_HOST_IP:8080"}}
export CLEARML_API_HOST=${CLEARML_API_HOST:-${TRAINS_API_HOST:-"http://$CLEARML_HOST_IP:8008"}}

echo $CLEARML_FILES_HOST $CLEARML_WEB_HOST $CLEARML_API_HOST 1>&2

# DAEMON_OPTIONS=${CLEARML_AGENT_DAEMON_OPTIONS:---services-mode --create-queue}
DAEMON_OPTIONS=""
QUEUES=${CLEARML_AGENT_QUEUES:-services}

if [ -z "$CLEARML_AGENT_NO_UPDATE" ]; then
  if [ -n "$CLEARML_AGENT_UPDATE_REPO" ]; then
    python3 -m pip install -q -U $CLEARML_AGENT_UPDATE_REPO
  else
    python3 -m pip install -q -U "clearml-agent${CLEARML_AGENT_UPDATE_VERSION:-$TRAINS_AGENT_UPDATE_VERSION}"
  fi
fi

clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}

docker-compose.yml
Note: there are no volume mounts, so a new container starts from a completely fresh state.

version: "3.6"

x-worker_template: &worker_defaults
  image: worker
  cpu_count: 2
  deploy:
    restart_policy:
      condition: on-failure
  privileged: true
  env_file: .env

services:
  worker_01:
    <<: *worker_defaults
    container_name: worker01
    environment:
      CLEARML_WORKER_ID: "01hn23k9rr7zysp3scjbwhrppg-worker-01"

In default worker mode this is what happens:
(first execution): clones the repo just fine and happily completes the task
(second execution): always throws the following error, because it's now trying to use the vcs-cache

repository = git@github.com:michael-build/nucleus-clearml.git
branch = main
version_num = 
tag = 
docker_cmd = python:3.10.10 --env-file=/root/.clearml/.env
entry_point = task_hello_world.py
working_dir = tasks
::: Using Cached environment /home/user/.clearml/venvs-cache/a61d870d71a2b3c4ca7f2a5a617a1242 :::
Using cached repository in "/home/user/.clearml/vcs-cache/nucleus-clearml.git.7a0bc5a5f52a1660a796b73c0d9ca015/nucleus-clearml.git"
fatal: could not read Username for '

': terminal prompts disabled
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository. 
1) Make sure you pushed the requested commit:
(repository='git@github.com:michael-build/nucleus-clearml.git', branch='main', commit_id='', tag='', docker_cmd='python:3.10.10 --env-file=/root/.clearml/.env', entry_point='task_hello_world.py', working_dir='tasks')
2) Check if remote-worker has valid credentials [see worker configuration file]

The credentials are definitely valid, and the Task (in the web UI) points to "Latest commit in main branch". Again, this happens consistently with both Bitbucket and GitHub, so it appears to be related to git itself rather than to either hosting provider.

  				
Posted 11 months ago by SmallTurkey79


Answers 14

The clone is the default used by git (you can actually see the command in the log)

  				
Posted 10 months ago by SuccessfulKoala55

yes, i have actually been able to turn on caching after rc2 of the agent! it's been working much better.

  				
Posted 10 months ago by SmallTurkey79

I can see agent.vcs_cache.enabled = true as a printout in the Console, but cannot find docs on how to set this via an environment variable. I'm trying to keep these containers from needing a clearml.conf file (though I can generate one in the entrypoint script with a heredoc (<<EOF) if need be).
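
A minimal sketch of the heredoc fallback mentioned above, in case a generated config file is acceptable after all. The helper name write_clearml_conf and the scratch directory are hypothetical and only there for illustration; the single config line is the one quoted elsewhere in this thread (agent.vcs_cache.enabled: false).

```shell
#!/bin/sh
# Hypothetical helper: write a minimal config file that disables the VCS cache.
write_clearml_conf() {
  conf_path="$1"
  # The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
  cat > "$conf_path" <<'EOF'
agent.vcs_cache.enabled: false
EOF
}

# Demo only: write to a scratch directory and print the result.
demo_dir="$(mktemp -d)"
write_clearml_conf "$demo_dir/clearml.conf"
cat "$demo_dir/clearml.conf"
```

In a real entrypoint the function would be called with whatever path the agent actually reads its configuration from, before the clearml-agent daemon line; that path depends on your setup and is not something this sketch decides.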

  				
Posted 11 months ago by SmallTurkey79

and for what it's worth, it seems I don't have anything special for agent cloning

i did find agent.vcs_cache.clone_on_pull_fail to be helpful. but yeah, updating the agent was the biggest fix

  				
Posted 9 months ago by SmallTurkey79

i ended up pinning the Dockerfile instruction to 1.18, but before that was letting the entrypoint script do the install (so, latest).

much appreciate the env var tip. that's more elegant than what i did.

since I've turned off caching I've had much better luck. is what I'm experiencing a bug? (neither bitbucket nor github private repositories work on the second task per worker)

  				
Posted 11 months ago by SmallTurkey79

so far it seems that turning off cache like this is my "best option"

  				
Posted 11 months ago by SmallTurkey79

SmallTurkey79 did you solve this issue with fatal: could not read Username?

  				
Posted 9 months ago by MinuteMouse44

BTW a new agent version has been released, I'd recommend trying it out

  				
Posted 10 months ago by SuccessfulKoala55

Hi SmallTurkey79 , indeed, you can turn it off by passing this configuration in the config file ( agent.vcs_cache.enabled: false will also work). By using dynamic env vars, you can also use this env var to set the same value: CLEARML_AGENT__AGENT__VCS_CACHE__ENABLED=false (see here for more details)
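
A minimal sketch of how this could look in the worker's entrypoint script. It assumes the naming pattern implied by the example above (the dotted config key, upper-cased, with each "." replaced by "__" and prefixed with CLEARML_AGENT__); the helper name clearml_dynamic_env_name is hypothetical and not part of clearml-agent.

```shell
#!/bin/sh
# Hypothetical helper: map a dotted config key to the dynamic env var name,
# following the pattern of the example in this thread
# (agent.vcs_cache.enabled -> CLEARML_AGENT__AGENT__VCS_CACHE__ENABLED).
clearml_dynamic_env_name() {
  # Upper-case the key, then turn every "." into "__".
  printf 'CLEARML_AGENT__%s\n' \
    "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | sed 's/\./__/g')"
}

# Disable the VCS cache for the agent that the entrypoint starts later.
export "$(clearml_dynamic_env_name agent.vcs_cache.enabled)=false"
```

Writing the variable out literally (export CLEARML_AGENT__AGENT__VCS_CACHE__ENABLED=false) is equivalent and simpler; the helper only makes the mapping explicit. The same variable could also go in the .env file that docker-compose.yml already loads.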

  				
Posted 11 months ago by SuccessfulKoala55

Okay, thank you so much.
I think I solved the problem with credentials by using clearml_agent v1.8.1rc2.
But now I get an issue with local python modules 🫠
Even when I set

agent.skip_pip_venv_install = 1
agent.skip_python_env_install = /usr/bin/python

In worker logs I see:

Environment setup completed successfully
Starting Task Execution:

  				
Posted 9 months ago by MinuteMouse44

yeah, i ended up figuring it out. i think we are in similar situations (private git repo with a token). i'll take a look at my config tomorrow, but from memory, you have to set your env variables and have an option in your config to force the https protocol if you're using a token.
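
To illustrate why the protocol matters here: the task log in the question shows an SSH-style remote (git@github.com:michael-build/nucleus-clearml.git), while a personal access token is used over HTTPS. The sketch below only shows the URL rewrite itself; ssh_to_https_url is a hypothetical helper, not the agent's own implementation or a config option.

```shell
#!/bin/sh
# Hypothetical helper: rewrite an SSH-style remote (git@host:owner/repo.git)
# as an HTTPS remote; any other URL is passed through unchanged.
ssh_to_https_url() {
  printf '%s\n' "$1" | sed -E 's#^git@([^:]+):(.+)$#https://\1/\2#'
}

# The repository from the task log in the question.
ssh_to_https_url "git@github.com:michael-build/nucleus-clearml.git"
```

For the repository above this prints https://github.com/michael-build/nucleus-clearml.git, which is the form a token-based login can authenticate against.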

  				
Posted 9 months ago by SmallTurkey79

update: ever since turning off git caching, i've had much more stability. i cannot tell whether it's causing a slowdown in task execution, though. is the clone a shallow one by default?

  				
Posted 10 months ago by SmallTurkey79

so, i got around this with env vars.

in my worker entrypoint script, I do:

export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)

  				
Posted 9 months ago by SmallTurkey79

By the way, which agent version are you using? Can you include the complete task log?

  				
Posted 11 months ago by SuccessfulKoala55
