Answered

I'm having a hard time with git cloning + cache for a private repo accessed via personal access token. This happens 100% of the time, across both bitbucket + github.

I have a simple "hello world" task in a private repo.
The worker is running in a docker container called worker built from this Dockerfile :

FROM python:3.10.10
RUN useradd -u 1000 -ms /bin/bash user
RUN apt-get update \
    && apt-get install -yqq \
       graphviz \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/

RUN pip install clearml-agent  # optional
WORKDIR /home/user

ADD entrypoint.sh /home/user/entrypoint.sh
RUN chmod +x /home/user/entrypoint.sh
RUN chown user:user /home/user/entrypoint.sh
USER user
ENV PATH=/home/user/.local/bin:$PATH
CMD "./entrypoint.sh"

where entrypoint.sh is a modified version of the default one from agent-services:

#!/bin/sh +x

if [ -n "$SHUTDOWN_IF_NO_ACCESS_KEY" ] && [ -z "$CLEARML_API_ACCESS_KEY" ] && [ -z "$TRAINS_API_ACCESS_KEY" ]; then
  echo "CLEARML_API_ACCESS_KEY was not provided, service will not be started"
  exit 0
fi

export CLEARML_WORKER_ID=${CLEARML_WORKER_ID:-$HOSTNAME}
export CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-$TRAINS_FILES_HOST}

if [ -z "$CLEARML_FILES_HOST" ]; then
    CLEARML_HOST_IP=${CLEARML_HOST_IP:-${TRAINS_HOST_IP:-$(curl -s 

fi

export CLEARML_FILES_HOST=${CLEARML_FILES_HOST:-${TRAINS_FILES_HOST:-"http://$CLEARML_HOST_IP:8081"}}
export CLEARML_WEB_HOST=${CLEARML_WEB_HOST:-${TRAINS_WEB_HOST:-"http://$CLEARML_HOST_IP:8080"}}
export CLEARML_API_HOST=${CLEARML_API_HOST:-${TRAINS_API_HOST:-"http://$CLEARML_HOST_IP:8008"}}

echo $CLEARML_FILES_HOST $CLEARML_WEB_HOST $CLEARML_API_HOST 1>&2

# DAEMON_OPTIONS=${CLEARML_AGENT_DAEMON_OPTIONS:---services-mode --create-queue}
DAEMON_OPTIONS=""
QUEUES=${CLEARML_AGENT_QUEUES:-services}

if [ -z "$CLEARML_AGENT_NO_UPDATE" ]; then
  if [ -n "$CLEARML_AGENT_UPDATE_REPO" ]; then
    python3 -m pip install -q -U $CLEARML_AGENT_UPDATE_REPO
  else
    python3 -m pip install -q -U "clearml-agent${CLEARML_AGENT_UPDATE_VERSION:-$TRAINS_AGENT_UPDATE_VERSION}"
  fi
fi

clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}

docker-compose.yml
notice: no volume mounts. new container = completely fresh state

version: "3.6"

x-worker_template: &worker_defaults
  image: worker
  cpu_count: 2
  deploy:
    restart_policy:
      condition: on-failure
  privileged: true
  env_file: .env

services:
  worker_01:
    <<: *worker_defaults
    container_name: worker01
    environment:
      CLEARML_WORKER_ID: "01hn23k9rr7zysp3scjbwhrppg-worker-01"

in default worker mode this is what happens:
(first execution): clones repo just fine, happily completes task
(second execution): always throws the following error because it's now trying to use the vcs-cache

repository = git@github.com:michael-build/nucleus-clearml.git
branch = main
version_num = 
tag = 
docker_cmd = python:3.10.10 --env-file=/root/.clearml/.env
entry_point = task_hello_world.py
working_dir = tasks
::: Using Cached environment /home/user/.clearml/venvs-cache/a61d870d71a2b3c4ca7f2a5a617a1242 :::
Using cached repository in "/home/user/.clearml/vcs-cache/nucleus-clearml.git.7a0bc5a5f52a1660a796b73c0d9ca015/nucleus-clearml.git"
fatal: could not read Username for '
': terminal prompts disabled
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository. 
1) Make sure you pushed the requested commit:
(repository='git@github.com:michael-build/nucleus-clearml.git', branch='main', commit_id='', tag='', docker_cmd='python:3.10.10 --env-file=/root/.clearml/.env', entry_point='task_hello_world.py', working_dir='tasks')
2) Check if remote-worker has valid credentials [see worker configuration file]

the credentials are definitely valid, and the Task (in web UI) points to "Latest commit in main branch". Again, this happens consistently with both bitbucket and github, so it appears to be a git-level issue rather than anything specific to either host.
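
One thing that may be worth checking here: git prints "could not read Username ... terminal prompts disabled" when an HTTP(S) remote needs credentials and prompting is turned off, while the task's repository is given in SSH form (git@github.com:...). The sketch below classifies a remote URL by transport so the URL the cached clone actually fetches from can be compared with the one on the task. The function name remote_kind and the CACHED_REPO variable are hypothetical; git remote get-url is a standard git command.

```shell
#!/bin/sh
# Classify a git remote URL by transport (hypothetical helper, not part of clearml-agent).
remote_kind() {
    case "$1" in
        https://*|http://*) echo https ;;   # credentials go through git's credential prompt/helper
        ssh://*|*@*:*)      echo ssh ;;     # ssh:// URLs and scp-style user@host:path
        *)                  echo other ;;
    esac
}

# Usage inside the worker container (CACHED_REPO is the path printed after
# "Using cached repository in" in the task log):
#   remote_kind "$(git -C "$CACHED_REPO" remote get-url origin)"
```

If this reports https for the cached clone, the fetch depends on the token being available to git at that point, which would be consistent with the first (fresh) clone succeeding and the cached fetch failing.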

  
  
Posted 7 months ago

Answers 14


yes, i've actually been able to turn caching back on after rc2 of the agent! it's been working much better.

  
  
Posted 7 months ago

I can see agent.vcs_cache.enabled = true as a printout in the Console, but cannot find docs on how to set this via environment variable. I'm trying to keep these containers from needing a clearml.conf file (though I can generate one in the entrypoint script with a heredoc, <<EOF, if need be).
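
For the fallback mentioned here, a minimal sketch of generating the config from the entrypoint script with a heredoc. The function name write_agent_conf is hypothetical, and it is an assumption that a file containing only this agent section is sufficient when the API host and keys are already supplied through environment variables, as in the entrypoint above.

```shell
#!/bin/sh
# Write a minimal clearml.conf that disables the VCS cache (hypothetical helper).
write_agent_conf() {
    conf_path="$1"
    # Quoted delimiter: the body is written literally, with no variable expansion.
    cat > "$conf_path" <<'EOF'
agent {
    vcs_cache {
        enabled: false
    }
}
EOF
}

# Usage in entrypoint.sh, before starting the daemon
# (~/clearml.conf is the default location the agent reads):
#   write_agent_conf "$HOME/clearml.conf"
```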

  
  
Posted 7 months ago

@<1689446563463565312:profile|SmallTurkey79> did you solve this issue with fatal: could not read Username ?

  
  
Posted 5 months ago

Okay, thank you so much.
I think I solved the credentials problem by using clearml_agent v1.8.1rc2.
But now I get an issue with local python modules 🫠
Even when I set

agent.skip_pip_venv_install = 1
agent.skip_python_env_install = /usr/bin/python

In worker logs I see:

Environment setup completed successfully
Starting Task Execution:
  
  
Posted 5 months ago

yeah, i ended up figuring it out. i think we are in similar situations (private git repo with a token). i'll take a look at my config tomorrow, but from memory, you have to set your env variables and have an option in your config to force the https protocol if you're using a token.
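
A minimal sketch of the environment-variable half of this, assuming the token is passed through CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS (the agent's environment overrides for agent.git_user and agent.git_pass). The helper name check_git_credentials is hypothetical. The config option for forcing https is not named in this thread, so it is left out rather than guessed.

```shell
#!/bin/sh
# Fail fast when the token-based git credentials are missing (hypothetical helper).
check_git_credentials() {
    if [ -z "${CLEARML_AGENT_GIT_USER:-}" ] || [ -z "${CLEARML_AGENT_GIT_PASS:-}" ]; then
        echo "CLEARML_AGENT_GIT_USER / CLEARML_AGENT_GIT_PASS not set" 1>&2
        return 1
    fi
    return 0
}

# Usage in entrypoint.sh, with both variables supplied via the .env file
# (git username and personal access token / app password):
#   check_git_credentials || exit 1
```

Checking before the daemon starts turns a missing token into an immediate container failure instead of a clone error on the first queued task.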

  
  
Posted 5 months ago

By the way, which agent version are you using? Can you include the complete task log?

  
  
Posted 7 months ago

so, i got around this with env vars

in my worker entrypoint script, I do

export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=$(which python)

  
  
Posted 5 months ago

so far it seems that turning off cache like this is my "best option"
image

  
  
Posted 7 months ago

Hi @<1689446563463565312:profile|SmallTurkey79> , indeed, you can turn it off by passing this configuration in the config file ( agent.vcs_cache.enabled: false will also work). By using dynamic env vars, you can also use this env var to set the same value: CLEARML_AGENT__AGENT__VCS_CACHE__ENABLED=false (see here for more details)
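
The naming pattern visible in this answer (agent.vcs_cache.enabled becomes CLEARML_AGENT__AGENT__VCS_CACHE__ENABLED) can be sketched as a small helper for the entrypoint script. The function name conf_key_to_env is hypothetical, and the mapping is inferred from the single example given here: prefix CLEARML_AGENT__, dots replaced by double underscores, upper-cased.

```shell
#!/bin/sh
# Map a dotted agent config key to its dynamic env var name (hypothetical helper).
conf_key_to_env() {
    # agent.vcs_cache.enabled -> AGENT__VCS_CACHE__ENABLED, then add the prefix
    suffix="$(printf '%s' "$1" | sed 's/\./__/g' | tr '[:lower:]' '[:upper:]')"
    printf 'CLEARML_AGENT__%s\n' "$suffix"
}

# Disable the VCS cache without a clearml.conf file.
export "$(conf_key_to_env agent.vcs_cache.enabled)=false"
```

In practice the variable can simply be written out by hand in the .env file; the helper only makes the convention explicit.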

  
  
Posted 7 months ago

The clone is the default used by git (you can actually see the command in the log)

  
  
Posted 7 months ago

BTW a new agent version has been released, I'd recommend trying it out

  
  
Posted 7 months ago

and for what it's worth, it seems I don't have anything special for agent cloning

i did find agent.vcs_cache.clone_on_pull_fail to be helpful. but yeah, updating the agent was the biggest fix

  
  
Posted 5 months ago

update: ever since turning off git caching, i've had much more stability. i cannot tell whether it's causing a slowdown in task execution though - is the clone a shallow one by default?

  
  
Posted 7 months ago

i ended up pinning the Dockerfile instruction to 1.18, but before that was letting the entrypoint script do the install (so, latest).

much appreciated on the env var tip. that's more elegant than what i did.

since I've turned off caching I've had much better luck. is what I'm experiencing a bug? (neither bitbucket nor github private repositories work on the second task per worker)

  
  
Posted 7 months ago