I Have a Weird Issue With ClearML Agent: When Queuing a Job for a Second Time on the Same Agent, It Get

Executing task id [43cc0c9e1f794f53a148bde3fff03cc9]:
repository = git@github.com:REDATED.git
branch = main
version_num = f2e50dce4c13dace2adf19cbe1ed3368f8c60fe8
tag = 
docker_cmd = 
entry_point = train_ic.py
working_dir = modelreproduce
Python interpreter /usr/bin/python3.10 is set from environment var
Using cached repository in "/root/.clearml/vcs-cache/REDATED.git.66005d42a25ba4528e645d153dd73cb3/REDATED.git"
fatal: could not read Username for '
': terminal prompts disabled
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository. 
1) Make sure you pushed the requested commit:
(repository='git@github.com:REDATED.git', branch='main', commit_id='f2e50dce4c13dace2adf19cbe1ed3368f8c60fe8', tag='', docker_cmd=None, entry_point='train_ic.py', working_dir='modelreproduce')
2) Check if remote-worker has valid credentials [see worker configuration file]

Our ClearML agent is running inside a container.
We use a PAT to authenticate.
If we clear the cache folder /root/.clearml/vcs-cache/, it works again by simply doing a normal clone with credentials via the PAT.
It feels like, when using the cache folder, the agent expects the git auth to be saved/cached, but for some reason it is not.
clearml_agent v1.7.0
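
For anyone who wants to script the workaround, here is a minimal sketch of clearing the cache folder before requeuing. The default path is the one from the log above; the function name `clear_vcs_cache` is just something I made up for illustration, not part of clearml-agent. It assumes the agent is not in the middle of cloning while it runs.

```python
import shutil
from pathlib import Path

# Path taken from the agent log above; adjust it if your agent caches elsewhere.
DEFAULT_VCS_CACHE = Path("/root/.clearml/vcs-cache")


def clear_vcs_cache(cache_dir: Path = DEFAULT_VCS_CACHE) -> int:
    """Delete every entry under cache_dir and return how many were removed.

    Hypothetical helper: it only empties the folder so that the next task
    falls back to a fresh clone, as described in the post.
    """
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # cached repository directory
        else:
            entry.unlink()  # stray file or symlink
        removed += 1
    return removed


if __name__ == "__main__":
    print(f"Removed {clear_vcs_cache()} cached entries")
```

The folder itself is left in place and only its contents are removed, so the agent does not have to recreate it.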

Posted 2 months ago
will do

Posted 2 months ago

Hi @<1576381444509405184:profile|ManiacalLizard2> , we might have identified the specific issue, and a new agent version, v1.8.1rc2, is out with a fix. Can you please try it out and let us know if it resolves the issue?

Posted 2 months ago

Hi @<1576381444509405184:profile|ManiacalLizard2> , can you try with the latest agent, v1.8.0, and make sure the cache is cleaned before you try again?

Posted 2 months ago

I will try it. But it's a bit random when this happens, so we will see.

Posted 2 months ago

@<1523701087100473344:profile|SuccessfulKoala55> I can confirm that v1.8.1rc2 fixed the issue in our case. I managed to reproduce it:

  • Make a local commit without pushing it
  • Create a task and queue it
  • The queued task fails, as expected, because the commit exists only locally
  • Push the local commit
  • Requeue the task
  • Expected: the task succeeds now that the commit is available. Actual: it fails, as the VCS cache seems to be left in a weird state by the previous failure
  • Now with v1.8.1rc2 the issue is solved
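
In case it helps others check their setup, the sequence above can be approximated at the plain git level, without clearml-agent. This is only a sketch: the throwaway `cache` clone stands in for the agent's vcs-cache entry, and `simulate_requeue` and `has_commit` are hypothetical names. It does not trigger the agent bug; it only shows what a healthy cached repository does when the commit is pushed between two fetches.

```python
import subprocess
import tempfile
from pathlib import Path

# Identity and signing overrides so the sketch does not depend on global git config.
GIT = ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.com",
       "-c", "commit.gpgsign=false"]


def git(*args: str, cwd: Path) -> subprocess.CompletedProcess:
    """Run a git command in cwd and raise if it exits non-zero."""
    return subprocess.run([*GIT, *args], cwd=cwd, check=True,
                          capture_output=True, text=True)


def has_commit(repo: Path, sha: str) -> bool:
    """Return True if the commit object is present in repo."""
    result = subprocess.run(["git", "cat-file", "-e", f"{sha}^{{commit}}"],
                            cwd=repo, capture_output=True)
    return result.returncode == 0


def simulate_requeue() -> tuple[bool, bool]:
    """Return (commit visible on first run, commit visible after push and refetch)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        remote, work, cache = root / "remote.git", root / "work", root / "cache"
        git("init", "--bare", str(remote), cwd=root)
        git("clone", str(remote), str(work), cwd=root)
        git("commit", "--allow-empty", "-m", "initial", cwd=work)
        git("push", "origin", "HEAD", cwd=work)
        git("clone", str(remote), str(cache), cwd=root)  # stand-in for the vcs-cache

        # Local commit that is not pushed yet.
        git("commit", "--allow-empty", "-m", "local only", cwd=work)
        sha = git("rev-parse", "HEAD", cwd=work).stdout.strip()

        # First queued run: the cached repo fetches, but the commit is not on the remote.
        git("fetch", "--all", cwd=cache)
        first_run = has_commit(cache, sha)

        # Push the commit, then "requeue": the same cached repo fetches again.
        git("push", "origin", "HEAD", cwd=work)
        git("fetch", "--all", cwd=cache)
        second_run = has_commit(cache, sha)
        return first_run, second_run


if __name__ == "__main__":
    print(simulate_requeue())
```

With a working git install I would expect this to return `(False, True)`: the commit is missing on the first fetch and present after the push. That matches the behavior I expected from the agent on requeue.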
Posted one month ago
