I confirm that I can successfully clone the repo from a newly created shell directly on the clearml agent server using the url it printed in the logs:
_company.com:1234/our_gitlab/our_repo.git
Thanks. Make sure to delete the agent's VCS cache before trying again.
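A minimal sketch of clearing that cache, assuming it lives in the default location under the agent user's home directory (the failing clone command in this thread writes to /home/ec2-user/.clearml/vcs-cache). The helper name `clear_vcs_cache` is hypothetical, and the path may differ if the agent configuration overrides it:

```python
import shutil
from pathlib import Path

# Assumed default location; adjust if your agent configuration points elsewhere.
DEFAULT_VCS_CACHE = Path.home() / ".clearml" / "vcs-cache"


def clear_vcs_cache(cache_dir: Path = DEFAULT_VCS_CACHE) -> bool:
    """Delete the agent's VCS cache directory.

    Returns True if a directory was removed, False if there was nothing to delete.
    """
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False


if __name__ == "__main__":
    removed = clear_vcs_cache()
    print("cache removed" if removed else "no cache directory found")
```

Run it as the same user that runs the agent (ec2-user here), so that `Path.home()` resolves to the right home directory.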
I'm using a self-hosted instance of clearml, running on AWS using the AMI clearml-server-1.13.0-414-117
@<1658281099807166464:profile|SmallCamel52> which agent version are you using?
@<1523701087100473344:profile|SuccessfulKoala55> clearml-agent version is 1.6.1
@<1523701087100473344:profile|SuccessfulKoala55> Thank you for your advice. I have updated the clearml agent version to 1.7, cleared the cache, forced the server port (it wasn't 22), and forced the ssh user to git. The error has changed slightly:
cloning:
_company.com:1234/our_gitlab/our_repo.git
Using SSH credentials - ssh url '_company.com:1234/our_gitlab/our_repo.git' with ssh url '_company.com:1234/our_gitlab/our_repo.git'
git@our_tools.our_company.com: Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.
Repository cloning failed: Command '['clone', '_company.com:1234/our_gitlab/our_repo.git', '/home/ec2-user/.clearml/vcs-cache/our_repo.git.c3e87922dd57630bb815feb7dcb4354b/our_repo.git', '--recursive', '--quiet']' returned non-zero exit status 128.
clearml_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository='_company.com:1234/our_gitlab/our_repo.git', branch='main', commit_id='f6d54eadf0108a3af243595426a710c150e14861', tag='', docker_cmd=None, entry_point='lstm_training.py', working_dir='tasks')
2) Check if remote-worker has valid credentials [see worker configuration file]
I tried to force the host, but it didn't work - for some reason it started using the agent instance user (ec2-user) instead of git.
Could it be because our gitlab instance isn't hosted on our_tools.our_company.com:1234 but on our_tools.our_company.com:1234/our_gitlab/?
I managed to solve the issue by debugging the agent. Despite the _company.com:1234/our_gitlab/our_repo.git line in the logs, I found that it was actually trying to clone from the URL _company.com:1234/our_gitlab/our_repo.git. The agent host machine therefore didn't use the git user but the session user ec2-user, resulting in a permission denied error.
I solved it by adding an entry in the agent's ~/.ssh/config to force the use of user git every time it tries to connect to the host where my gitlab instance is served:
Host ourtools.ourcompany.com
    User git
    Hostname ourtools.ourcompany.com
    IdentityFile ~/.ssh/id_ourcompany
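To illustrate why the config entry helps, here is a rough sketch of how the SSH login user ends up being chosen: a user named in the URL wins, otherwise a `User` entry from ~/.ssh/config applies, otherwise SSH falls back to the local account name. The function names and the example.com URLs are hypothetical, and this only approximates OpenSSH's behaviour; it is not the agent's actual code:

```python
import getpass
import re
from typing import Optional
from urllib.parse import urlsplit


def ssh_user_in_url(url: str) -> Optional[str]:
    """Return the user named in an SSH git URL, or None if the URL names none."""
    if "://" in url:
        # URL form, e.g. ssh://git@host:1234/group/repo.git
        return urlsplit(url).username
    # scp-like form, e.g. git@host:group/repo.git
    match = re.match(r"^(?:([^@/:]+)@)?[^@/:]+:", url)
    return match.group(1) if match else None


def effective_ssh_user(
    url: str,
    config_user: Optional[str] = None,
    local_user: Optional[str] = None,
) -> str:
    """Approximate the precedence: URL user, then ssh config User, then local login."""
    return ssh_user_in_url(url) or config_user or local_user or getpass.getuser()


if __name__ == "__main__":
    # Hypothetical URL without a user, as the agent appeared to use when cloning.
    url = "ssh://example.com:1234/our_gitlab/our_repo.git"
    print(effective_ssh_user(url, local_user="ec2-user"))
    print(effective_ssh_user(url, config_user="git", local_user="ec2-user"))
```

With no user in the URL and no config entry, the connection is attempted as the session user (ec2-user in this thread); adding `User git` for the host changes the outcome without touching the URL.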