Hey AgitatedDove14 thanks, that works! The docker is now up and running, great success.
I have a follow-up, maybe you can help debug. Now for some reason git clone doesn't work through the agent, but if I log in to the machine myself and run the same command that I see failing in the log, it works. The error I see is:
` cloning: git@gitlab.com:<repo_path>
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Repository cloning failed: Command '['clone', 'git@gitlab.com:<repo_path>', '/root/.clearml/vcs-cache/algo.git.79829419d47144c686928c19e208f770/algo.git', '--quiet', '--recursive']' returned non-zero exit status 128.
clearml_agent: ERROR: Failed cloning repository.`
But if I create a container with:
`docker run -it <paste options from clearml UI> <docker image from clearml UI> /bin/bash`
and then run:
`git clone git@gitlab.com:<repo_path> /root/.clearml/vcs-cache/algo.git.79829419d47144c686928c19e208f770/algo.git --quiet --recursive`
it works like a charm. Any suggestions? What am I missing? (The docker image we build has the SSH key in it.)
Hi CleanPigeon16
I think the issue now is missing git credentials. Did you pass git_user / git_pass to the AWS autoscaler?
Update 2: it works with the public repo using https: https://gitlab.com/gitlab-org/gitlab-foss.git but not with the private one, which fails with fatal: could not read Username for '': terminal prompts disabled
I did not. I see that there's a field for extra_trains_conf, but couldn't find clear documentation on how to use it. Is it just a reference to a trains_conf (maybe clearml_conf?)?
Hi CleanPigeon16, yes it is.
You can just write the same as you do in your ~/clearml.conf file, for example:
agent.force_git_ssh_protocol = true
Hi CleanPigeon16
You need to pass the private repository docker credentials to the AWS instance. I would use the custom bash script option of the AWS autoscaler to create the docker credentials file.
it's in the docker image, doesn't the git clone command run in the container
Then this should have worked.
Did you pass in the configuration: force_git_ssh_protocol: true
https://github.com/allegroai/clearml-agent/blob/e93384b99bdfd72a54cf2b68b3991b145b504b79/docs/clearml.conf#L25
No, I use an SSH connection, which worked with the regular clearml-agent; we prefer to work with SSH instead of creating a git user.
Hey, I tried doing that but sadly it doesn't seem to work. As suggested by the ECR docs, I added: aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ECR URI> to the extra_vm_bash_script in the config file. I even added a docker pull, which I think worked (because it took much longer for the instances to spin up), but I still got the same error message 😞 Is there any way to debug these sessions through clearml? Thanks!
Yes, this is a real problem; AWS does not make it easy to get at this data...
Can you check the AWS console and see what you have there?
In theory this should have worked.
Maybe you are missing some escaping in the "extra_vm_bash_script"?
I'm hoping the console output will tell us
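For reference, the console output can also be pulled with the AWS CLI instead of the web console. This is only a sketch, assuming the CLI is installed and configured on your own machine; `show_console_output` is a hypothetical helper name, the instance ID is a placeholder, and the default region is just the one mentioned in this thread:

```shell
# Hypothetical helper: print the console log of an EC2 instance.
show_console_output() {
    local instance_id="$1"          # ID of the instance the autoscaler launched
    local region="${2:-us-east-1}"  # assumed default region
    aws ec2 get-console-output \
        --instance-id "$instance_id" \
        --region "$region" \
        --output text
}

# Usage (placeholder ID, commented out because it needs real AWS access):
# show_console_output "<instance id>"
```

If the bash script fails early, the error message will often show up in this output.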
Update: got the same error while trying to clone a public repo: git@gitlab.com:gitlab-org/gitlab-foss.git
it's in the docker image, doesn't the git clone command run in the container?
Then you have to pass the .ssh into the remote server, probably the easiest is to have it in the "extra bash script"
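A minimal sketch of what that part of the "extra bash script" might look like. The helper name `install_ssh_key`, the key file name `id_rsa`, and the `GITLAB_SSH_KEY` variable are assumptions for illustration, not something ClearML defines:

```shell
# Hypothetical helper: write a private key into an .ssh directory
# with the restrictive permissions that ssh requires.
install_ssh_key() {
    local ssh_dir="$1"       # target directory, e.g. /root/.ssh
    local key_material="$2"  # private key contents
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    printf '%s\n' "$key_material" > "$ssh_dir/id_rsa"
    chmod 600 "$ssh_dir/id_rsa"
}

# Usage on the instance (commented out; the key value is a placeholder):
# install_ssh_key /root/.ssh "$GITLAB_SSH_KEY"
# Trust gitlab.com's host key up front so the clone does not stop at a prompt:
# ssh-keyscan gitlab.com >> /root/.ssh/known_hosts
```

ssh refuses private keys that are readable by other users, hence the two `chmod` calls.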
So apparently the NVIDIA AMI https://aws.amazon.com/marketplace/pp/prodview-e7zxdqduz4cbs doesn't have the aws-cli installed. So I installed it in the extra_vm_bash_script and now it wants a configuration. Is there any way to get that from the ENV vars you create? Do you think I should create my own AMI just for this?
Those variables are not passed to the remote instance; they are only used by the AWS autoscaler to launch it, but there is no need to pass them.
I think the easiest is to add them to the "extra_vm_bash_script" as well
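Adding them to the script could look roughly like this. It is a sketch only: the three variable names are the standard ones the AWS CLI reads from the environment, the values are placeholders, and `ecr_login` is a hypothetical wrapper around the login command quoted earlier in the thread:

```shell
# Credentials for the AWS CLI on the instance (placeholder values).
export AWS_ACCESS_KEY_ID="<access key id>"
export AWS_SECRET_ACCESS_KEY="<secret access key>"
export AWS_DEFAULT_REGION="us-east-1"

# Hypothetical wrapper: log docker in to a private ECR registry.
ecr_login() {
    local registry="$1"  # the ECR URI
    aws ecr get-login-password --region "$AWS_DEFAULT_REGION" \
        | docker login --username AWS --password-stdin "$registry"
}

# Usage (commented out; needs real credentials and a real registry):
# ecr_login "<ECR URI>"
```

With the variables exported, the aws-cli should no longer ask for a configuration.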