Hey, I tried doing that, but sadly it doesn't seem to work. As suggested by the ECR docs, I added: `aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ECR URI>` to the extra_vm_bash_script in the config file. I even added a docker pull, which I think worked (because it took much longer for the instances to spin up), but I still got the same error message 😞 Is there any way to debug these sessions through clearml? Thanks!
Update: got the same error while trying to clone a public repo: git@gitlab.com:gitlab-org/gitlab-foss.git
Those variables are not passed to the remote instance; they are only used by the AWS autoscaler to launch it. Normally there is no need to pass them.
I think the easiest is to add them to the "extra_vm_bash_script" as well
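A rough sketch of what that could look like inside "extra_vm_bash_script", assuming the AWS CLI is already installed on the instance. The three `AWS_*` names are the standard environment variables the AWS CLI reads; `ECR_URI` is a hypothetical helper variable, and every value is a placeholder to replace with your own:

```shell
# Sketch for "extra_vm_bash_script"; all values are placeholders, not real credentials.
export AWS_ACCESS_KEY_ID="<access key id>"
export AWS_SECRET_ACCESS_KEY="<secret access key>"
export AWS_DEFAULT_REGION="us-east-1"
ECR_URI="<ECR URI>"   # hypothetical variable name for the registry address

# Log docker in to ECR only if both CLIs are present on the image.
if command -v aws >/dev/null 2>&1 && command -v docker >/dev/null 2>&1; then
    aws ecr get-login-password --region "$AWS_DEFAULT_REGION" \
        | docker login --username AWS --password-stdin "$ECR_URI" \
        || echo "ECR login failed" >&2
else
    echo "aws or docker not found; skipping ECR login" >&2
fi
```

With the variables exported, the CLI should not need a separate `aws configure` step.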
No, I use an SSH connection, which worked with the regular clearml-agent. We prefer to work with SSH instead of creating a git user.
Hi CleanPigeon16 , yes it is.
You can just write the same as you do in your ~/clearml.conf file, for example:
agent.force_git_ssh_protocol = true
Is there any way to debug these sessions through clearml? Thanks!
Yes, this is a real problem; AWS does not make it easy to get that data...
Can you check the AWS console, see what you have there ?
In theory this should have worked.
Maybe you are missing some escaping for the "extra_vm_bash_script"?
I'm hoping the console output will tell us
So apparently the NVIDIA AMI https://aws.amazon.com/marketplace/pp/prodview-e7zxdqduz4cbs
doesn't have the aws-cli installed. So I install it in the extra_vm_bash_script and now it wants a configuration. Is there any way to get that from the ENV vars you create? Do you think I should create my own AMI just for this?
Hi CleanPigeon16
You need to pass the private repository docker credentials to the aws instance, I would use the custom bash script option of the aws autoscaler to create the docker credentials file.
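As a minimal sketch of such a bash script, assuming a registry that accepts a static username and password: it writes a docker `config.json` with an `auths` entry. `DOCKER_CONFIG_DIR`, `REGISTRY` and `REGISTRY_AUTH` are hypothetical variable names and all values are placeholders. (ECR tokens expire, so for ECR the `docker login` pipe from the ECR docs is probably the better route.)

```shell
# Sketch for the autoscaler's custom bash script; placeholder values throughout.
DOCKER_CONFIG_DIR="${DOCKER_CONFIG_DIR:-$HOME/.docker}"
REGISTRY="<registry host>"
# docker stores "user:password" base64-encoded under the "auth" key.
REGISTRY_AUTH="$(printf '%s' '<user>:<password>' | base64 | tr -d '\n')"

mkdir -p "$DOCKER_CONFIG_DIR"
cat > "$DOCKER_CONFIG_DIR/config.json" <<EOF
{"auths": {"$REGISTRY": {"auth": "$REGISTRY_AUTH"}}}
EOF
# Keep the credentials readable by the owner only.
chmod 600 "$DOCKER_CONFIG_DIR/config.json"
```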
Update 2: it works with the public repo using https: https://gitlab.com/gitlab-org/gitlab-foss.git but not with the private one, which fails with: fatal: could not read Username for ' ': terminal prompts disabled
I did not, I see that there's a field for extra_trains_conf , but couldn't find clear documentation on how to use it. Is it just a reference to a trains_conf (maybe clearml_conf ?)?
it's in the docker image, doesn't the git clone command run in the container?
Hey AgitatedDove14 thanks, that works! The docker is now up and running, great success.
I have a follow-up, maybe you can help debug. Now for some reason git clone doesn't work through the agent, but if I log in to the machine myself and run the same command that I see failing in the log, it works. The error I see is:
` cloning: git@gitlab.com:<repo_path>
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Repository cloning failed: Command '['clone', 'git@gitlab.com:<repo_path>', '/root/.clearml/vcs-cache/algo.git.79829419d47144c686928c19e208f770/algo.git', '--quiet', '--recursive']' returned non-zero exit status 128.
clearml_agent: ERROR: Failed cloning repository.` And when I create a container with `docker run -it <paste options from clearml UI> <docker image from clearml UI> /bin/bash` and then run `git clone git@gitlab.com:<repo_path> /root/.clearml/vcs-cache/algo.git.79829419d47144c686928c19e208f770/algo.git --quiet --recursive`, it works like a charm. Any suggestions? What am I missing? (The docker image we build has the SSH key in it.)
Hi CleanPigeon16
I think now the issue is missing git credentials, did you pass git_user / git_pass to the AWS autoscaler ?
it's in the docker image, doesn't the git clone command run in the container
Then this should have worked.
Did you pass in the configuration: force_git_ssh_protocol: true
https://github.com/allegroai/clearml-agent/blob/e93384b99bdfd72a54cf2b68b3991b145b504b79/docs/clearml.conf#L25
Then you have to get the .ssh folder onto the remote server; probably the easiest is to set it up in the "extra bash script".
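Something along these lines might do it, as a sketch only. `SSH_DIR` and `SSH_PRIVATE_KEY` are hypothetical names, the key value is a placeholder, and it assumes the script runs as the same user the agent later runs as:

```shell
# Sketch for the "extra bash script"; SSH_DIR and SSH_PRIVATE_KEY are hypothetical names.
SSH_DIR="${SSH_DIR:-$HOME/.ssh}"
mkdir -p "$SSH_DIR"
chmod 700 "$SSH_DIR"

# Write the private key (placeholder value; supply your own key securely).
printf '%s\n' "${SSH_PRIVATE_KEY:-<private key contents>}" > "$SSH_DIR/id_rsa"
chmod 600 "$SSH_DIR/id_rsa"

# Pre-trust the git host so a non-interactive clone is not stopped by a host-key prompt.
ssh-keyscan -T 5 gitlab.com >> "$SSH_DIR/known_hosts" 2>/dev/null || true
```

The `known_hosts` step is there because a clone that works in an interactive shell can still fail when nobody is around to answer the host-key question.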