Yes, and if I clone it as is, everything works as expected. But if I clone it and clear both IMAGE and ARGUMENTS, only IMAGE is parsed when the task is sent for execution. You may ask why I need to clear anything if freshly cloned tasks work fine. The reason is that we have some old template tasks, added by other members of my team before we switched to a custom image and an agent running in docker mode, with empty IMAGE and ARGUMENTS fields an...
I have a GPU machine on the local network where clearml-agent is running. I send tasks for execution to the queue configured on the agent, either through the UI or through a script that calls Task.execute_remotely(queue_name=...) on another machine in the same network. The config file is located in the /home/{username} folder on the machine where the agent is running.
Ahh, sorry about that. I have both image and arguments values in the config file:
default_docker: {
    image: {our_custom_image_name}
    arguments: ["--ipc=host", "-v", "/home/{username}/clearml.conf:/workdir/clearml.conf", "-v", "/home/{username}/.ssh:/root/.ssh", "-e", "CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1"]
}
And, as I said, when the Task is cloned and sent for execution in the UI (and it doesn't have anything in the IMAGE or ARGUMENTS fields on the execution tab initi...
The image was built with the following command:
docker build --build-arg PAT=$(shell echo ${PAT}) -t $(IMAGE_NAME) .
Do you think the use of --build-arg could be a problem here? My understanding was that the ARGUMENTS parameter is used with docker run to start a container and has nothing to do with image build arguments.
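To make the build-time versus run-time distinction concrete, here is a minimal sketch. It assumes, as the message suggests, that the ARGUMENTS values end up on a docker run command line; the helper names build_command and run_command and the image name my_image are hypothetical, and this is not how clearml-agent actually assembles its command.

```python
import shlex


def build_command(image_name, build_args):
    """Compose a `docker build` command line.

    Values passed with --build-arg are only visible while the image is built.
    """
    cmd = ["docker", "build"]
    for key, value in build_args.items():
        cmd += ["--build-arg", f"{key}={value}"]
    cmd += ["-t", image_name, "."]
    return cmd


def run_command(image_name, run_arguments):
    """Compose a `docker run` command line (assumed shape, for illustration).

    The run arguments are placed before the image name, independently of
    whatever build arguments were used to create the image.
    """
    return ["docker", "run", *run_arguments, image_name]


# Hypothetical values, for illustration only.
build = build_command("my_image", {"PAT": "<token>"})
run = run_command(
    "my_image",
    ["--ipc=host", "-e", "CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1"],
)

print(shlex.join(build))
print(shlex.join(run))
```

Under that assumption, the two command lines share only the image name, so a --build-arg used at build time should not influence how the run arguments are parsed.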
But when a new task is created in code with Task.init(), both fields are parsed correctly.
I have these lines in my clearml.conf file. I can't find them on GitHub either, but the release notes say they were added in v1.2.1: https://github.com/allegroai/clearml-agent/releases/tag/v1.2.1
` # Specifies a custom environment setup script to be executed instead of installing a virtual environment.
# If provided, this script is executed following Git cloning. Script command may include environment variable and
# will be expanded before execution (e.g. "$CLEARML_GIT_ROOT/sc...
My previous version was 1.2.4rc3 but, to be honest, I can't find this part of the config in the corresponding commit either.
Thanks! That worked. If you don't mind could you point me in the direction where I can find the commit that resolved it?
AgitatedDove14 Actually, it happens on the same machine, where clearml-agent was started with clearml-agent daemon --detached --queue training-rig --gpus 1 --docker. The only difference is how I log in to the machine to start the agent (as described in the message above).
When I log in over ssh with a password, start the agent with the command above, add the extra "-v", "/home/{user}/.ssh:/root/.ssh" to the docker arguments, and send a task for execution on this agent, I see:
` 2022-07-28 16:...
I think I narrowed the problem down to whether ssh agent forwarding is used. When I connected through my ssh config without a password, the config had the option ForwardAgent yes. With this enabled, the agent I started on the remote machine didn't mount the .ssh folder by default until I added "-v", "/home/{user}/.ssh:/root/.ssh" to the arguments. So, without ssh agent forwarding, everything works as expected.
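A small diagnostic along these lines could be run in the shell session that starts the agent. It is a minimal sketch, assuming the usual OpenSSH behaviour of exporting SSH_AUTH_SOCK when an agent (including a forwarded one) is available; the function names are hypothetical, and whether clearml-agent itself looks at this variable is not confirmed by this thread.

```python
import os


def ssh_agent_socket_present(env=None):
    """Return True if the environment exposes an SSH agent socket.

    Assumption: a session with agent forwarding has SSH_AUTH_SOCK set.
    """
    env = os.environ if env is None else env
    return bool(env.get("SSH_AUTH_SOCK"))


def has_ssh_mount(docker_arguments, container_path="/root/.ssh"):
    """Return True if a -v/--volume argument targets container_path."""
    for flag, value in zip(docker_arguments, docker_arguments[1:]):
        # A volume spec looks like host_path:container_path[:options].
        if flag in ("-v", "--volume") and value.split(":")[1:2] == [container_path]:
            return True
    return False


# Hypothetical arguments, mirroring the extra mount described above.
arguments = ["--ipc=host", "-v", "/home/user/.ssh:/root/.ssh"]

print("agent socket present:", ssh_agent_socket_present())
print("explicit .ssh mount:", has_ssh_mount(arguments))
```

Recording both values next to each failing and passing run would show whether the behaviour tracks agent forwarding or the explicit mount.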
Hi again SuccessfulKoala55 Sorry for the late response. Thanks for your help so far! I understand that it's a weird problem and it probably won't be resolved in this discussion, but just in case, I've checked the mongodb entry for the task after cloning, after clearing the fields, and after sending it for execution.
After clone:
` { "_id" : "b94657b00adf41dd971426f51d7b9373", "container" : { "image" : "{image_name}", "arguments" : "--ipc=host -v /home/{username}/clearml.conf:/workdir/clearml....
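The stored entry keeps the arguments as a single string, while clearml.conf lists them as separate items. When comparing the two by hand, a sketch like the following may help. It assumes shell-style splitting is a fair way to compare them; the paths are placeholders, the helper name same_arguments is hypothetical, and this is not necessarily how ClearML converts between the two forms.

```python
import shlex

# Placeholder values modelled on the config list and the stored string.
config_arguments = [
    "--ipc=host",
    "-v",
    "/home/user/clearml.conf:/workdir/clearml.conf",
]
stored_arguments = "--ipc=host -v /home/user/clearml.conf:/workdir/clearml.conf"


def same_arguments(as_list, as_string):
    """Compare a list of arguments with a space-separated string form."""
    return shlex.split(as_string) == list(as_list)


print(same_arguments(config_arguments, stored_arguments))
print(same_arguments(config_arguments, ""))  # what a cleared field would hold
```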
And I see a different error if I log in without a password (using authorized keys) and remove the extra .ssh volume argument from the docker command:
fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists. Repository cloning failed
So it's not using the .ssh folder in the host user's home folder until I add "-v", "/home/{user}/.ssh:/root/.ssh" to the docker arguments.
So the only difference is how I log in to the machine to start clearml-agent (it somehow interferes with the training container's use of the .ssh folder).
At some point I installed it from an unstable release. That is probably when this part was added to the config, but it never made it into a proper release after that.
AgitatedDove14 Sorry for the mention, but I wanted to ask the same question. Did you get to the bottom of the issue above, where the .ssh folder can be copied only for the first task after the agent daemon has started but not for the following ones? (It complains about not being able to create a temp copy until I restart the agent.)
EnviousPanda91 Hi! Did you manage to solve the issue? I've encountered the same behavior, where the agent can't create a temp copy of the .ssh folder for the second and all subsequent tasks: Failed creating temporary copy of ~/.ssh for git credential