But this would still be part of the clearml.conf, right? I would prefer a way around clearml.conf to avoid resolving the variables.
I like this approach more, but it still requires resolved environment variables inside the clearml.conf.
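As a possible way around a fully resolved clearml.conf, here is a minimal sketch using the SDK's environment-variable overrides; the CLEARML_* variable names are the ones the SDK documents, while the hosts and keys below are placeholders:

```python
import os

# Sketch: point the SDK at the server via environment variables instead of
# baking resolved values into clearml.conf. Hosts/keys are placeholders.
os.environ["CLEARML_API_HOST"] = "http://clearml-api.example.com:8008"
os.environ["CLEARML_WEB_HOST"] = "http://clearml-web.example.com:8080"
os.environ["CLEARML_FILES_HOST"] = "http://clearml-files.example.com:8081"
os.environ["CLEARML_API_ACCESS_KEY"] = "<access_key>"
os.environ["CLEARML_API_SECRET_KEY"] = "<secret_key>"

# import and initialize only after the environment is prepared
from clearml import Task

task = Task.init(project_name="demo", task_name="env-config-example")
```

In practice these variables would usually be exported in the shell or the agent's environment rather than set in code; the snippet just illustrates that no resolved values need to live in clearml.conf.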
When using clearml-agent daemon --queue default --docker it is running. In this case I always had some issues when adding the --gpus flag.
I think Anna means that if artifacts and models are stored on the clearml fileserver, their path will contain the IP or domain of the fileserver. If you then move the fileserver to a different host, all the URLs are broken since the host changed.
SuccessfulKoala55 do you have any example? I guess a lot of people face this issue
SuccessfulKoala55 Hey, for us the artifact download URLs, model download URLs, images in plots and debug image URLs are broken. In the linked example I can see a solution for the debug images and potentially the plot images, but I can't find the artifact and model URLs inside ES. Are those URLs maybe stored inside the MongoDB? Any idea where to find them?
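For context, a rough sketch of how such URLs could be rewritten directly in MongoDB. The database name ("backend"), the collections ("model", "task"), and the field locations are assumptions about the server schema and need to be verified for your server version, ideally against a backup first:

```python
import re
from pymongo import MongoClient

# Assumptions: database "backend", collections "model"/"task", URI field "uri",
# artifacts under "execution.artifacts". Verify these before running anything.
OLD_HOST = "http://old-fileserver:8081"
NEW_HOST = "http://new-fileserver:8081"

client = MongoClient("mongodb://localhost:27017")
db = client["backend"]

# Model download URLs
for doc in db["model"].find({"uri": {"$regex": "^" + re.escape(OLD_HOST)}}):
    db["model"].update_one(
        {"_id": doc["_id"]},
        {"$set": {"uri": doc["uri"].replace(OLD_HOST, NEW_HOST, 1)}},
    )

# Task artifact URLs
for doc in db["task"].find({"execution.artifacts": {"$exists": True}}):
    artifacts = doc["execution"]["artifacts"]
    # depending on the server version this may be a list or a dict
    entries = artifacts.values() if isinstance(artifacts, dict) else artifacts
    changed = False
    for entry in entries:
        uri = entry.get("uri", "")
        if uri.startswith(OLD_HOST):
            entry["uri"] = uri.replace(OLD_HOST, NEW_HOST, 1)
            changed = True
    if changed:
        db["task"].update_one(
            {"_id": doc["_id"]},
            {"$set": {"execution.artifacts": artifacts}},
        )
```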
Hi AgitatedDove14, I get an error when running a task on my worker. I have looked into /home/user/.clearml/venvs-builds but it is empty. Any idea why this happens? I actually don't know what I changed to cause this issue… I'm running clearml-agent v1.0.0
clearml_agent: ERROR: Command '['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']' returned non-zero exit status 1.
Yes, this one is running in a venv and not Docker, because I had some issues with CUDA and Docker. The requirements.txt pins virtualenv==20.4.6. I think it broke after installing clearml-serving in the same env.
My code now produces an error inside one of the threads, but that should be an issue on my side. Still, this exception inside a child thread was not detected as a failure and the training task ended up as "completed". This error now happens with the Task.init inside the if __name__ == "__main__":, as seen in the code snippet.
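A minimal sketch of how a child-thread exception could be surfaced so the task does not end as "completed", assuming the training runs in a thread started from the main process; Task.init and mark_failed are clearml SDK calls (check the status_message argument against your SDK version), everything else is a placeholder:

```python
import threading
from clearml import Task

def run_training():
    # placeholder for the real training code that raises inside a child thread
    raise RuntimeError("simulated failure in child thread")

def run_in_thread(target):
    """Run target in a thread and return the exception it raised, if any."""
    holder = {}

    def wrapper():
        try:
            target()
        except Exception as exc:
            holder["exc"] = exc

    thread = threading.Thread(target=wrapper)
    thread.start()
    thread.join()
    return holder.get("exc")

if __name__ == "__main__":
    task = Task.init(project_name="dummy", task_name="thread-failure-example")
    exc = run_in_thread(run_training)
    if exc is not None:
        # explicitly mark the task as failed instead of letting it end as "completed"
        task.mark_failed(status_message=str(exc))
        raise exc
```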
clearml_agent v1.0.0 and clearml v1.0.2
```python
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args)
```
I added it to this script and use it as a starting point https://github.com/facebookresearch/fastMRI/bl...
Actually I saw that the RuntimeError: context has already been set appears when the task is initialised outside if __name__ == "__main__":. I'm now running the code shown above and will let you know if there is still an issue.
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
We do have a queue called office and another queue called default, so the agent is not listening for queues that are not defined. Or do I misunderstand something? The server has all queues defined that the agents are using
It is working now, it seemed like I pointed to a wrong entrypoint.sh in the docker-compose file. Still strange...
CostlyOstrich36 Thank you for your response, is there something like a public project roadmap?
docker-compose with an entrypoint.sh running python3 -m clearml_agent daemon --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" --force-current-version ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS} --queue office
The strange thing was that my agents were running in the morning but just disappeared in the ClearML server UI under workers-and-queues. So I did docker-compose down / up and then I got this error.
Hi AgitatedDove14, one more question about efficient caching: is it possible to cache/share Docker images between agents?
The cache on the host is mounted as NFS, and the NFS server was configured to not allow the clients to do root operations.
OK, it is more of a Docker issue; reading the thread, I guess it is not feasible.
W: chown to _apt:root of directory /var/cache/apt/archives/partial failed - SetupAPTPartialDirectory (1: Operation not permitted)
W: chmod 0700 of directory /var/cache/apt/archives/partial failed - SetupAPTPartialDirectory (1: Operation not permitted)
Collecting pip==20.1.1