This happens inside the agent, since I use task.execute_remotely(), I guess. The agent runs on Ubuntu 18.04 and not in docker mode.
Using this code in https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py :
```
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run...
```
My code now produces an error inside one of the child processes, but that should be an issue on my side. Still, this failure inside a child process was not detected as a failure, and the training task ended up as "completed". This error happens with the Task.init inside the if __name__ == "__main__": block, as seen in the code snippet above.
```
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
```
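For what it's worth, here is a minimal sketch of one possible workaround (this is not ClearML's built-in behaviour, and finalize_workers and the workers list are hypothetical names for whatever your spawning code returns): have the parent join the worker processes and explicitly mark the task failed when any of them exits with a non-zero code, so a crashed child no longer leaves the task as "completed":

```
from clearml import Task

def finalize_workers(workers):
    """Join spawned worker processes and fail the task if any crashed.

    `workers` is assumed to be a list of multiprocessing.Process objects
    created by the spawning code (e.g. by spawn_dist).
    """
    for proc in workers:
        proc.join()
    failed_codes = [proc.exitcode for proc in workers if proc.exitcode != 0]
    if failed_codes:
        # flip the task status from "completed" to "failed"
        Task.current_task().mark_failed(
            status_reason="worker exited with code(s) {}".format(failed_codes))
```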
I'm now running the code shown above and will let you know if there is still an issue.
clearml_agent v1.0.0 and clearml v1.0.2
Hey AgitatedDove14, I fixed my code issue and am now able to train on multiple GPUs using https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py . Since I create the ClearML Task in the main process, I now can't see any training plots and probably not the output model either. What would be the right approach? I would like to avoid using Task.current_task().upload_artifact() or manual logging. I really enjoy the automatic detection.
```
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args)
```
I added it to this script and use it as a starting point: https://github.com/facebookresearch/fastMRI/bl...
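In case it helps in the meantime, here is a minimal sketch of explicit reporting from the spawned workers (I know this is not the automatic detection you are after, and worker_main and train are hypothetical names): pass the parent task's ID into each worker, attach to it with Task.get_task(), and report scalars through its logger:

```
from clearml import Task

def train(args):
    # placeholder training loop that yields one loss value per iteration
    for _ in range(10):
        yield 0.0

def worker_main(rank, task_id, args):
    # attach to the task that was created in the parent process
    task = Task.get_task(task_id=task_id)
    logger = task.get_logger()
    for iteration, loss in enumerate(train(args)):
        logger.report_scalar(
            title="loss", series="rank_{}".format(rank),
            value=loss, iteration=iteration)
```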
I'm running the following agent:
```
clearml-agent --config-file /clearml-cache/config/clearml-cpu.conf daemon --queue cpu default services --docker ubuntu:20.04 --cpu-only --services-mode 4 --detached
```
The goal is to have an agent that can run multiple CPU-only tasks at the same time. I noticed that when enqueueing multiple tasks, all except one stay pending until the first one has finished downloading all packages and started code execution. And then, task by task, they switch to "run...
Actually, I saw that the RuntimeError: context has already been set appears when the task is initialised outside of if __name__ == "__main__":
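A minimal illustration of why the placement matters (a sketch, not your actual spawn_dist code): with the "spawn" start method the script is re-imported in every child, so anything at module level, including a Task.init() call, runs again in each worker, while code under the __main__ guard runs only once, in the parent:

```
import torch.multiprocessing as mp
from clearml import Task

def worker(rank):
    # under "spawn" this module is re-imported here, so any module-level
    # code (e.g. a top-level Task.init()) would execute again per worker
    print("worker {} started".format(rank))

if __name__ == "__main__":
    # under the guard, Task.init() executes exactly once, in the parent,
    # before the workers are spawned
    task = Task.init(project_name="dummy", task_name="spawn-demo")
    mp.spawn(worker, nprocs=2)
```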
But this would still be part of the clearml.conf, right? I would prefer a way around clearml.conf to avoid resolving the variables.
We run a lot of pipelines that are CPU-only, with some parallel steps. It's just about improving the execution time.
I can figure out a way to resolve it, but is there any other way to get env vars / any value or secret from the host into the docker container of a task?
I like this approach more, but it still requires resolved environment variables inside the clearml.conf.
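For reference, a sketch of the kind of clearml.conf entry being discussed here (MY_SECRET is just a placeholder name): the agent-side extra_docker_arguments key forwards extra flags to docker run for every task container, and passing "-e MY_SECRET" without a value makes docker copy that variable from the environment of the agent process that launches the container, rather than hard-coding the value in the conf:

```
agent {
    # arguments forwarded verbatim to `docker run` for each task container;
    # "-e MY_SECRET" (no value) copies MY_SECRET from the agent's environment
    extra_docker_arguments: ["-e", "MY_SECRET"]
}
```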
The strange thing was that my agents were running in the morning but just disappeared in the ClearML server UI under workers-and-queues. So I did docker-compose down / up, and then I got this error.
It is working now; it seems I pointed to the wrong entrypoint.sh in the docker-compose file. Still strange...
We do have a queue called office and another queue called default, so the agent is not listening on queues that are not defined. Or am I misunderstanding something? The server has all the queues defined that the agents are using.
docker-compose with an entrypoint.sh that runs:
```
python3 -m clearml_agent daemon --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" --force-current-version ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS} --queue office
```
I already increased the memory to 8GB after reading about similar issues here on the Slack.
Just making sure, how exactly did you do that?
docker-compose down, then in docker-compose.yml:
```
elasticsearch:
  networks:
    - backend
  container_name: clearml-elastic
  environment:
    ES_JAVA_OPTS: -Xms8g -Xmx8g
```
then docker-compose up -d
Try to restart ES and see if it helps
docker-compose down / up does not help
What version of the clearml server are you running?
The docker-compose uses clearml:latest.
Very good news!
```
root@ubuntu:/opt/clearml# sudo docker logs clearml-elastic
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
{"type": "server", "timestamp": "2021-11-09T12:49:13,403Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (//some_ip/clearml-server-data)]], net usable_space [3.4tb]...
```
So you say deleting other old indices that I don't need could help?
This did not help, I still have the same issue