Answered
Hi, I have started to receive the following error message:

Hi,
I have started to receive the following error message:
"clearml_agent: ERROR: Instance with the same WORKER_ID is already running"
I believe this happens when a process spawns many (tens) of tasks.
What can I do? I need to spawn many tasks...
I'm running clearml version 1.0.4, and it is impossible to update at the moment.
Thanks a lot!
Ron

  
  
Posted one year ago

Answers 15


Hi ArrogantBlackbird16,

How do you generate and run your tasks? Do you use the same flow as in https://clear.ml/docs/latest/docs/fundamentals/agents_and_queues#agent-and-queue-workflow, or some other automation?

  
  
Posted one year ago

Hi TimelyPenguin76 and SuccessfulKoala55,

My tasks are created by first spawning many sub-processes, and then in each sub-process: initializing a task, connecting the task to some parameters, cloning the task, enqueueing the cloned task, and then killing the sub-process. When I do this with just a single sub-process, everything seems to work fine. When there are many sub-processes, I occasionally get the error message.
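In code, each sub-process does roughly this (a minimal sketch; the project, task, queue, and parameter names are placeholders, not my actual values):

from clearml import Task

# Each sub-process: initialize a task, attach parameters, clone the task,
# and enqueue the clone; the sub-process is then killed.
params = {'learning_rate': 0.01}            # illustrative parameters
task = Task.init(project_name='my_project', task_name='template')
task.connect(params)
cloned = Task.clone(source_task=task)       # create a draft copy of the task
Task.enqueue(cloned, queue_name='default')  # hand the clone to the agent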

Yes, I use a locally hosted server (SAIPS team).

  
  
Posted one year ago

To create each subprocess, I use the following:

import os
import subprocess
from copy import copy

# Copy the parent environment and drop the ClearML/Trains task variables so
# the sub-process starts its own task instead of attaching to the parent's.
new_env = copy(os.environ)
new_env.pop('TRAINS_PROC_MASTER_ID', None)
new_env.pop('TRAINS_TASK_ID', None)
new_env.pop('CLEARML_PROC_MASTER_ID', None)
new_env.pop('CLEARML_TASK_ID', None)
subprocess.Popen(cmd, env=new_env, shell=True)

where cmd is something like "python file.py <parameters>".

Perhaps this somehow disrupts clearml operation in the sub-processes?

  
  
Posted one year ago

ArrogantBlackbird16, is file.py the file that contains the Task.init call?
I'm not sure I'm getting the flow. If you just want to create a template task in the system, then clone and enqueue it, you can use task.execute_remotely(queue_name="my_queue", clone=True). Can this solve the issue?
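For example (a minimal sketch; the project, task, and queue names are placeholders):

from clearml import Task

task = Task.init(project_name='my_project', task_name='template')
task.connect({'learning_rate': 0.01})  # illustrative parameters
# Clone this task and enqueue the clone on "my_queue"; exit_process=False
# keeps the current process alive (only allowed together with clone=True).
task.execute_remotely(queue_name='my_queue', clone=True, exit_process=False)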

  
  
Posted one year ago

TimelyPenguin76 SuccessfulKoala55
Do you have any idea what may cause this?
Is it possible that different tasks created together somehow have the same identifier?
Or am I missing something obvious?

  
  
Posted one year ago

ArrogantBlackbird16, when you say "spawn", what exactly do you mean? Also, are you using a locally-hosted server?

  
  
Posted one year ago

Hi TimelyPenguin76 ,

Making such a toy example will take a lot of effort.

For now I intend to debug it or circumvent the error with various tricks.
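One such trick I have in mind, assuming clearml-agent honors the CLEARML_WORKER_ID environment variable (an assumption on my part; I have not verified this on my version), is to give each agent instance a unique worker id so no two instances can collide on the same WORKER_ID:

import os
import subprocess

# Hypothetical workaround: launch the agent with a unique CLEARML_WORKER_ID
# so no two agent instances share the same WORKER_ID.
env = os.environ.copy()
env['CLEARML_WORKER_ID'] = 'algo-lambda:gpu0:%d' % os.getpid()
subprocess.Popen('clearml-agent daemon --queue default', env=env, shell=True)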

If it is possible to explain the cause of the error message above, or some details regarding it, I would very much appreciate it.

  
  
Posted one year ago

ArrogantBlackbird16, can you send a toy example so I can reproduce it on my side?

  
  
Posted one year ago

This is why it is so weird!

  
  
Posted one year ago

What is an agent?

  
  
Posted one year ago

TimelyPenguin76?

  
  
Posted one year ago

Thanks for your help and quick replies.

  
  
Posted one year ago

Maybe I missed something here: does each process also create an agent to run the task with?

  
  
Posted one year ago

I believe there is a single agent and a single queue for all tasks.

  
  
Posted one year ago

TimelyPenguin76, thanks for the reply.
I believe the way I start tasks is completely independent of this problem. Assuming my approach is in principle legitimate, it does not explain why I get the following error message. Note that the error only happens when I start multiple tasks. What is the cause of this error?
clearml_agent: ERROR: Instance with the same WORKER_ID [algo-lambda:gpu0] is already running

  
  
Posted one year ago