Hi TimelyPenguin76 and SuccessfulKoala55,
My tasks are created by first spawning many sub-processes, and then in each sub-process: initializing a task, connecting the task to some parameters, cloning the task, enqueueing the cloned task, and then killing the sub-process (roughly the sketch below). When I do this with just a single sub-process, everything seems to work fine. When there are many sub-processes, I get the error message occasionally.
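For reference, a minimal sketch of what each sub-process runs (the project, task, parameter, and queue names here are placeholders, not my real ones):

from clearml import Task

# Inside each sub-process: create a template task and connect its parameters
task = Task.init(project_name='my_project', task_name='template')
task.connect({'param_a': 1})  # example parameters

# Clone the template, then enqueue the clone for an agent to pick up
cloned = Task.clone(source_task=task)
Task.enqueue(cloned, queue_name='default')

task.close()  # the sub-process is killed after this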
Yes, I use a locally hosted server (SAIPS team).
ArrogantBlackbird16 can you send a toy example so I can reproduce it on my side?
Thanks for your help and quick replies.
To create each subprocess, I use the following:
import subprocess
import os
from copy import copy

# Copy the current environment and strip the ClearML/Trains task identifiers,
# so the child process does not inherit the parent's task context
new_env = copy(os.environ)
new_env.pop('TRAINS_PROC_MASTER_ID', None)
new_env.pop('TRAINS_TASK_ID', None)
new_env.pop('CLEARML_PROC_MASTER_ID', None)
new_env.pop('CLEARML_TASK_ID', None)

subprocess.Popen(cmd, env=new_env, shell=True)
Where cmd is something like "python file.py <parameters>"
Perhaps this somehow disrupts ClearML's operation in the sub-processes?
Maybe I missed something here: does each process also create an agent to run the task with?
Hi ArrogantBlackbird16 ,
How do you generate and run your tasks? Do you use the same flow as in the agent and queue workflow ( https://clear.ml/docs/latest/docs/fundamentals/agents_and_queues#agent-and-queue-workflow )? Some other automation?
Hi TimelyPenguin76 ,
Making such a toy example will take a lot of effort.
For now I intend to debug it or circumvent the error with various tricks.
If you could explain the cause of the error message above, or share some details about it, I would very much appreciate it.
ArrogantBlackbird16 when you say spawn, what exactly do you mean? Also, are you using a locally-hosted server?
TimelyPenguin76 Thanks for the reply.
I believe the way I start tasks is completely independent of this problem. Assuming my approach is in principle legitimate, it does not explain why I get the following error message. Note that the error only happens when I start multiple tasks. What is the cause of this error?

clearml_agent: ERROR: Instance with the same WORKER_ID [algo-lambda:gpu0] is already running
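If the cause is two agent instances somehow ending up with the same worker ID, then (this is an assumption on my part, not something I have verified) forcing a unique ID on each agent via the CLEARML_WORKER_ID environment variable might sidestep it:

import os
import subprocess

# Assumption: give each spawned agent a unique worker ID so two
# instances cannot collide on the same WORKER_ID
env = dict(os.environ)
env['CLEARML_WORKER_ID'] = f'algo-lambda:gpu0:{os.getpid()}'  # unique per process
subprocess.Popen('clearml-agent daemon --queue default', env=env, shell=True)  # 'default' is a placeholder queue name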
TimelyPenguin76 SuccessfulKoala55
Do you have any idea what may cause this?
Is it possible that different tasks created together somehow have the same identifier?
Or am I missing something obvious?
I believe there is a single agent and a single queue for all tasks.
ArrogantBlackbird16 is file.py the file that contains the Task.init call?
Not sure I'm getting the flow. If you just want to create a template task in the system, then clone and enqueue it, you can use task.execute_remotely(queue_name="my_queue", clone=True). Can this solve the issue?
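A minimal sketch of that suggestion (project, task, parameter, and queue names are placeholders):

from clearml import Task

task = Task.init(project_name='my_project', task_name='experiment')
task.connect({'param_a': 1})  # example parameters

# Clone this task and enqueue the clone on 'my_queue'; with
# exit_process=False the calling process keeps running afterwards
task.execute_remotely(queue_name='my_queue', clone=True, exit_process=False)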