Hi ArrogantBlackbird16,
How do you generate and run your tasks? Do you use the same flow as in https://clear.ml/docs/latest/docs/fundamentals/agents_and_queues#agent-and-queue-workflow, or some other automation?
ArrogantBlackbird16 when you say spawn, what exactly do you mean? Also, are you using a locally-hosted server?
Hi TimelyPenguin76 and SuccessfulKoala55,
I create my tasks by first spawning many sub-processes, and then in each sub-process: initializing a task, connecting the task to some parameters, cloning the task, enqueueing the clone, and then killing the sub-process. When I do this with just a single sub-process, everything seems to work fine. When there are many sub-processes, I occasionally get the error message.
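Roughly, each sub-process does the following (a minimal sketch using the standard clearml Task API; project, task, queue, and parameter names are illustrative):

from clearml import Task

# Initialize a template task inside the sub-process
task = Task.init(project_name="my_project", task_name="template")

# Connect the task to its parameters
params = {"learning_rate": 0.01, "batch_size": 32}
task.connect(params)

# Clone the template and enqueue the clone for the agent to execute
cloned = Task.clone(source_task=task, name="cloned run")
Task.enqueue(cloned, queue_name="default")

# Close the local task before the sub-process exits
task.close()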
Yes, I use a locally hosted server (SAIPS team).
TimelyPenguin76 SuccessfulKoala55
Do you have any idea what may cause this?
Is it possible that different tasks created together somehow have the same identifier?
Or am I missing something obvious?
To create each subprocess, I use the following:
import os
import subprocess
from copy import copy

# Copy the current environment and drop the ClearML/Trains task variables,
# so the child process starts its own task instead of inheriting this one
new_env = copy(os.environ)
new_env.pop('TRAINS_PROC_MASTER_ID', None)
new_env.pop('TRAINS_TASK_ID', None)
new_env.pop('CLEARML_PROC_MASTER_ID', None)
new_env.pop('CLEARML_TASK_ID', None)

subprocess.Popen(cmd, env=new_env, shell=True)
Where cmd is something like "python file.py <parameters>"
Perhaps this somehow disrupts clearml's operation in the sub-processes?
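To check the duplicate-identifier theory above, one option is to list the tasks the sub-processes created and verify their IDs are distinct (a quick sketch, assuming they all land in the same project; the project name is illustrative):

from clearml import Task

# Fetch the tasks in the project the sub-processes enqueue into
tasks = Task.get_tasks(project_name="my_project")
task_ids = [t.id for t in tasks]

# If the two counts differ, some tasks share an identifier
print(len(task_ids), len(set(task_ids)))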
ArrogantBlackbird16 is file.py the file that contains the Task.init call?
Not sure I'm getting the flow. If you just want to create a template task in the system, then clone and enqueue it, you can use task.execute_remotely(queue_name="my_queue", clone=True). Can this solve the issue?
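For example (a minimal sketch; project, task, and queue names are illustrative):

from clearml import Task

task = Task.init(project_name="my_project", task_name="template")
task.connect({"learning_rate": 0.01})

# Clone this task and enqueue the clone on "my_queue" for the agent to pick up
task.execute_remotely(queue_name="my_queue", clone=True)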
TimelyPenguin76 Thanks for the reply.
I believe the way I start tasks is completely independent of this problem. Assuming my approach is in principle legitimate, it does not explain why I get the following error message. Note that the error only happens when I start multiple tasks. What is the cause of this error?
clearml_agent: ERROR: Instance with the same WORKER_ID [algo-lambda:gpu0] is already running
Maybe I missed something here, does each process also create an agent to run the task with?
I believe there is a single agent and a single queue for all tasks.
ArrogantBlackbird16 can you send a toy example so I can reproduce it on my side?
Hi TimelyPenguin76,
Making such a toy example will take a lot of effort.
For now I intend to debug it or circumvent the error with various tricks.
If you can explain the cause of the error message above, or share some details about it, I would very much appreciate it.
Thanks for your help and quick replies.