Do you have any idea what may cause this?
Is it possible that different tasks created together somehow have the same identifier?
Or am I missing something obvious?
I believe there is a single agent and a single queue for all tasks.
Hi TimelyPenguin76 ,
Making such a toy example will take a lot of effort.
For now I intend to debug it or circumvent the error with various tricks.
If it is possible to explain the cause of the error message above, or some details regarding it, I would very much appreciate it.
Hi AgitatedDove14 ,
Continuing from the previous question: Is it possible to detect remote Task execution before the remote Task.init(...) function call?
For example, when I run this:
```python
from clearml import Task

print("Doing some computations that MUST be local")  # I want to prevent this from running remotely
task = Task.init("OMD", task_name="bla")
task.set_base_docker("/home/rdekel/anaconda3/envs/P1")
cloned_task = Task.clone(source_task=task, name="Clone")
Task.enqueue(cloned_task.id, queue_name="ron_lambda_cp...
```
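For what it's worth, one way I could imagine detecting this is by checking the environment before Task.init. A minimal sketch, assuming clearml-agent sets CLEARML_TASK_ID / TRAINS_TASK_ID when it executes a task remotely (the same variables popped in the subprocess snippet below); this is an assumption on my part, not a confirmed API:
```python
import os

# Assumption: when clearml-agent runs a task remotely it sets CLEARML_TASK_ID
# (or TRAINS_TASK_ID in older versions) in the environment, so checking for
# them *before* Task.init() would indicate remote execution.
running_remotely = bool(os.environ.get('CLEARML_TASK_ID') or os.environ.get('TRAINS_TASK_ID'))

if not running_remotely:
    print("Doing some computations that MUST be local")
```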
To create each subprocess, I use the following:
```python
import os
import subprocess
from copy import copy

# Copy the environment and drop the ClearML/Trains task variables so the
# subprocess starts with a clean slate.
new_env = copy(os.environ)
new_env.pop('TRAINS_PROC_MASTER_ID', None)
new_env.pop('TRAINS_TASK_ID', None)
new_env.pop('CLEARML_PROC_MASTER_ID', None)
new_env.pop('CLEARML_TASK_ID', None)
subprocess.Popen(cmd, env=new_env, shell=True)
```
where cmd is something like "python file.py <parameters>".
Perhaps this somehow disrupts ClearML's operation in the subprocesses?
Thanks for your help and quick replies.
TimelyPenguin76 Thanks for the reply.
I believe the way I start tasks is completely independent of this problem. Assuming my approach is in principle legitimate, it does not explain why I get the following error message. Note that the error only occurs when I start multiple tasks. What is the cause of this error?
```
clearml_agent: ERROR: Instance with the same WORKER_ID [algo-lambda:gpu0] is already running
```
Hi TimelyPenguin76 and SuccessfulKoala55 ,
My tasks are created by first spawning many subprocesses, and then, in each subprocess: initializing a task, connecting the task to some parameters, cloning the task, enqueuing the cloned task, and then killing the subprocess. When I do this with just a single subprocess, everything seems to work fine. When there are many subprocesses, I get the error message occasionally. A sketch of the per-subprocess flow appears below.
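For concreteness, a minimal sketch of the per-subprocess flow just described (the parameter dict and queue name here are placeholders, not the actual values):
```python
from clearml import Task

# Each subprocess performs roughly this sequence, then exits.
task = Task.init(project_name="OMD", task_name="bla")
task.connect({"learning_rate": 0.1})  # placeholder parameters
cloned_task = Task.clone(source_task=task, name="Clone")
Task.enqueue(cloned_task.id, queue_name="some_queue")  # placeholder queue name
```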
Yes, I use a locally hosted server (SAIPS team).
SuccessfulKoala55 , I re-attached the log.