Also maybe we are not on the same page - by clean up, I mean kill a detached subprocess on the machine executing the agent
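Concretely, something like this is what I have in mind (just a sketch; psutil is my choice here, not something the agent itself uses, and agent_pid would be the PID of whatever spawned the detached subprocess):
` import psutil

def kill_detached_children(agent_pid: int) -> None:
    # Find everything forked from the given process and terminate it,
    # escalating to kill() for anything that refuses to exit.
    parent = psutil.Process(agent_pid)
    children = parent.children(recursive=True)
    for child in children:
        child.terminate()
    _gone, alive = psutil.wait_procs(children, timeout=5)
    for proc in alive:
        proc.kill() `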
Yes, super thanks AgitatedDove14 !
SuccessfulKoala55 , this is not the exact corresponding request (I refreshed the tab since then), but the request is an events.get_task_logs, with the following content:
I am using clearml_agent v1.0.0 and clearml 0.17.5 btw
So previous_task actually ignored the output_uri
nvm, the bug might be on my side. I will open an issue if I find an easily reproducible example
Well actually I do see many errors like that in the browser console:
Done! Also I tried to use the git credential cache ( https://git-scm.com/docs/git-credential-cache ) as a workaround (hoping that the first time it clones the experiment repo, it caches the creds for the next times), but I then get a different error: fatal: unable to find a suitable socket path; use --socket
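For the record, this is roughly how I enabled it (hedged: the timeout and socket path are arbitrary values I picked, and the socket directory has to exist and be private inside the container):
` import subprocess

# Keep HTTPS credentials in memory for an hour; the explicit --socket is an attempt
# to work around the "unable to find a suitable socket path" error above.
subprocess.run(
    [
        "git", "config", "--global", "credential.helper",
        "cache --timeout=3600 --socket /root/.git-credential-cache/socket",
    ],
    check=True,
) `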
btw I monkey patched ignite's global_step_from_engine function to print the iteration and passed the modified function to ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine)). It prints the correct iteration number when calling ClearMLLogger.OutputHandler.__call__ .
` def __call__(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:
    if not isinstance(logger, ClearMLLogger):
        ... `
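For reference, the patch itself looks roughly like this (a sketch: trainer, the tag and the metric name come from my setup, and the exact import paths may differ between ignite versions):
` from ignite.contrib.handlers.clearml_logger import ClearMLLogger
from ignite.engine import Events
from ignite.handlers import global_step_from_engine

def patched_global_step_from_engine(engine):
    # Wrap ignite's helper so we can see which step is handed to ClearML
    original = global_step_from_engine(engine)

    def wrapper(_engine, event_name):
        step = original(_engine, event_name)
        print(f"global_step_transform -> {step}")
        return step

    return wrapper

clearml_logger = ClearMLLogger(project_name="test", task_name="ignite-step-debug")
clearml_logger.attach_output_handler(
    trainer,  # the existing ignite Engine used for training
    event_name=Events.ITERATION_COMPLETED,
    tag="training",
    metric_names=["loss"],
    global_step_transform=patched_global_step_from_engine(trainer),
) `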
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit
AgitatedDove14 I do continue an aborted Task, yes - so I shouldn't even need to call the task.set_initial_iteration function, interesting! Do you have any idea what could be the reason for the behavior I am observing? I am trying to find ways to debug it
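For context, this is roughly how I resume the run (a sketch; the project/task names are placeholders and I'm assuming continue_last_task is the right flag here):
` from clearml import Task

# Resume the previously aborted run instead of creating a new task
task = Task.init(
    project_name="my_project",
    task_name="my_task",
    continue_last_task=True,
)

# Based on the above this should be redundant, since the iteration offset is
# supposed to be picked up automatically when continuing a task:
# task.set_initial_iteration(0) `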
That's how I would do it, but maybe the folks from allegro-ai can come up with a better approach 👍
Thanks for the explanations,
Yes that was the case. This is also what I would think, although I double checked yesterday:
- I create a task on my local machine with trains 0.16.2rc0
- This task calls task.execute_remotely()
- The task is sent to an agent running with 0.16
- The agent installs trains 0.16.2rc0
- The agent runs the task, clones it and enqueues the cloned task
- The cloned task fails because it has no hyper-parameters/args section (I can see that in the UI)
- When I clone the task manually usin...
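In code, that flow is roughly the following (a stripped-down sketch; the project and queue names are placeholders):
` from trains import Task

task = Task.init(project_name="debug", task_name="clone-repro")
# Everything below runs on the agent once the task has been enqueued remotely
task.execute_remotely(queue_name="default")

cloned = Task.clone(source_task=task)
print(cloned.get_parameters())  # comes back empty, matching the missing args section in the UI
Task.enqueue(cloned, queue_name="default") `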
I think clearml-agent tries to execute /usr/bin/python3.6 to start the task, instead of using the python that was used to start clearml-agent
More context:
trains, trains-agent and trains-server are all 0.16; Session.api_version -> 2.9 (both when executed in trains-agent and in the local script)
And since I ran the task locally with python3.9, it used that version in the docker container
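If I understand the agent correctly, one workaround would be pinning the interpreter in the agent configuration (the agent.python_binary key, if I'm reading the config reference right; the path below is just an example of what exists in the image):
` agent {
    # Force this interpreter inside the container instead of the python
    # version recorded from my local run
    python_binary: "/usr/bin/python3.6"
} `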
Hi @<1523701205467926528:profile|AgitatedDove14> , I want to circle back on this issue. It is still relevant, and I could collect the following on an ec2 instance where a clearml-agent is running a stuck task:
- There seems to be a problem with multiprocessing: although I stopped the task, there are still many processes forked from the main training process. I guess these are zombies. Please check the htop tree.
- There is a memory leak somewhere, please see the screenshot of datadog mem...
For some reason the configuration object gets updated at runtime to:
` resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = "" `
Thanks for your inputs, I will try that! For completion, here is how I retrieve the parameters:
` from trains import Task

task = Task.init("test", "test")
parent_task = Task.get_task(task.parent)  # the task that registered the artifact
task.get_logger().report_text(task.get_parameters())
artifact_name = task.get_parameter("General/artifact_name")
artifact = parent_task.artifacts[artifact_name].get() `
Looking at the source code, it seems like I should do data_processing_task._artifact_manager.flush() to make sure I have the latest version of the artifacts in the task, right?
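For reference, a rough sketch of both options (the artifact name and object below are placeholders; wait_on_upload is the public parameter I found, assuming the installed clearml version supports it):
` # Option 1: force pending uploads through the private manager (what I linked above)
data_processing_task._artifact_manager.flush()

# Option 2: block on the upload itself when registering the artifact
data_processing_task.upload_artifact(
    name="processed_data",         # placeholder name
    artifact_object=processed_df,  # placeholder object
    wait_on_upload=True,
) `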
AppetizingMouse58 Yes and yes
I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird
this is the last line, same as before
Hi AgitatedDove14 , how should we proceed to fix this bug? Should I open an issue on GitHub? Should I try to make a minimal reproducible example? It's blocking me atm
I don't think there is an example for this use case in the repo currently, but the code should be fairly simple (below is a rough draft of what it could look like)
` import time

from clearml import Task

controller_task = Task.init(...)
controller_task.execute_remotely(queue_name="services", clone=False, exit_process=True)

while True:
    periodic_task = Task.clone(template_task_id)
    # Change parameters of {periodic_task} if necessary
    Task.enqueue(periodic_task, queue_name="default")
    time.sleep(TRIGGER_TASK_INTERVAL_SECS) `
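If I got the idea right, the nice part of this pattern is that the controller itself just sits in the services queue (so it barely consumes resources there), while each periodic clone is picked up by a regular worker from the default queue.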