
Still getting the same error, it is not taken into account 🤔
That's how I would do it, maybe the guys from allegro-ai can come up with a better approach 👍
line 13 is empty 🤔
The parent task is a data_processing task, so I retrieve it in order to then do data_processed = parent_task.artifacts["data_processed"]
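For clarity, here is a minimal sketch of that pattern (assuming the running task has the data_processing task set as its parent; everything except the "data_processed" artifact name is a placeholder):
```python
from clearml import Task

# Grab the parent (data_processing) task of the currently running task
current_task = Task.current_task()
parent_task = Task.get_task(task_id=current_task.parent)

# Fetch (download + deserialize) the artifact registered by the parent task
data_processed = parent_task.artifacts["data_processed"].get()
```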
trains-agent daemon --gpus 0 --queue default & trains-agent daemon --gpus 1 --queue default &
AgitatedDove14, my “uncommitted changes” ends with...
if __name__ == "__main__":
    task = clearml.Task.get_task(clearml.config.get_remote_task_id())
    task.connect(config)
    run()
from clearml import Task
Task.init()
that would work for pytorch and clearml yes, but what about my local package?
I opened an issue ( https://github.com/pytorch/ignite/issues/2343 ) in ignite’s repo and a PR ( https://github.com/pytorch/ignite/pull/2344 ), could you please have a look? There might be a bug in clearml Task.init
in distributed envs
But we can easily extend, right?
I killed both trains-agent daemons and restarted one to have a clean start. This way it correctly spins up docker containers for services tasks. So the problem probably appears when an error occurs while setting up a task: the agent cannot go back to the main task. I would need to do some tests to validate that hypothesis though
Thanks AgitatedDove14! I created a project with a default output destination pointing to an S3 bucket, but I don't have local access to this bucket (only the agents have access to it, for security reasons). Because of that, I cannot create a task in this project programmatically from my machine: it tries to access the bucket and fails. And there is no easy way to change the default output location (not in the web UI, not in the SDK)
So the new EventsIterator is responsible for the bug.
Is there a way for me to easily force the WebUI to always use the previous endpoint (v1.7)? I saw in the diff between v1.1.0 and v1.2.0 that the ES version was bumped to 7.16.2. I am using an external ES cluster, and its version is still 7.6.2. Could the incompatibility come from there? I’ll update the cluster to make sure that’s not the case
because at some point it introduces too much overhead I guess
on /data or /opt/clearml? these are two different disks
And after the update, the loss graph appears
you mean to run it on the CI machine?
yes
That should not happen, no? Maybe there is a bug that needs fixing on clearml-agent ?
It's just to test that the logic executed in if not Task.running_locally() is correct
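For context, a minimal sketch of the kind of guard I mean (project/task names and the helper are placeholders):
```python
from clearml import Task


def setup_remote_resources():
    # placeholder for the remote-only logic being tested
    print("running remote-only setup")


task = Task.init(project_name="examples", task_name="remote-only logic")  # placeholder names

if not Task.running_locally():
    # Only executed when the task runs on an agent - this is the branch I want to verify
    setup_remote_resources()
```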
Here are the logs of the agent :)
```
(base) user@worker:~$ tail -f /tmp/.clearml_agent_daemon_outjdups8t2.txt
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
+----------------------------------+--------+-------+
| id                               | name   | tags  |
+----------------------------------+--------+-------+
| 54e4a62a402d5135612ba7b12cfe4e57 | docker |       |
+----------------------------------+--------+-------+
Starting infinite tas...
```
I am looking for a way to gracefully stop the task (clean up artifacts, shutdown backend service) on the agent
AgitatedDove14 Yes exactly, I tried the fix suggested in the GitHub issue (urllib3>=1.25.4)
and the ImportError disappeared 🙂
Yes, in the Task being executed in the agents, I have:
from trains import Task
task = Task.init(...)
task.get_logger().report_text(str(task.get_parameters()))
sure, will be happy to debug that 🙂
AgitatedDove14 I do continue an aborted Task, yes - so I shouldn’t even need to call the task.set_initial_iteration function, interesting! Do you have any idea what could cause the behavior I am observing? I am trying to find ways to debug it
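For reference, roughly how I resume the aborted task, with the set_initial_iteration call in question (project/task names and the offset are placeholders):
```python
from clearml import Task

# Resume the previously aborted task instead of starting a new one
task = Task.init(
    project_name="examples",   # placeholder
    task_name="training",      # placeholder
    continue_last_task=True,
)

# The call that apparently shouldn't be needed when continuing an aborted task:
# offsets reported iterations so scalars continue from where the previous run stopped
task.set_initial_iteration(10000)  # placeholder offset
```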
ok, and if that's not the case, it will fall back to 3.8, right? Would it be possible to support such a use case? (have the clearml-agent set up a different python version when a task needs it?)
mmmh good point actually, I didn’t think about it
Awesome! (Broken link in migration guide, step 3: https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/ )
Also maybe we are not on the same page - by clean up, I mean kill a detached subprocess on the machine executing the agent
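Something along these lines is what I have in mind (the service command and the signal handling are illustrative assumptions on my side, not ClearML APIs):
```python
import atexit
import signal
import subprocess
import sys

# Start the detached backend service (the command is just an example)
service = subprocess.Popen(["my_backend_service", "--port", "8080"])


def _cleanup(*_):
    # Terminate the detached subprocess when the task process stops on the agent
    if service.poll() is None:  # still running
        service.terminate()
        service.wait(timeout=10)


atexit.register(_cleanup)
# Assumption: the agent sends SIGTERM to the task process when the task is aborted
signal.signal(signal.SIGTERM, lambda signum, frame: (_cleanup(), sys.exit(0)))
```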
but then why do I have to do task.connect_configuration(read_yaml(conf_path))._to_dict() ?
Why not simply task.connect_configuration(read_yaml(conf_path)) ?
I mean, what is the benefit of returning ProxyDictPostWrite instead of a plain dict?
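For concreteness, the pattern I'm asking about (read_yaml below is a stand-in for my own helper, and the path and project/task names are placeholders):
```python
import yaml
from clearml import Task


def read_yaml(path):
    # minimal stand-in for my read_yaml helper: YAML file -> dict
    with open(path) as f:
        return yaml.safe_load(f)


task = Task.init(project_name="examples", task_name="config demo")  # placeholder names

conf = task.connect_configuration(read_yaml("config.yaml"))  # comes back as a ProxyDictPostWrite, not a dict
config_dict = conf._to_dict()  # the extra step I'm wondering about
```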