Ok AgitatedDove14 SuccessfulKoala55 I made some progress in my investigation:
I can pinpoint exactly the change that introduced the bug: it is the one changing the endpoint "events.get_task_log" to min_version="2.9"
In the Firefox console > Network tab, I can edit an events.get_task_log request
and change the URL from …/api/v2.9/events.get_task_log
to …/api/v2.8/events.get_task_log
(to use the endpoint "events.get_task_log", min_version="1.7"),
and then all the logs are ...
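For reference, this is roughly how I reproduce it outside the browser - a minimal sketch only: the server URL, task ID and the Authorization header are placeholders copied from the request shown in the Network tab, and I'm assuming the endpoint just takes a task field:
```python
import requests

# Hypothetical values - replace with your API server, a real task ID and the
# Authorization header copied from the browser request.
SERVER = "http://localhost:8008"
TASK_ID = "<task-id>"
HEADERS = {"Authorization": "Bearer <token-from-browser>"}

for version in ("2.8", "2.9"):
    resp = requests.post(
        f"{SERVER}/api/v{version}/events.get_task_log",
        json={"task": TASK_ID},
        headers=HEADERS,
    )
    events = resp.json().get("data", {}).get("events", [])
    print(f"v{version}: HTTP {resp.status_code}, {len(events)} log events returned")
```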
Guys the experiments I had running didn't fail, they just waited and reconnected, this is crazy cool
Ok, but that means this cleanup code should live somewhere other than inside the task itself, right? Otherwise it won't be executed, since the task will be killed
Some more context: the second experiment finished and now, in the UI, in the Workers & Queues tab, I randomly see either
trains-agent-1 | - | - | - | ...
or, after refreshing the page,
trains-agent-1 | long-experiment | 12h | 72000 |
I made sure before deleting the old index that the number of docs matched
Should I try to disable dynamic mapping before doing the reindex operation?
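Something like this is what I have in mind - a rough sketch against the ES REST API via Python requests, with placeholder index names, and whether to use "dynamic": false or "strict" is exactly my question:
```python
import requests

ES = "http://localhost:9200"
OLD_INDEX = "events-log-old"   # placeholder name
NEW_INDEX = "events-log-new"   # placeholder name

# Create the destination index with dynamic mapping disabled, so reindexed
# documents cannot add unexpected fields to the mapping.
requests.put(
    f"{ES}/{NEW_INDEX}",
    json={"mappings": {"dynamic": False}},
)

# Copy all documents from the old index into the new one.
resp = requests.post(
    f"{ES}/_reindex?wait_for_completion=true",
    json={"source": {"index": OLD_INDEX}, "dest": {"index": NEW_INDEX}},
)
print(resp.json())
```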
my agents are all on 0.16 and I install trains 0.16rc2 in each Task executed by the agent
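In case it helps, one way to pin it from the task code is something like this - assuming Task.add_requirements is available in this version (project/task names are placeholders):
```python
from trains import Task

# Must be called before Task.init so the agent installs this exact package
# version when it reproduces the task (assuming Task.add_requirements exists
# in the installed SDK version).
Task.add_requirements("trains", "0.16rc2")

task = Task.init(project_name="my-project", task_name="my-task")  # placeholder names
```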
how would it interact with the clearml-server api service? would it be completely transparent?
That said, you might have accessed the artifacts before any of them were registered
I called task.wait_for_status() to make sure the task is done
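Roughly what I do - a sketch; the task ID and artifact name are placeholders:
```python
from clearml import Task

task = Task.get_task(task_id="<task-id>")  # placeholder task ID
task.wait_for_status()                     # blocks until the task leaves the running state
task.reload()                              # refresh the local copy, including registered artifacts

print(list(task.artifacts.keys()))
local_path = task.artifacts["my_artifact"].get_local_copy()  # hypothetical artifact name
```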
Yes, but I am not certain how: I just deleted the /data folder and restarted the server
So I created a symlink /opt/trains/data -> /data
it worked for the other folder, so I assume yes --> I archived /opt/trains/data/mongo, sent the archive via scp, unarchived it, fixed the permissions, and now it works
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False}
seems to have a positive impact - it is running now, I will confirm in a bit
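For reference, this is the kind of Task.init call I mean (project/task names are placeholders):
```python
from clearml import Task

# Disable the matplotlib and joblib bindings so the SDK does not hook into
# them automatically; everything else keeps its default auto-logging.
task = Task.init(
    project_name="my-project",   # placeholder
    task_name="debug-autolog",   # placeholder
    auto_connect_frameworks={"matplotlib": False, "joblib": False},
)
```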
you mean to run it on the CI machine ?
yes
That should not happen, no? Maybe there is a bug that needs fixing in clearml-agent?
It's just to test that the logic executed inside if not Task.running_locally(): is correct
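i.e. something like this pattern - a minimal sketch, names are placeholders:
```python
from clearml import Task

task = Task.init(project_name="ci", task_name="remote-logic-check")  # placeholder names

if not Task.running_locally():
    # This branch only runs when the task is executed by an agent
    # (e.g. on the CI machine), which is exactly the logic I want to test.
    task.get_logger().report_text("running remotely via an agent")
else:
    task.get_logger().report_text("running locally")
```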
AppetizingMouse58 btw I had to delete the old logs index before creating the alias, otherwise ES won't let me create an alias with the same name as an existing index
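For the record, the order that worked for me looks roughly like this (a sketch via Python requests; index/alias names are placeholders):
```python
import requests

ES = "http://localhost:9200"
OLD_INDEX = "events-log-old"   # original index, whose name the alias will reuse (placeholder)
NEW_INDEX = "events-log-new"   # the reindexed copy (placeholder)

# 1. Delete the old index first - an alias cannot share its name with an existing index.
requests.delete(f"{ES}/{OLD_INDEX}")

# 2. Now the alias with the old name can point at the new index.
requests.post(
    f"{ES}/_aliases",
    json={"actions": [{"add": {"index": NEW_INDEX, "alias": OLD_INDEX}}]},
)
```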
I think that somehow somewhere a reference to the figure is still living, so plt.close("all") and gc cannot free the figure and it ends up accumulating. I don't know where yet
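What I'm using to check (a small debugging sketch):
```python
import gc
import matplotlib.pyplot as plt
from matplotlib.figure import Figure

plt.close("all")
gc.collect()

# Figures still registered with pyplot's figure manager.
print("pyplot figure numbers:", plt.get_fignums())

# Figure objects still reachable anywhere in the process, even after close().
alive = [obj for obj in gc.get_objects() if isinstance(obj, Figure)]
print("live Figure objects:", len(alive))
for fig in alive[:3]:
    # Peek at what is still holding a reference to each figure.
    print(type(fig), [type(r) for r in gc.get_referrers(fig)[:3]])
```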
Also, what is the benefit of having index.number_of_shards = 1 by default
for the metrics and the logs indices? Having more would allow scaling and later moving them to separate nodes if needed - with the default heap size being 2 GB, that should be possible, no?
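To put some numbers behind that, I'm looking at the current shard and index sizes with something like this (sketch via Python requests against the ES _cat API):
```python
import requests

ES = "http://localhost:9200"

# List every shard with its store size, to see how big the single shard
# of the metrics/logs indices has become.
resp = requests.get(f"{ES}/_cat/shards", params={"v": "true", "s": "store:desc"})
print(resp.text)

# Per-index doc counts and sizes, for the same purpose.
resp = requests.get(f"{ES}/_cat/indices", params={"v": "true", "s": "store.size:desc"})
print(resp.text)
```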
I am using 0.17.5, it could be either a bug in ignite or indeed a delay on the send side. I will try to build a simple reproducible example to understand the cause
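For the reproducible example, I'll probably start by forcing a flush before checking the UI, to rule out the send delay - a sketch, assuming a task is already initialised in the process and that flush accepts wait_for_uploads:
```python
from clearml import Task

task = Task.current_task()  # assumes Task.init was already called in this process

# ... report scalars from the ignite handlers here ...

# Force everything queued in the SDK to be sent before checking the UI,
# to distinguish an ignite-side bug from a reporting delay on the send side.
task.flush(wait_for_uploads=True)
```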
Alright, thanks for the answer! Seems legit then
Indeed, I actually had the old configuration that was not JSON - I converted it to JSON, and now it works
So it seems like it doesn't copy /root/clearml.conf and it doesn't pass the environment variables (CLEARML_API_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY)
Ok, now I would like to copy from one machine to another via scp, so I copied the whole /opt/trains/data folder, but I got the following errors: