In the UI the value is the correct one (not empty, a string)
AgitatedDove14 Up! I would like to know if I should wait for the next release of trains or if I can already start implementing Azure support
Yes I did, I found the problem: docker-compose was using trains-server 0.15 because it didn't see the new version of trains-server. Hence I had trains-server 0.15 running with ES7.
-> I deleted all the containers and it successfully pulled trains-server 0.16. Now everything is running properly!
But clearml does read from env vars as well, right? It's not just delegating resolution to the AWS CLI, so it should be possible to specify the region to use for the logger, right?
Disclaimer: I didn't check that this reproduces the bug, but these are all the components that should reproduce it: a for loop creating figures and clearml logging them
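A minimal sketch of that repro, assuming matplotlib figures and the standard clearml Logger (project/task names below are placeholders):
```python
import matplotlib.pyplot as plt
from clearml import Task

task = Task.init(project_name="debug", task_name="figure-loop-repro")
logger = task.get_logger()

for i in range(5):
    fig = plt.figure()
    plt.plot(range(10), [x * i for x in range(10)])
    # each figure is reported to the server as a separate iteration of the same plot
    logger.report_matplotlib_figure(title="loop figure", series="demo", figure=fig, iteration=i)
    plt.close(fig)
```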
(btw, yes, I adapted the code to use Task.init(..., output_uri=...))
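For reference, a hedged sketch of what that Task.init call could look like (the output_uri value below is a placeholder, not the one actually used):
```python
from clearml import Task

task = Task.init(
    project_name="debug",
    task_name="training",
    output_uri="s3://my-bucket/clearml",  # placeholder destination for uploaded models/artifacts
)
```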
thanks for your help!
AgitatedDove14 one last question: how can I enforce a specific wheel to be installed?
I'd like to move to a setup where I don't need these tricks
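One way this is sometimes handled (an assumption on my side, not confirmed in this thread, and assuming "a specific wheel" means pinning an exact package version) is declaring the requirement programmatically before Task.init:
```python
from clearml import Task

# Pin an exact version so the agent installs precisely this package build.
# Package name and version are placeholders.
Task.add_requirements("torch", "1.7.1")

task = Task.init(project_name="debug", task_name="pinned-requirements")
```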
Hi SuccessfulKoala55, it's not really wrong, I just don't understand it: the docker image with the args after it
Ok yes, I get it. This info is also available at the very beginning of the logs, where the agent logs the full docker run command; is this docker_cmd a shorter version?
Yes AgitatedDove14!
Thanks SuccessfulKoala55! So CLEARML_NO_DEFAULT_SERVER=1 by default, right?
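If it helps, a minimal sketch of setting that flag in-process (the usual alternative is exporting it in the shell before running; the variable name is the one discussed above):
```python
import os

# Set before clearml is imported/used, so the SDK never falls back to the default demo server.
os.environ["CLEARML_NO_DEFAULT_SERVER"] = "1"

from clearml import Task  # imported after the env var on purpose
```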
Yes I agree, but I get a strange error when using dataloaders:
RuntimeError: [enforce fail at context_gpu.cu:323] error == cudaSuccess. 3 vs 0. Error at: /pytorch/caffe2/core/context_gpu.cu:323: initialization error
only when I use num_workers > 0
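A minimal sketch of the kind of setup that triggers this (placeholder dataset; the usual suspicion with this caffe2 CUDA error is CUDA being re-initialized in forked DataLoader worker processes):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# num_workers > 0: this is the case where the initialization error described above appears
loader = DataLoader(dataset, batch_size=8, num_workers=4)

# num_workers = 0: no worker processes are forked, so the error does not appear
loader_ok = DataLoader(dataset, batch_size=8, num_workers=0)

for batch, labels in loader_ok:
    pass  # training step would go here
```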
yes but they are in plain text and I would like to avoid that
So the new EventsIterator is responsible for the bug.
Is there a way for me to easily force the WebUI to always use the previous endpoint (v1.7)? I saw in the diff between v1.1.0 and v1.2.0 that the ES version was bumped to 7.16.2. I am using an external ES cluster, and its version is still 7.6.2. Can it be that the incompatibility comes from there? I'll update the cluster to make sure it's not the case
SuccessfulKoala55 Am I doing/saying something wrong regarding the problem of flushing every 5 seconds? (See my previous message.)
Ok AgitatedDove14 SuccessfulKoala55 I made some progress in my investigation:
I can pinpoint exactly the change that introduced the bug: it is the one introducing the endpoint "events.get_task_log" with min_version="2.9"
In the Firefox console > Network, I can edit an events.get_task_log request and change the URL from …/api/v2.9/events.get_task_log to …/api/v2.8/events.get_task_log (to use the endpoint "events.get_task_log", min_version="1.7"), and then all the logs are ...
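For reproducing the same comparison outside the browser, a hedged sketch hitting both API versions directly (host, credentials and task id are placeholders, not values from this thread):
```python
import requests

API_HOST = "http://localhost:8008"     # placeholder apiserver address
AUTH = ("ACCESS_KEY", "SECRET_KEY")    # placeholder credentials (HTTP basic auth)
TASK_ID = "0123456789abcdef"           # placeholder task id

for version in ("2.8", "2.9"):
    resp = requests.post(
        f"{API_HOST}/api/v{version}/events.get_task_log",
        json={"task": TASK_ID},
        auth=AUTH,
    )
    events = resp.json().get("data", {}).get("events", [])
    print(f"v{version}: status={resp.status_code}, events returned={len(events)}")
```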
Guys, the experiments I had running didn't fail, they just waited and reconnected. This is crazy cool!
Ok, but that means this cleanup code should live somewhere other than inside the task itself, right? Otherwise it won't be executed, since the task will be killed
Some more context: the second experiment finished and now, in the UI, in the Workers & Queues tab, I randomly see `trains-agent-1 | - | - | - | ...` and, after refreshing the page, `trains-agent-1 | long-experiment | 12h | 72000 |`
I made sure before deleting the old index that the number of docs matched
Should I try to disable dynamic mapping before doing the reindex operation?
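A hedged sketch of both steps against a plain Elasticsearch 7 REST API (host and index names are placeholders, not the actual server indices):
```python
import requests

ES = "http://localhost:9200"              # placeholder ES address
OLD, NEW = "events-old", "events-new"     # placeholder index names

# Create the destination index with dynamic mapping disabled, so the reindex
# cannot add unexpected fields on the fly.
requests.put(f"{ES}/{NEW}", json={"mappings": {"dynamic": False}}).raise_for_status()

# Reindex, then compare document counts before deleting the old index.
requests.post(
    f"{ES}/_reindex",
    params={"wait_for_completion": "true"},
    json={"source": {"index": OLD}, "dest": {"index": NEW}},
).raise_for_status()

old_count = requests.get(f"{ES}/{OLD}/_count").json()["count"]
new_count = requests.get(f"{ES}/{NEW}/_count").json()["count"]
assert old_count == new_count, f"doc counts differ: {old_count} vs {new_count}"
```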