I fixed it, will push a fix in pytorch-ignite 🙂
AgitatedDove14 one last question: how can I enforce a specific wheel to be installed?
That said, you might have accessed the artifacts before any of them were registered
I called task.wait_for_status() to make sure the task is done
Thanks AgitatedDove14 !
Could we add this task.refresh() to the docs? Might be helpful for other users as well 🙂
OK! Maybe there is a middle ground: for artifacts already registered, simply return the entry; for artifacts not yet registered, contact the server to retrieve them
This is the issue, I will make sure wait_for_status() calls reload at the end, so when the function returns you have the updated object
That sounds awesome! It will definitely fix my problem 🙂
In the meantime, I now do:
task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get()
But I still get KeyError: 'output'
... Is that normal? Will it work if I replace the second line with task.refresh()?
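For reference, a minimal sketch of the flow discussed above, with the private _artifacts_manager call swapped for a plain reload of the task object (assuming Task.reload() refreshes the artifact registry; the task id is a placeholder):
```python
# Sketch of the workaround above, with the private flush call replaced by a public
# reload of the task state (assumption: reload() is enough to pick up newly
# registered artifacts once the task has finished).
from clearml import Task

task = Task.get_task(task_id="<task-id>")   # placeholder id
task.wait_for_status()                      # block until the task reaches a final status
task.reload()                               # refresh the local task object from the server
output = task.artifacts["output"].get()     # the "output" artifact should now be resolvable
```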
I have CUDA 11.0 installed, but on another machine with 11.0 installed as well, trains downloads torch for CUDA 10.1. I guess this is because no wheel exists for torch==1.3.1 and CUDA 11.0
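On the wheel question, one mechanism that should work is pinning the requirement explicitly before Task.init() (a sketch; the version string below is only a placeholder, not necessarily a build that exists for your CUDA version):
```python
# Sketch: pin the exact torch build the agent should install, instead of letting it
# resolve a wheel from the detected CUDA version. The version string is a placeholder.
from clearml import Task

Task.add_requirements("torch", "==1.7.1")   # must be called before Task.init()
task = Task.init(project_name="examples", task_name="pinned torch run")
```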
mmmh good point actually, I didn’t think about it
Because it lives behind a VPN and GitHub workers don’t have access to it
I’d like to move to a setup where I don’t need these tricks
is it different from Task.set_offline(True)?
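(For context, a rough sketch of the offline mode being compared against; the session path is a placeholder:)
```python
# Sketch of offline mode: nothing is sent to a server during the run, and the
# recorded session can be imported into a real server later.
from clearml import Task

Task.set_offline(offline_mode=True)                  # switch the SDK to offline mode
task = Task.init(project_name="ci", task_name="offline run")
# ... run the code under test ...
task.close()                                         # writes the offline session locally

# later, optionally import the recorded session into a real server:
# Task.import_offline_session("/path/to/offline_session.zip")   # placeholder path
```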
And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
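A rough sketch of the rank-0-only variant being asked about, based on pytorch-ignite's ClearML integration (the import path and handler arguments are assumptions about the ignite version in use):
```python
# Sketch: attach the ClearMLLogger only on the main process (rank 0) of a
# distributed run; other ranks skip the attachment entirely.
import ignite.distributed as idist
from ignite.contrib.handlers.clearml_logger import ClearMLLogger, OutputHandler
from ignite.engine import Events

def setup_clearml_logging(trainer):
    if idist.get_rank() != 0:
        return None  # non-main processes do not report to ClearML
    logger = ClearMLLogger(project_name="examples", task_name="distributed run")
    logger.attach(
        trainer,
        log_handler=OutputHandler(tag="training", output_transform=lambda loss: {"loss": loss}),
        event_name=Events.ITERATION_COMPLETED,
    )
    return logger
```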
and just run the same code I run in production
even if I move the GitHub workers internally where they could have access to the prod server, I am not sure I would like that, because it would pile up unnecessary test data in the prod server
Hi SuccessfulKoala55, how can I know if I am logged in via this free access mode? I assume it is the case, since on the login page I only see a login field, not a password field
so that any error that could arise from communication with the server could be tested
In my CI tests, I want to reproduce a run in an agent, because the env changes and some things break in agents but not locally
This is not the case: I downloaded it and got a CUDA error at runtime
I think the best-case scenario would be for ClearML to maintain a GitHub Action that sets up a dummy clearml-server, so that anyone can use it as a basis to run their tests: they would just have to change the URL of the server to the local one started by the action and could test all their code seamlessly, wdyt?
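To make that concrete, the test-side change could be as small as overriding the server endpoints before the SDK is imported; the hosts, ports and credentials below assume a default local clearml-server layout and are placeholders:
```python
# Sketch: point the clearml SDK at a throwaway local server for CI runs.
# Hosts, ports and credentials below are placeholders for a locally started
# clearml-server (e.g. brought up by the CI job itself).
import os

os.environ["CLEARML_API_HOST"] = "http://localhost:8008"
os.environ["CLEARML_WEB_HOST"] = "http://localhost:8080"
os.environ["CLEARML_FILES_HOST"] = "http://localhost:8081"
os.environ["CLEARML_API_ACCESS_KEY"] = "ci-access-key"      # placeholder
os.environ["CLEARML_API_SECRET_KEY"] = "ci-secret-key"      # placeholder

from clearml import Task  # import after the environment is configured

task = Task.init(project_name="ci-tests", task_name="smoke test")
```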
and I didn't have this problem before, because when cu117 wheels were not available, the agent would pick the wheel with the closest cu version and fall back to 1.11.0+cu115, and that one was working
Also I can simply delete the /elastic_7 folder, I don’t use it anymore (I have a remote ES cluster). In that case, I guess I would have enough space?
Hey @<1523701205467926528:profile|AgitatedDove14> , Actually I just realised that I was confused by the fact that when the task is reset, it disappears because of the sorting, making it seem like it was deleted. I think it's a UX issue. When I click on reset:
- The popup shows "Deleting 100%"
- The task disappears from the list of tasks because of the sorting
This led me to think that there was a bug and the task was deleted
Ok AgitatedDove14 SuccessfulKoala55 I made some progress in my investigation:
I can pinpoint exactly the change that introduced the bug: it is the one changing the endpoint "events.get_task_log", min_version="2.9"
In the Firefox console > Network, I can edit an events.get_task_log request and change the URL from …/api/v2.9/events.get_task_log to …/api/v2.8/events.get_task_log (to use the endpoint "events.get_task_log", min_version="1.7") and then all the logs are ...
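For anyone who wants to reproduce the comparison outside the browser console, a rough sketch (host, session token and task id are placeholders, and the data.events response layout is an assumption about the API wrapper):
```python
# Sketch: call the same events.get_task_log endpoint under both API versions and
# compare how many log events each one returns.
import requests

API_HOST = "http://localhost:8008"   # assumption: default clearml-server API port
TOKEN = "<session-token>"            # placeholder, e.g. a token obtained via auth.login
TASK_ID = "<task-id>"                # placeholder

headers = {"Authorization": f"Bearer {TOKEN}"}
payload = {"task": TASK_ID}

for version in ("2.8", "2.9"):
    resp = requests.post(f"{API_HOST}/api/v{version}/events.get_task_log",
                         json=payload, headers=headers)
    events = resp.json().get("data", {}).get("events", [])
    print(f"v{version}: HTTP {resp.status_code}, {len(events)} log events returned")
```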
ok, so even if that guy is attached, it doesn’t report the scalars
ExcitedFish86 I have several machines with different CUDA driver/runtime versions; that is why you might be confused, as I am referring to one or another 🙂
I also ran sudo apt install nvidia-cuda-toolkit
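(For illustration, a quick sanity check along these lines can confirm the driver/runtime mismatch on each machine; this is a generic torch snippet, not something from the thread:)
```python
# Quick sanity check: compare the CUDA runtime this torch build was compiled
# against with what the installed driver actually exposes at runtime.
import torch

print("torch version:", torch.__version__)
print("built with CUDA runtime:", torch.version.cuda)
print("CUDA usable with current driver:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```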