I have the same problem, but not only with subprojects: I get this blank overview tab for all projects, as shown in the screenshot. It only worked for one project that I created one or two weeks ago under 0.17
Ok, I guess I'll just delete the whole loss series. Thanks!
Guys the experiments I had running didn't fail, they just waited and reconnected, this is crazy cool
but according to the disk graphs, the OS disk is being used, but not the data disk
Hi CostlyOstrich36, this weekend I took a look at the diffs with the previous version ( https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0# ) and I saw several changes related to the scrolling/logging:
apiserver/bll/event/log_events_iterator.py
apiserver/bll/event/events_iterator.py
apiserver/config/default/services/_mongo.conf
apiserver/database/model/base.py
apiserver/services/events.py
I suspect that one of these changes might be responsible ...
I ended up dropping omegaconf altogether
So two possible cases for trains-agent-1, either:
it picks a new experiment -> the "workers" tab randomly shows one of the two experiments
there is no new experiment in the default queue to start -> the "workers" tab randomly shows either no experiment or the one it is already running
I also tried task.set_initial_iteration(-task.data.last_iteration), hoping it would counteract the bug, but it didn't work
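For context, this is roughly what that attempt looked like (project/task names are placeholders, not my actual setup):
```python
from clearml import Task

# Continue the previous task; the reported scalars were shifted by the last
# recorded iteration, so I tried to cancel the offset by passing its negative.
# (Placeholder project/task names; this did not fix the shift for me.)
task = Task.init(
    project_name="my_project",
    task_name="my_experiment",
    continue_last_task=True,
)
task.set_initial_iteration(-task.data.last_iteration)
```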
Actually I think I am approaching the problem from the wrong angle
AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats! Was this bug fixed with this new version?
Hi CostlyOstrich36, one more observation: when I don't open the experiment in the web UI before it finishes, I get all the logs correctly. It is when I open the experiment in the web UI while it is running that I don't see all the logs.
So it looks like there is a caching effect: the logs are retrieved only once, when I open the experiment for the first time, and not (or only rarely) afterwards. Is that possible?
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
And I do that each time I want to create a subtask. This way I am sure to retrieve the task if it already exists
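Roughly like this, a minimal sketch of the get-or-create pattern I mean (names are placeholders; note that Task.get_tasks treats task_name as a pattern, so an exact name works best):
```python
from clearml import Task

def get_or_create_subtask(project_name, task_name):
    # Look for an existing task with this project/name first and reuse it;
    # only create a new one when nothing matches.
    matches = Task.get_tasks(project_name=project_name, task_name=task_name)
    if matches:
        return matches[0]
    return Task.create(project_name=project_name, task_name=task_name)
```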
So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117. It just happens that, surprisingly, this wheel doesn't work on EC2 g5 instances. Either I'll hardcode the correct wheel or I'll upgrade torch to 1.13.0
Alright, the experiment finished properly (all models uploaded). I will restart it to check again, but it seems like the bug was introduced after that
I am sorry to give information that is not very precise, but it's the best I can do. Is this bug happening only to me?
Oh nice, thanks for pointing this out!
I am running on bare metal, and cuda seems to be installed at /usr/lib/x86_64-linux-gnu/libcuda.so.460.39
Ok, now I actually remember why I used _update_requirements instead of add_requirements: the former overwrites all the others, while the latter only adds to the already detected packages. Since my deps are listed in the dependencies of my setup.py, I don't want clearml to list the dependencies of the current environment
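In code, the difference looks roughly like this (just a sketch with placeholder package names; _update_requirements is a private method, so it may change between versions):
```python
from clearml import Task

# add_requirements (called before Task.init) only appends to whatever clearml
# auto-detects, so the current environment's packages would still be listed:
# Task.add_requirements("my-package", "==1.2.3")

task = Task.init(project_name="my_project", task_name="my_experiment")

# The private _update_requirements replaces the detected list entirely, which
# is what I want since my real dependencies live in setup.py:
task._update_requirements(["my-package==1.2.3"])
```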
Thanks TimelyPenguin76 and AgitatedDove14! I would like to delete the artifacts/models related to the old archived experiments, but they are stored on S3. Would that be possible?
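Something along these lines is what I have in mind, purely as a sketch: it assumes the artifact URLs are plain s3:// URIs, that boto3 credentials with delete permission are configured, and it uses a placeholder project name:
```python
from urllib.parse import urlparse

import boto3
from clearml import Task

s3 = boto3.client("s3")

# Iterate over archived experiments in a project and delete their S3-hosted
# artifacts. Placeholder project name; assumes s3:// artifact URLs and
# credentials that are allowed to delete the objects.
for task in Task.get_tasks(project_name="my_project",
                           task_filter={"system_tags": ["archived"]}):
    for artifact in task.artifacts.values():
        parsed = urlparse(artifact.url)
        if parsed.scheme == "s3":
            s3.delete_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
```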
Woohoo! Thanks!
Also maybe we are not on the same page - by clean up, I mean kill a detached subprocess on the machine executing the agent
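To illustrate, this is the kind of cleanup I mean, sketched with psutil (the PID would have to come from whatever launched the subprocess):
```python
import psutil

def kill_detached(pid):
    # Terminate a detached subprocess (and any children it spawned) on the
    # machine running the agent. psutil is just one way to do this.
    try:
        proc = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return
    for child in proc.children(recursive=True):
        child.terminate()
    proc.terminate()
```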
I think it comes from the web UI of clearml-server version 1.2.0, because I didn't change anything else