Oh, so the task has an internal keepalive mechanism, and my calling time.sleep() for more than 2 hours prevents it from working?
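If the abort really is triggered by one long uninterrupted sleep, a possible workaround (just a sketch, with placeholder project/task names) is to break the wait into shorter intervals and report something lightweight in between, so the task keeps showing recent activity:

```python
from clearml import Task
import time

# Placeholder project/task names; this is a sketch of the workaround, not the actual script.
task = Task.init(project_name="examples", task_name="long wait keepalive")

total_wait = 3 * 60 * 60      # e.g. need to wait 3 hours overall
interval = 10 * 60            # wake up every 10 minutes instead of one long sleep

elapsed = 0
while elapsed < total_wait:
    time.sleep(interval)
    elapsed += interval
    # any lightweight report shows recent activity on the task
    task.get_logger().report_text("still waiting, %d seconds elapsed" % elapsed)
```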
@<1576381444509405184:profile|ManiacalLizard2>, thanks, that was my initial solution, but I had some trouble with reusing the previously created task for the scheduler when the process that made the call to TaskScheduler.add_task() was interrupted.
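For reference, here is a minimal sketch of the scheduler setup being described; the task ID and queue names are placeholders only, and the exact add_task() arguments may differ between ClearML SDK versions, so check the TaskScheduler docs for your version:

```python
from clearml.automation import TaskScheduler

# All IDs/queue names below are placeholders for illustration only.
scheduler = TaskScheduler(sync_frequency_minutes=15)
scheduler.add_task(
    schedule_task_id="<previously_created_task_id>",  # the task to re-launch on schedule
    queue="default",
    hour=2, minute=0,   # e.g. run daily at 02:00
    recurring=True,
)
# run the scheduler itself as a long-lived task on the services queue
scheduler.start_remotely(queue="services")
```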
The "new problem" was not being able to view the console, scalars, plots, debug samples of previous experiments (probably because the reference to them was in /usr/share/elasticsearch/data
, which I changed to /var/lib/elasticsearch/data
in the new installation, in an attempt to install it without having sudo permissions).
There is no chance of corrupting other experiments or databases?
This container may not have been introduced in this version yet; I don't see it in docker ps
Do you think updating to a newer version is likely to fix this?
As a temporary solution, does this sound reasonable: shut down the entire docker-compose, delete the leftover files using administrator permissions, and then bring it back up again?
WebApp: 1.9.1-312 • Server: 1.9.1-312 • API: 2.23
The value is the same.
As I mentioned before, the server version I'm working with does not have the async_delete container. Unfortunately, due to internal considerations, the version update will not take place in the near future, so I am having the system admin delete them for me manually every once in a while.
Keeping the current version and deleting manually
@<1523701070390366208:profile|CostlyOstrich36> It works! thanks!
OK, thanks. Just curious then: suppose you use the task for normal experiment tracking, you call Task.init() at the beginning as usual and train your model, your epochs are longer than 2 hours, and you only print/report stuff at epoch end, would this cause the task to abort too?
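To make the scenario concrete, here is a minimal sketch (placeholder names, dummy training step) of a loop that also reports inside the epoch, so there are no multi-hour stretches without any activity; the last line shows the epoch-end-only reporting pattern from the question:

```python
from clearml import Task
import time

# Placeholder names and a dummy training step, just to illustrate the timing question.
task = Task.init(project_name="examples", task_name="long epochs")
logger = task.get_logger()

def train_one_batch(step):
    time.sleep(0.01)          # stands in for real work
    return 1.0 / (step + 1)   # fake loss

for epoch in range(3):
    loss = 0.0
    for step in range(1000):
        loss = train_one_batch(step)
        # reporting inside the epoch means the task never goes hours without activity
        if step % 100 == 0:
            logger.report_scalar("train", "loss", value=loss, iteration=epoch * 1000 + step)
    # the pattern from the question: reporting only once per epoch
    logger.report_scalar("epoch", "loss", value=loss, iteration=epoch)
```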
Is there a way to view the version in the web GUI?
SDK 1.14.4
Server 1.14.1-451
I meant, as a temporary solution instead of upgrading
@<1523703097560403968:profile|CumbersomeCormorant74> Hi, thanks for the suggestion, but unfortunately, it did not work.
After some experimenting, it seems that the situation improves when I call task.mark_started(force=True) before each task.upload_artifact(), instead of just once at the beginning of the script.
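A small sketch of that pattern, with placeholder names and time.sleep() standing in for the long, non-reporting work between uploads in the real script:

```python
from clearml import Task
import time

# Placeholder names; time.sleep() stands in for the long, non-reporting work
# that happens between artifact uploads in the real script.
task = Task.init(project_name="examples", task_name="periodic artifact uploads")

for i in range(5):
    time.sleep(60)
    # "revive" the task right before each upload so it is back in a running state
    task.mark_started(force=True)
    task.upload_artifact(name="result_%d" % i, artifact_object={"iteration": i})
```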
It seems there are two approaches: either "revive" the task before each upload, or somehow keep it always "Running". Do you have an idea of how the second approach can be achieved? (I did not call task.close() or task.mark_*() anywhere.)