It was working for me. Anyway, I modified the callback. Attached is the script that has the issue for me: whenever I add random_image_logger to the callbacks, it only logs some of the scalars for 1 epoch. It then is stuck and never recovers. When I remove random_image_logger, the scalars are correctly logged. Again, this is only on 1 computer; on the other computers we have, logging works perfectly fine.
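The attached script isn't included here, but roughly the callback looks something like this (a simplified sketch; the class name, log directory, and image size are just placeholders for illustration):

    import numpy as np
    import tensorflow as tf

    # Simplified stand-in for the callback described above; the real
    # attached script differs in its details.
    class RandomImageLogger(tf.keras.callbacks.Callback):
        def __init__(self, log_dir="./logs/images"):
            super().__init__()
            # File writer used for TensorBoard image summaries
            self.writer = tf.summary.create_file_writer(log_dir)

        def on_epoch_end(self, epoch, logs=None):
            # Write one random 64x64 RGB image per epoch; ClearML's
            # TensorBoard binding picks these up as debug samples.
            image = np.random.rand(1, 64, 64, 3).astype("float32")
            with self.writer.as_default():
                tf.summary.image("random_image", image, step=epoch)
            self.writer.flush()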
It looks like it creates a task_repository folder in the virtual environment folder. There is a way to specify your virtual environment folder, but I haven't found any way to specify the git directory.
The machine currently having the issue is on tensorboard==2.16.2
It's possible. Is there a way to just slow down or turn off the log streaming to see how it affects the API calls?
I didn't do a very scientific comparison, but the number of API calls did decrease substantially by turning off auto_connect_streams. It is probably about 100k API calls per day with 1 experiment running, where before it was maybe 300k API calls per day. That still seems like a lot when I only run 20-30 epochs in a day.
We are running the same code on multiple machines and it just randomly happens. Currently we are having the issue on 1 out of 4
There is clearly some connection to the ClearML server as it remains "running" the entire training session but there are no metrics or debug samples. And I see nothing in the logs to indicate there is an issue
Thank you! I think that is all I need to do
Yea, I am fine not having the console logging. My issue is that the scalars and debug images occasionally don't record to ClearML.
I found that setting store_uncommitted_code_diff: false instead of true seems to fix the issue
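For reference, that setting lives in clearml.conf (assuming the standard sdk.development section):

    sdk {
      development {
        # Don't store the uncommitted git diff with the task; setting this
        # to false is what seems to avoid the issue on the affected machine
        store_uncommitted_code_diff: false
      }
    }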
I will try with clearml==1.16.3rc2 and see if it still has the issue
The same training works sometimes. But I'm not sure how to troubleshoot when it stops logging the metrics
So I am only seeing values for the first epoch. It seems like it does not track all of them so maybe something is happening when it tries to log scalars.
I have seen it only log iterations, but setting task.set_initial_iteration(0) seemed to fix that, so it now seems to be logging the correct epoch.
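The call just goes right after Task.init (project/task names here are placeholders):

    from clearml import Task

    task = Task.init(project_name="my_project", task_name="my_experiment")
    # Reset the reported iteration offset so scalars start from epoch 0
    # instead of continuing from a previously stored iteration count
    task.set_initial_iteration(0)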
Tensorboard is correct and works. I have never seen an issue in the tensorboard logs
Not sure why that is related to saving images
No, it completes and exits the script.
I'm not sure how to even troubleshoot this.
When I try to abort an experiment, I get this in the log:
clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
but it does not stop anything; it just continues to run.
It seems similar to this issue: is it possible that saving too many model weights causes the metric logging thread to die?
Yes I see it in the terminal on the machine
I have file_history_size: 1000 and I still get images for the following epochs. But sometimes it seems like the UI limits the view to 32 images.
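As far as I can tell, that setting goes in clearml.conf under sdk.metrics:

    sdk {
      metrics {
        # Number of debug sample files kept per metric/variant combination;
        # raised from the default so images from every epoch are kept
        file_history_size: 1000
      }
    }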
This was on the same machine I am having issues with. It logs scalars correctly using the example code, but when I add in that callback, which just logs a random image to TensorBoard, I don't get any scalars logged.
I am still having this issue. An update: the "abort" does not work. Even though the state is correctly tracked in ClearML, when I try to abort the experiment through the UI it says it did, but the experiment remains running on the computer.
Will do! It probably won't be until next week. I don't plan on stopping this run to try it but will definitely follow up with my results.
Yea I think if we self-hosted I wouldn't have noticed it at all
Thanks! It looks like I can set auto_connect_streams = False in the task init, at least to try.
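Something like this (project/task names are placeholders); I think it can also take a dict of stream names if we only want to drop stdout:

    from clearml import Task

    task = Task.init(
        project_name="my_project",
        task_name="my_experiment",
        # Don't capture stdout/stderr, so the Keras progress bars are not
        # streamed to the ClearML console log
        auto_connect_streams=False,
    )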
We are using Keras, so it is logging progress bars by default, which I think we could turn off. I just wouldn't expect logging text to require so many API calls. Especially since they charge by API calls, I assumed it would be better managed.
They are TensorBoard images that are automagically being logged to debug samples.
Yes it is logging to the console. The script does hang whenever it completes all the epochs when it is having the issue.
I just used CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL. Can that be put in the clearml.conf? I didn't see a reference to it in the documentation.
I am using 1.15.0. Yes, I can try with auto_connect_streams set to True; I believe I will still have the issue.