Another thing I notice is that aborting the experiment does not work when this is happening. It just continues to run
I'm not sure if it still reports logs. But it will continue running on the machine
Then we also connect two dictionaries for configs
task.connect(model_config)
task.connect(DataAugConfig)
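For context, here is a minimal sketch of what those connect calls look like in full, assuming model_config and DataAugConfig are plain Python dicts (the contents below are placeholders, not our real configs):

from clearml import Task

task = Task.init(project_name="examples", task_name="config_connect_sketch")

# Placeholder config dicts; the real ones are not shown in this thread
model_config = {"backbone": "resnet50", "learning_rate": 1e-3}
DataAugConfig = {"flip": True, "rotation_degrees": 15}

# connect() registers each dict as a configuration object on the task, so values
# edited in the UI are injected back into the dict when the task is cloned and re-run
task.connect(model_config, name="model_config")
task.connect(DataAugConfig, name="DataAugConfig")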
Yes, TensorBoard. It is still logging the TensorBoard scalars and images. It just doesn't log the console output
Is this just the console output while training?
How do you get answers to these types of questions? As far as I can tell the model registry is broken, and there is no support through the actual application
I guess I don't understand. I am referring to the ClearML configuration file on the agent. The only way I have gotten it to work consistently is to install the environment beforehand and set that environment variable. Otherwise it seems ClearML is not correctly saving the environment in a way that can be reproduced. In my case the issue is that it installs tensorflow instead of tensorflow[and-cuda], which is what was actually installed
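As a possible workaround (a sketch only, not something confirmed in this thread): Task.add_requirements can be called before Task.init to override the recorded requirement, so the agent installs the CUDA-enabled package when it rebuilds the environment. The project/task names are placeholders, and I am assuming add_requirements accepts the extras syntax.

from clearml import Task

# Sketch: force the recorded requirement before Task.init so the agent
# installs tensorflow[and-cuda] instead of plain tensorflow.
# (Assumes the extras syntax is accepted; verify against your clearml version.)
Task.add_requirements("tensorflow[and-cuda]")

task = Task.init(project_name="examples", task_name="env_repro_sketch")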
When the script hangs at the end, the experiment shows as failed in ClearML
Running clearml_example.py reproduces the issue
Thank you! I think that is all I need to do
I am using 1.15.0. Yes, I can try with auto_connect_streams set to True, though I believe I will still have the issue
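For reference, this is roughly how I would pass that flag (a sketch; project and task names are placeholders):

from clearml import Task

# auto_connect_streams controls whether stdout/stderr (console output) is captured
# and sent to the server; it also accepts a dict for finer-grained control
task = Task.init(
    project_name="examples",
    task_name="stream_logging_sketch",
    auto_connect_streams=True,  # or {"stdout": True, "stderr": True, "logging": False}
)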
Not sure why that is related to saving images
I found that setting store_uncommitted_code_diff: false instead of true seems to fix the issue
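For anyone else hitting this, the flag lives in clearml.conf; a minimal fragment, assuming it sits under the sdk.development section as in the default config:

sdk {
  development {
    # when false, the uncommitted git diff is not captured and uploaded with the task
    store_uncommitted_code_diff: false
  }
}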
I am still having this issue. An update: the "abort" does not work. Even though the state is correctly tracked in ClearML, when I try to abort the experiment through the UI it reports success, but the experiment remains running on the machine.
Yes I see it in the terminal on the machine
It was working for me. Anyway, I modified the callback. Attached is the script that has the issue for me. Whenever I add random_image_logger to the callbacks, it only logs some of the scalars for 1 epoch, then gets stuck and never recovers. When I remove random_image_logger, the scalars are logged correctly. Again, this is only on 1 computer; logging works perfectly fine on our other machines
This was on the same machine I am having issues with. It logs scalars correctly using the example code, but when I add in that callback, which just logs a random image to TensorBoard, I don't get any scalars logged
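The attached script isn't reproduced here, but a callback along these lines is what I mean by random_image_logger (a rough sketch assuming tf.keras and a TensorBoard file writer, not the exact attached code):

import numpy as np
import tensorflow as tf

class RandomImageLogger(tf.keras.callbacks.Callback):
    """Logs one random image to TensorBoard at the end of every epoch."""

    def __init__(self, log_dir):
        super().__init__()
        self.writer = tf.summary.create_file_writer(log_dir)

    def on_epoch_end(self, epoch, logs=None):
        # Random uint8 image; tf.summary.image expects a leading batch dimension
        image = np.random.randint(0, 255, size=(1, 64, 64, 3), dtype=np.uint8)
        with self.writer.as_default():
            tf.summary.image("random_image", image, step=epoch)
        self.writer.flush()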
There is clearly some connection to the ClearML server, as it remains "running" for the entire training session, but there are no metrics or debug samples. And I see nothing in the logs to indicate there is an issue
We are running the same code on multiple machines and it just randomly happens. Currently we are having the issue on 1 out of 4
I will try with clearml==1.16.3rc2 and see if it still has the issue
Yeah, I am fine not having the console logging. My issue is that the scalars and debug images occasionally don't get recorded to ClearML
I'll update my clearml version. Unfortunately I do not have a small code snippet and it is not always repeatable. Is there some additional logging that can be turned on?
Is there some way to kill all of a machine's connections to the ClearML server? This does seem to be related to restarting a task / running a new task quickly after a task fails or is aborted
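One thing I could try from another process is forcing the stuck task into a terminal state via the SDK (a sketch only; the task ID is a placeholder, and this only changes the status on the server, it does not kill the local process):

from clearml import Task

# Placeholder task ID copied from the UI
stuck = Task.get_task(task_id="<task-id-from-the-ui>")

# Marks the task as stopped on the server; force=True is my assumption for
# tasks that are not in a normally stoppable state
stuck.mark_stopped(force=True)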
Do the metrics not get added from the training? I did not add any metadata, but I assumed you would be able to select metrics from the training run that generated the model
Yeah, in all the YouTube videos it is just there, with no mention of how to get it. But I don't have it
I have file_history_size: 1000 and I still get images for the following epochs. But sometimes it seems like the UI limits the view to 32 images.
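For context, the setting I mean is in clearml.conf; a minimal fragment, assuming it sits under the sdk.metrics section as in the default config:

sdk {
  metrics {
    # how many debug-sample files are kept in history per metric/variant series
    file_history_size: 1000
  }
}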
STATUS MESSAGE: N/A
STATUS REASON: Signal None