Is this just the console output while training?
Correct, so I get something like this:
ClearML Task: created new task id=6ec57dcb007545aebc4ec51eb5b34c67
======> WARNING! Git diff too large to store (2536kb), skipping uncommitted changes <======
ClearML results page:
but that is all
I am on 1.16.2
task = Task.init(project_name=model_config['ClearML']['project_name'],
                 task_name=model_config['ClearML']['task_name'],
                 continue_last_task=False,
                 auto_connect_streams=True)
Yeah, in all the YouTube videos it is just there, with no mention of how to get it. But I don't have it.
The console logging still works. The abort showed up in the log but did not actually stop anything; the process continued until I killed it.
I found that setting store_uncommitted_code_diff: false instead of true seems to fix the issue.
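For anyone else hitting this, my understanding is that the flag lives under sdk.development in clearml.conf, roughly like this:
sdk {
    development {
        # skip uploading the uncommitted git diff with the task
        store_uncommitted_code_diff: false
    }
}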
Another thing I notice is that aborting the experiment does not work when this is happening; it just continues to run.
Will do! It probably won't be until next week. I don't plan on stopping this run to try it but will definitely follow up with my results.
Yeah, I think if we self-hosted I wouldn't have noticed it at all.
Thanks! It looks like I can set auto_connect_streams = False in the task init, at least to try.
We are using Keras, so it is logging progress bars by default, which I think we could turn off. I just wouldn't expect logging text to require so many API calls. Especially since they charge by API call, I assumed it would be better managed.
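This is roughly what I plan to try; the project/task names and the tiny model are just placeholders to keep the sketch self-contained:
import numpy as np
from clearml import Task
from tensorflow import keras

# Assumption: disabling console-stream capture should cut down the API calls
task = Task.init(
    project_name="my_project",   # placeholder
    task_name="stream_test",     # placeholder
    auto_connect_streams=False,  # don't forward stdout/stderr to ClearML
)

# Stand-in model and data, just to show the Keras side
model = keras.Sequential([keras.Input(shape=(4,)), keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
x, y = np.random.rand(64, 4), np.random.rand(64, 1)

# verbose=2 prints one line per epoch instead of a live progress bar,
# so far less console text is produced in the first place
model.fit(x, y, epochs=5, verbose=2)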
Then we also connect two dictionaries for configs:
task.connect(model_config)
task.connect(DataAugConfig)
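If it ends up mattering, I believe connect() also accepts a name argument, so the two dicts could be kept under separate sections, e.g.:
task.connect(model_config, name="model_config")
task.connect(DataAugConfig, name="data_aug_config")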
So I was able to reproduce the same behavior on a machine running this example, by adding the following callback:
class TensorBoardImage(TensorBoard):
    @staticmethod
    def make_image(tensor):
        from PIL import Image
        import io
        tensor = np.stack((tensor, tensor, tensor), axis=2)
        height, width, channels = tensor.shape
        image = Image.from...
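The paste above got cut off; for context, the rest of that helper follows the usual TF1-style Summary.Image pattern, roughly like this (the uint8 cast is my assumption about the input range):
import io
import numpy as np
import tensorflow as tf
from PIL import Image

def make_image(tensor):
    # stack a single-channel array into a 3-channel RGB array
    tensor = np.stack((tensor, tensor, tensor), axis=2)
    height, width, channels = tensor.shape
    # assumption: values already fit in 0-255, so cast straight to uint8
    image = Image.fromarray(tensor.astype(np.uint8))
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    # wrap the PNG bytes in a TF1-style Summary.Image proto
    return tf.compat.v1.Summary.Image(
        height=height,
        width=width,
        colorspace=channels,
        encoded_image_string=buffer.getvalue(),
    )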
I am using 1.15.0. Yes, I can try with auto_connect_streams set to True; I believe I will still have the issue.
I just created a new virtual environment and the problem persists. There are only two dependencies: clearml and tensorflow. @CostlyOstrich36 what logs are you referring to?
It is still getting stuck. I think the issue might have something to do with iterations versus epochs. I notice that one of the scalars logged early reports the epoch, while the remaining scalars seem to use iterations, because the iteration value is 1355 instead of 26.
Yes, it is logging to the console. When it is having the issue, the script hangs once it completes all the epochs.
Yes, TensorBoard. It is still logging the TensorBoard scalars and images; it just doesn't log the console output.
I am still having this issue. An update is that the abort does not work. Even though the state is correctly tracked in ClearML, when I try to abort the experiment through the UI it says it did, but the experiment remains running on the computer.
So I am only seeing values for the first epoch. It seems like it does not track all of them so maybe something is happening when it tries to log scalars.
I have seen it only log iterations, but setting task.set_initial_iteration(0) seemed to fix that, so it now seems to be logging the correct epoch.
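For reference, I just call it right after the task is created (the names here are placeholders):
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")
# start counting from iteration 0 so the epoch-based scalars line up
task.set_initial_iteration(0)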
TensorBoard is correct and works. I have never seen an issue in the TensorBoard logs.
Okay, I will do another run to capture the console output. We currently set auto_connect_streams to False to reduce the number of API calls, so there isn't really anything in the console section of the ClearML task page.
The machine currently having the issue is on tensorboard==2.16.2
Not sure if this is helpful, but this is what I get when I Ctrl-C out of the hung script:
^C^CException ignored in atexit callback: <bound method Reporter._handle_program_exit of <clearml.backend_interface.metrics.reporter.Reporter object at 0x70fd8b7ff1c0>>
Event reporting sub-process lost, switching to thread based reporting
Traceback (most recent call last):
File "/home/richard/.virtualenvs/temp_clearml/lib/python3.10/site-packages/clearml/backend_interface/metrics/reporter.py", lin...
It looks like it creates a task_repository folder in the virtual environment folder. There is a way to specify your virtual environment folder, but I haven't found any way to specify the git directory.
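For what it's worth, the venv location itself can be pointed elsewhere in clearml.conf; my understanding is it's the agent section, something like this (the path is just an example):
agent {
    # where the agent builds task virtual environments
    # (this is where I see the task_repository folder appear)
    venvs_dir: /data/clearml/venvs-builds
}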
sometimes I get no scalars, but the console logging always seems to be working
I'm not sure how to even troubleshoot this.
When I try to abort an experiment, I get this in the log:
clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
but it does not stop anything; it just continues to run.
Not sure why that is related to saving images
Yes, it shows in the UI and has the first epoch for some of the metrics, but that's it. It has run about 50 epochs and says it is still running, but there are no updates to the scalars or debug samples.
Is there some way to kill all connections of a machine to the ClearML server? This does seem to be related to restarting a task or running a new task quickly after a task fails or is aborted.
There is clearly some connection to the ClearML server, as it remains "running" for the entire training session, but there are no metrics or debug samples, and I see nothing in the logs to indicate there is an issue.