
Thank you! I think that is all I need to do
Yeah, in all the YouTube videos it is just there with no mention of how to get it. But I don't have it
I didn't do a very scientific comparison, but the number of API calls did decrease substantially after turning off auto_connect_streams
It is probably about 100k API calls per day with 1 experiment running, whereas before it was maybe 300k API calls per day. Still seems like a lot when I only run 20-30 epochs in a day
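For reference, turning it off is just the auto_connect_streams flag on Task.init; a minimal sketch (the project and task names here are placeholders):
from clearml import Task

# Disable automatic capture of stdout/stderr so console output is not streamed to the server
task = Task.init(project_name="my_project",       # placeholder
                 task_name="my_experiment",       # placeholder
                 auto_connect_streams=False)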
It looks like it creates a task_repository folder in the virtual environment folder. There is a way to specify your virtual environment folder but I haven't found any way to specify the git directory
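For the virtual environment folder I mean the agent setting, something like this in clearml.conf (a sketch, assuming the usual venvs_dir key; I still haven't found an equivalent for the git checkout itself):
agent {
    # where the agent builds per-task virtual environments; the task_repository
    # clone I mentioned ends up under this folder
    venvs_dir: ~/.clearml/venvs-builds
}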
Is there some way to kill all connections a machine has to the ClearML server? This does seem to be related to restarting a task / running a new task quickly after a task fails or is aborted
I just created a new virtual environment and the problem persists. There are only two dependencies: clearml and tensorflow. @<1523701070390366208:profile|CostlyOstrich36> what logs are you referring to?
Yes I see it in the terminal on the machine
Hi, we are currently having the issue. There is nothing in the console regarding ClearML besides:
ClearML Task: created new task id=0174d5b9d7164f47bd10484fd268e3ff
======> WARNING! Git diff too large to store (3611kb), skipping uncommitted changes <======
ClearML results page:
The console logs continue to come in but no scalars or debug images show up.
Yes, TensorBoard. It is still logging the TensorBoard scalars and images. It just doesn't log the console output
I am using 1.15.0. Yes, I can try with auto_connect_streams set to True; I believe I will still have the issue
Not sure if this is helpful, but this is what I get when I Ctrl-C out of the hung script:
^C^CException ignored in atexit callback: <bound method Reporter._handle_program_exit of <clearml.backend_interface.metrics.reporter.Reporter object at 0x70fd8b7ff1c0>>
Event reporting sub-process lost, switching to thread based reporting
Traceback (most recent call last):
File "/home/richard/.virtualenvs/temp_clearml/lib/python3.10/site-packages/clearml/backend_interface/metrics/reporter.py", lin...
They are TensorBoard images that are automagically being logged to debug samples
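i.e. they come from plain TensorBoard image summaries in the training loop, roughly like this sketch (the log dir and summary names are made up):
import tensorflow as tf

writer = tf.summary.create_file_writer("./tb_logs")  # placeholder log directory

def log_sample_images(images, epoch):
    # A standard tf.summary image call; ClearML's TensorBoard auto-logging
    # picks these up and shows them as debug samples
    with writer.as_default():
        tf.summary.image("val_samples", images, step=epoch, max_outputs=4)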
I have file_history_size: 1000
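That is set in the sdk.metrics section of my clearml.conf, something like this (a sketch, assuming I have the key path right):
sdk {
    metrics {
        # keep up to 1000 debug-sample files per metric/variant instead of the default
        file_history_size: 1000
    }
}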
I still get images for the following epochs, but sometimes it seems like the UI limits the view to 32 images.
It is still getting stuck. I think the issue might have something to do with iterations versus epochs. I notice that one of the scalars that gets logged early uses the epoch as its step, while the remaining scalars seem to use iterations, because the iteration value is 1355 instead of 26
How do you get answers to these types of questions? As far as I can tell the model registry is broken, and there is no support through the actual application
STATUS MESSAGE: N/A
STATUS REASON: Signal None
I am on 1.16.2
from clearml import Task

task = Task.init(project_name=model_config['ClearML']['project_name'],
                 task_name=model_config['ClearML']['task_name'],
                 continue_last_task=False,
                 auto_connect_streams=True)
Yeah, I am fine not having the console logging. My issue is that the scalars and debug images occasionally don't get recorded to ClearML
It seems similar to this: None. Is it possible that saving too many model weights causes the metric logging thread to die?
Then we also connect two dictionaries for configs
task.connect(model_config)
task.connect(DataAugConfig)
There is clearly some connection to the ClearML server, as the task remains "running" for the entire training session, but there are no metrics or debug samples. And I see nothing in the logs to indicate there is an issue
When the script is hung at the end, the experiment says failed in ClearML
The console logging still works. Aborting the task showed up in the log but did not work, and the process continued until I killed it.
Yes, it is logging to the console. The script does hang after it completes all the epochs whenever it is having the issue.
I just used CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL
Can that be put in the clearml.conf? I didn't see a reference to it in the documentation
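For context, I only set it as an environment variable for the agent process, something like this (a sketch; the queue name is a placeholder), not via clearml.conf:
# skip building a fresh virtual environment per task and reuse the agent's existing Python environment
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
clearml-agent daemon --queue default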
Correct, so I get something like this
ClearML Task: created new task id=6ec57dcb007545aebc4ec51eb5b34c67
======> WARNING! Git diff too large to store (2536kb), skipping uncommitted changes <======
ClearML results page:
but that is all
It is not always reproducible. It seems like something we do not understand happens, and then the machine consistently has this issue. We believe it has something to do with stopping and starting experiments
The same training works sometimes. But I'm not sure how to troubleshoot when it stops logging the metrics
Another thing I notice is that aborting the experiment does not work when this is happening. It just continues to run