Sometimes I get no scalars, but the console logging always seems to be working
Not sure why that is related to saving images
I found that setting store_uncommitted_code_diff: false instead of true seems to fix the issue
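For anyone else hitting this, the setting lives in clearml.conf; a minimal sketch, assuming the section path from the default config that ships with clearml:

sdk {
    development {
        # don't store the uncommitted git diff with the task
        store_uncommitted_code_diff: false
    }
}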
I just created a new virtual environment and the problem persists. There are only two dependencies: clearml and tensorflow. @CostlyOstrich36 what logs are you referring to?
Running clearml_example.py in None reproduces the issue
So I was able to reproduce the same behavior on a machine running this example None, by adding the following callback:
import numpy as np
from tensorflow.keras.callbacks import TensorBoard

class TensorBoardImage(TensorBoard):
    @staticmethod
    def make_image(tensor):
        from PIL import Image
        import io
        # stack the 2-D grayscale tensor into three identical channels for RGB
        tensor = np.stack((tensor, tensor, tensor), axis=2)
        height, width, channels = tensor.shape
        image = Image.fromarray(tensor)  # presumably fromarray; the original line was cut off here
        ...
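For context, I hook it up roughly like this (log_dir and the fit arguments are just illustrative):

# attach the custom callback to training; values are placeholders
tb_images = TensorBoardImage(log_dir='./logs')
model.fit(x_train, y_train, epochs=50, callbacks=[tb_images])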
We are running the same code on multiple machines and it just randomly happens. Currently we are seeing the issue on 1 out of 4 machines.
Yes, it shows in the UI and has the first epoch for some of the metrics, but that's it. It has run about 50 epochs and it says it is still running, but there are no updates to the scalars or debug samples.
I'm not sure if it still reports logs, but it will continue running on the machine.
Is there some way to kill all of a machine's connections to the ClearML server? This does seem to be related to restarting a task / running a new task quickly after a task fails or is aborted.
Is this just the console output while training?
I'm not sure how to even troubleshoot this.
Thanks! It looks like I can set
auto_connect_streams = False
in the task init at least to try.
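Something like this, if I'm reading it right (project and task names are placeholders):

from clearml import Task

# disable automatic capture of stdout/stderr/logging streams
task = Task.init(
    project_name='my_project',      # placeholder
    task_name='debug_no_streams',   # placeholder
    auto_connect_streams=False,
)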
We are using Keras, so it is logging progress bars by default, which I think we could turn off. I just wouldn't expect logging text to require so many API calls. Especially since they charge by API call, I assumed it would be better managed.
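Turning off the per-batch progress bar should just be the verbose flag on fit, e.g.:

# verbose=2 prints one line per epoch instead of a per-batch progress bar;
# verbose=0 silences fit() output entirely
model.fit(x_train, y_train, epochs=50, verbose=2)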
It's possible. Is there a way to just slow down or turn off the log streaming to see how it affects the API calls?
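In case it helps anyone later: my understanding is the reporting interval can be stretched in clearml.conf under sdk.development.worker (the exact default value is an assumption on my part), e.g.:

sdk {
    development {
        worker {
            # seconds between batched metric/log reports to the server
            report_period_sec: 30
        }
    }
}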
Yes, it is logging to the console. When it is having the issue, the script also hangs after it completes all the epochs.
Will do! It probably won't be until next week. I don't plan on stopping this run to try it but will definitely follow up with my results.
Yea I think if we self-hosted I wouldn't have noticed it at all
Hi, we are currently having the issue. There is nothing in the console regarding ClearML besides:
ClearML Task: created new task id=0174d5b9d7164f47bd10484fd268e3ff
======> WARNING! Git diff too large to store (3611kb), skipping uncommitted changes <======
ClearML results page:
The console logs continue to come in, but no scalars or debug images show up.
It seems similar to this None. Is it possible that saving too many model weights causes the metric logging thread to die?
It looks like it creates a task_repository folder in the virtual environment folder. There is a way to specify your virtual environment folder, but I haven't found any way to specify the git directory.
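For reference, the venv location setting I mean is the agent one in clearml.conf (the git clone location is the part I can't find):

agent {
    # where clearml-agent builds the task virtual environments
    venvs_dir: ~/.clearml/venvs-builds
}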