For local testing, we have added a
ScantChimpanzee51, there is already an environment variable for that; you can just set CLEARML_OFFLINE_MODE
🙂
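A minimal sketch of the environment-variable route, assuming the variable must be set before `clearml` is imported and that `"1"` is read as a true value. `enable_offline_mode` and `init_offline_task` are hypothetical helper names, and the import is guarded so the snippet also runs where `clearml` is not installed:

```python
import os


def enable_offline_mode() -> None:
    """Request offline mode through the CLEARML_OFFLINE_MODE variable."""
    # Assumption: "1" is parsed as a true value by clearml.
    os.environ["CLEARML_OFFLINE_MODE"] = "1"


def init_offline_task(project_name: str, task_name: str):
    """Create a task in offline mode; returns None if clearml is missing."""
    enable_offline_mode()  # set the variable before clearml is imported
    try:
        from clearml import Task
    except ImportError:
        return None
    return Task.init(project_name=project_name, task_name=task_name)
```

Usage would then be something like `task = init_offline_task("debug", "offline test")`.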
By the way, if we don’t wrap other calls in is_offline() we get errors like “DateTime object is not serializable”, but that’s a secondary issue.
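One possible workaround for the serialization error is to convert datetime values to strings before handing them to ClearML. This is only a sketch, under the assumption that the error comes from JSON-encoding a `datetime`; `to_serializable` is a hypothetical helper, not part of ClearML:

```python
import json
from datetime import date, datetime


def to_serializable(value):
    """Recursively replace date/datetime objects with ISO 8601 strings."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: to_serializable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_serializable(item) for item in value]
    return value


# Example values are made up for illustration.
params = {"started": datetime(2022, 11, 7, 7, 49, 6), "epochs": 18}
print(json.dumps(to_serializable(params)))
```

Tuples come back as lists, which matches what JSON would do with them anyway.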
I think this was fixed, can you verify with the latest RC 1.7.3rc0? If this still happens, can you share the code?
However, this results in the process getting interrupted, the outputs show:
Are you saying offline mode is broken?
It might be broken for me. As I said, the program works without offline mode but gets interrupted and shows the results from above with offline mode.
How could I reproduce this issue?
But there might be another issue in between of course - any idea how to debug?
I think I missed this one, what exactly is the issue?
Hi AgitatedDove14, so it took some time but I’ve finally managed to reproduce. The issue seems to be related to writing images via TensorBoard:
` from torch.utils.tensorboard import SummaryWriter
import torch
from clearml import Task, Logger

if __name__ == "__main__":
    task = Task.init(project_name="ClearML-Debug", task_name="[Mac] TB Logger, offline")
    tb_logger = SummaryWriter(log_dir="tb_logger/demo/")
    # Random image in height x width x channels layout
    image_tensor = torch.rand(256, 256, 3)
    for step in range(10):
        tb_logger.add_image("images/image123/img", image_tensor, step, dataformats="HWC")
    task.flush(wait_for_uploads=True) `
Again the errors show up as
2022-11-09 09:47:27,602 - clearml.metrics - WARNING - Failed uploading to /Users/manuel/.clearml/cache/offline/offline-028b2df9167049eba4bdce7c6f89f39e/data (Target path "/Users/manuel/.clearml/cache/offline/offline-028b2df9167049eba4bdce7c6f89f39e/data" does not exist
Any idea about that? I’m also happy to open an issue on GitHub with the details if you like 🙂
By the way no rush about this - we will turn off TB logging in the meantime.
Thanks for all the help!
I meant maybe me activating offline mode, somehow changes something else in the runtime and that in turn leads to the interruption. Let me try to build a minimal reproducible version 🙂
BTW, this one seems to work ....
` from time import sleep
from clearml import Task

Task.set_offline(True)
task = Task.init(project_name="debug", task_name="offline test")
print("starting")
for i in range(300):
    print(f"{i}")
    sleep(1)
print("done") `
Thanks ScantChimpanzee51 !
Let me see what I can find, should be easy enough to fix now 🙂
The environment variable is good to know, I will try with that as well and report back.
Thank you!
So AgitatedDove14, if we use the CLEARML_OFFLINE_MODE environment variable instead, the program runs through again.
The only thing is that now we get errors of the form
0%| | 0/18 [00:00<?, ?image/s]ClearML running in offline mode, session stored in /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486
2022-11-07 07:49:06,986 - clearml.metrics - WARNING - Failed uploading to /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486/data (Target path "/home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486/data" does not exist)
I’ve checked the path and it exists, except for the data subdirectory: /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486/ exists, but there is no data directory in it. Any idea where that could come from? Could we turn off the local logging as well - in these kinds of runs we don’t need it?
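For debugging, a small check like the following could confirm which subdirectories of the offline session folder are actually missing at the time of the warning. It is a sketch using only the standard library; `missing_subdirs` is a hypothetical helper, and the assumption that `data` is the only subdirectory worth checking comes solely from the warning above:

```python
from pathlib import Path


def missing_subdirs(session_dir, expected=("data",)):
    """Return the expected subdirectory names that are absent from session_dir."""
    root = Path(session_dir).expanduser()
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_subdirs("/home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486")` right after the warning appears would return `["data"]` if the directory was never created, and an empty list if it exists by then.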
It is supposed to create it automatically... I tested with other examples (clearml version 1.7.3rc1) and everything seems to work.
What am I missing? How do we recreate the issue? Can you verify it is still not working with the latest RC?