
Hi Martin, thank you for your reply.
Could you please show an example of the image title/series?
Mine have names like Epoch_001_first_batch_train, Epoch_001_first_batch_val,
Epoch_001_first_batch_val_balanced,
Epoch_002_first_batch_train, and so on.
I don't know for sure, but this is what I understand from the code. You would need to have 100 experiments running at the same time for it to matter, so unless you have access to 100 GPUs you should be fine.
ReassuredTiger98 why don't you take 5 minutes and check out the source code? https://github.com/allegroai/clearml/blob/701fca9f395c05324dc6a5d8c61ba20e363190cf/clearml/backend_interface/task/log.py
It is pretty obvious: it replaces the oldest task with the new one when the buffer is full.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
writer.add_figure('name', figure=fig)
where fig is a matplotlib figure
You mean I can do Epoch_001/ and Epoch_002/ to split them into groups and get the 100-image limit per group?
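Something like this is what I have in mind (just a sketch; the slash-based grouping is my assumption from this discussion, and the names are my own):
```python
import matplotlib.pyplot as plt
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

# Put the epoch in front of a slash so each epoch becomes its own group
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0.9, 0.6, 0.4])
writer.add_figure("Epoch_001/first_batch_train", figure=fig)

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [1.0, 0.8, 0.7])
writer.add_figure("Epoch_001/first_batch_val", figure=fig)

writer.close()
```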
Thank you, I will try
AgitatedDove14 I think Tim wanted to know what task_log_buffer_capacity is and what functionality it provides.
I did my best to explain it.
You have a buffer of tasks, for example 100. When you add task #101, task #1 is replaced with the new one, so you now keep tasks #2 to #101.
Because I have more than 100 saved experiments, I don't think anyone should bother to change it, unless you are running more than 100 experiments at the same time.
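Roughly what I mean, as a plain-Python sketch (a deque standing in for the buffer; this is not ClearML's actual implementation):
```python
from collections import deque

# Buffer with capacity 100: adding task #101 pushes out task #1
buffer = deque(maxlen=100)
for task_id in range(1, 102):
    buffer.append(task_id)

print(buffer[0], buffer[-1])  # 2 101 -> tasks #2..#101 are kept, #1 was dropped
```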
Here are the requirements from the repository with which I was able to run hydra_example.py but get a crash with my custom train.py.
Yes, I have the latest version 1.0.5 now and it gives the same result in the UI as the previous version I used.
Previously I had a General tab under Hyper Parameters, but now, without this line, I don't have it.
One more interesting bug: after I changed my train.py according to hydra_example.py, I started getting errors at the end of the experiment:
` --- Logging error ---
ValueError: I/O operation on closed file.
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 200, in write
    self._terminal._original_write(message)  # noqa
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger...
I can only assume that task = Task.init(project_name=cfg.project.name, task_name=cfg.project.exp_name)
is broken because it has to read the config, and depending on where I run it, it has no access to the config. I will investigate this with my co-worker and let you know if we find a solution.
One more important thing - I have an NVIDIA-based Docker container running on the Ubuntu server (the same one that hosts the ClearML server) and I am afraid that initiating a task from the command line and from the ClearML web UI run in ...
When you previously mentioned cloning the Task in the UI and then running it, how do you actually run it?
Very good question, I need to understand what happens when I press "Enqueue" in the web UI and set it to the default queue.
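My current understanding of what "clone + Enqueue" does, written as code (just a sketch using the SDK calls I know of; the task ID and name are placeholders):
```python
from clearml import Task

# Roughly the SDK equivalent of cloning a task in the UI and pressing "Enqueue"
template = Task.get_task(task_id="<some-task-id>")            # placeholder ID
cloned = Task.clone(source_task=template, name="cloned run")  # placeholder name
Task.enqueue(cloned, queue_name="default")
# A clearml-agent listening on the "default" queue should pick it up and execute it
```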
`
cfg.pretty() is deprecated and will be removed in a future version.
Use OmegaConf.to_yaml(cfg)
--- Logging error ---
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/logging/__init__.py", line 1084, in emit
    stream.write(msg + self.terminator)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 141, in _stdout__patched__write_
    return StdStreamPatch._stdout_proxy.write(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-p...
We have sys.stdout.close() in the code 🙂 I forgot to mention it.
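For context, this is roughly the failure mode closing sys.stdout creates for any logging handler that still holds the old stream (plain logging here, not ClearML's patched writer):
```python
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout)
logging.info("before close")  # fine

sys.stdout.close()            # the handler still points at the now-closed stream

# Triggers "--- Logging error ---" with ValueError: I/O operation on closed file
logging.info("after close")
```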
A couple of words about our Hydra config:
it is located in the root next to the train.py file, but the default config points to an experiment folder with other configs, and this is what I need to specify on every run.
task = Task.init(project_name=cfg.project.name, task_name=cfg.project.exp_name) After discussion, we suspect that using the config before initializing the task might be the cause. Can that cause any problems?
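What we want to test, as a sketch (the project/experiment names are placeholders, and it assumes a config.yaml next to the script):
```python
import hydra
from clearml import Task
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path=".", config_name="config")
def main(cfg: DictConfig) -> None:
    # Variant to test: create the task with static names first,
    # and only read values out of cfg after Task.init has run
    task = Task.init(project_name="my_project", task_name="my_experiment")
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```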
The Docker container has access to all 4 GPUs via the --gpus all flag, and we specify in the config which CUDA device(s) to run on; in PyTorch we can run on more than 2 GPUs.
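To illustrate what I mean by picking the CUDA devices from the config (the "0,1,2,3" string and the tiny model are just placeholders):
```python
import torch
import torch.nn as nn

# Stand-in for the value we keep in the config, e.g. a cfg.train.gpus string
gpus = "0,1,2,3"
gpu_ids = [int(i) for i in gpus.split(",")]

model = nn.Linear(10, 2)  # stand-in for the real network
if torch.cuda.is_available() and len(gpu_ids) > 1:
    # More than 2 GPUs works the same way as 2
    model = nn.DataParallel(model, device_ids=gpu_ids)
    model = model.to(f"cuda:{gpu_ids[0]}")
```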
Hi AgitatedDove14 !
Thanks for your answers. Now I have a follow-up. I was able to successfully run the experiment, clone it in the UI, enqueue it to the default queue and see it complete.
No, even a newly started experiment is still creating images with the 172.* address.
` cat ~/clearml.conf
# ClearML SDK configuration file
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server:
    web_server:
    files_server: `
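For reference, this is the shape I would expect the section to take once the addresses point at the externally reachable host instead of the internal 172.* one (the ports are the ClearML defaults; <server-ip> is a placeholder, not our real address):
```
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server: http://<server-ip>:8008
    web_server: http://<server-ip>:8080
    files_server: http://<server-ip>:8081
}
```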
If it is best practice to have one more Docker container with the ClearML client, I will be happy to set it up, but I see no particular benefit in splitting it out from the NVIDIA container that runs the experiments.
AgitatedDove14 orchestration module - what is this and where can I read more about it?
And experiments now get stuck in "Running" mode even when the train loop has finished.
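One thing we will try (just a sketch, I am not sure it is the intended fix) is closing the task explicitly at the end of train.py instead of relying on the process exit:
```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="my_experiment")  # placeholder names

# ... training loop ...

task.close()  # flush the loggers and close the task explicitly
```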
Python 3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0] :: Anaconda, Inc. on linux
clearml.__version__
'1.0.5'
Ubuntu 20.04.1 LTS
Ok, let me check it later today and come back with the results of the example app
We have a physical server in a server farm that we configured with 4 GPUs, so we run everything on this hardware without renting cloud machines.
So now I did run the example and I see the Hydra tab. Is that the experiment arg that I used to run it? python hydra_example.py experiment=gm_fl_dcl
Martin, thank you very much for your time and dedication, I really appreciate it
You can see the white-gray mesh in the background that marks the end of the image. It is cropped in the middle of the labels.