Docker has access to all 4 GPUs with the --gpus all flag, and we specify in the config which CUDA device(s) to run on; in PyTorch we can run on more than 2 GPUs.
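Roughly like this, a minimal sketch (the config key is made up; the point is that --gpus all exposes everything and device choice happens in code):
```python
import torch

# The container is started with `docker run --gpus all ...`,
# so all 4 GPUs are visible to PyTorch; our config only picks
# which device(s) to use.
print(torch.cuda.device_count())          # -> 4 when --gpus all exposes all GPUs

device = torch.device("cuda:1")           # e.g. the config says "cuda:1"
model = torch.nn.Linear(8, 2).to(device)  # any module can be moved the same way
```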
Ok, let me check it later today and come back with the results of the example app
Hi AgitatedDove14!
Thanks for your answers. Now I have a follow-up. I was able to successfully run the experiment, copy it in the UI, enqueue it to the default queue, and see it complete.
So now I ran the example and I see the Hydra tab. Is this the experiment arg that I used to run it? `python hydra_example.py experiment=gm_fl_dcl`
Here are the requirements from the repository: with them I was able to run hydra_example.py, but I get a crash with my custom train.py.
And experiments now get stuck in "Running" state even when the train loop is finished.
AgitatedDove14, what is the orchestration module and where can I read more about it?
`task = Task.init(project_name=cfg.project.name, task_name=cfg.project.exp_name)` After discussion, we suspect that using the config before initializing the task may be the problem. Can it cause any issues?
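What we want to test looks roughly like this (config_path/config_name are placeholders, not our real layout):
```python
import hydra
from omegaconf import DictConfig, OmegaConf
from clearml import Task

@hydra.main(config_path=".", config_name="config")
def main(cfg: DictConfig) -> None:
    # Sketch of the reordering: make Task.init the very first call inside
    # main(), before cfg is used anywhere else, so ClearML can hook
    # stdout/logging early.
    task = Task.init(project_name=cfg.project.name,
                     task_name=cfg.project.exp_name)
    print(OmegaConf.to_yaml(cfg))  # cfg is only consumed after the task exists

if __name__ == "__main__":
    main()
```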
We have `sys.stdout.close()` 🙂 forgot to mention it.
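For context, roughly what we do (the wrapper name is made up):
```python
import sys

def shutdown_logging():
    # Made-up wrapper around what we actually do at the end of training.
    # ClearML patches sys.stdout (that is its logger.py in the traceback),
    # so closing it here may be what later raises
    # "ValueError: I/O operation on closed file" inside write().
    sys.stdout.close()
```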
If the best practice is to have one more Docker container with a ClearML client, I will be happy to set it up, but I see no particular benefit in splitting it out from the NVIDIA Docker container that runs the experiments.
We have a physical server in a server farm configured with 4 GPUs, so we run everything on this hardware without renting cloud machines.
A couple of words about our Hydra config:
It is located in the root next to the train.py file, but the default config points to an experiment folder with other configs, and this is what I have to specify on every run.
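To make it concrete, roughly this layout (all names are hypothetical except gm_fl_dcl, the experiment I actually run):
```python
# Repo layout (illustration only):
#
#   train.py
#   config.yaml          # root config next to train.py
#   experiment/
#       gm_fl_dcl.yaml   # one of several experiment configs
#
# config.yaml leaves the experiment entry of its defaults list unset,
# so every run must pick one explicitly:
#   python train.py experiment=gm_fl_dcl
```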
Previously I had a General tab under Hyper Parameters, but now, without this line, I don't have it.
Yes, I have the latest version, 1.0.5, now, and it gives the same result in the UI as the previous version I used.
Martin, thank you very much for your time and dedication, I really appreciate it
You can see the white-gray mesh in the background that marks the end of the image. It is cropped in the middle of the labels.
I saved it to my PC, so it is not only a UI issue. My guess is that the plt figure is cropped either by SummaryWriter or by trains. Help me debug where the problem is; meanwhile, I will try to see what SummaryWriter does to plt plots.
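My plan for isolating it, a made-up minimal test (the figure content is just a placeholder):
```python
import matplotlib.pyplot as plt
from torch.utils.tensorboard import SummaryWriter

# Render the same figure two ways and check which copy comes out cropped.
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_ylabel("some long label")
fig.savefig("direct.png")              # 1) straight from matplotlib

writer = SummaryWriter("debug_logs")   # 2) through SummaryWriter
writer.add_figure("debug/fig", fig)
writer.close()
```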
Hi, I solved this cropping of the labels with `fig.tight_layout()` before `return fig`.
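For anyone hitting the same thing, a minimal sketch of the fix (the function and data are made up):
```python
import matplotlib.pyplot as plt

def make_plot(values):
    # The fix: tight_layout() recomputes the margins so axis/tick labels
    # are no longer cut off at the border of the rendered image.
    fig, ax = plt.subplots()
    ax.plot(values)
    fig.tight_layout()
    return fig
```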
AgitatedDove14, I think Tim wanted to know what task_log_buffer_capacity is
and what functionality it provides.
One more interesting bug: after I changed my train.py according to hydra_example.py, I started getting errors at the end of the experiment:
```
--- Logging error ---
ValueError: I/O operation on closed file.
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 200, in write
    self._terminal._original_write(message)  # noqa
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger...
```
Yes, everything runs on the same machine in different Docker containers.
Hi David, where can I get these logs?
You mean I can use Epoch001/ and Epoch002/ to split them into groups and get the 100-image limit per group?
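Something like this, if I understand the idea (the series name and image are made up, and it assumes a Task is already initialized):
```python
import numpy as np
from clearml import Logger

# Use the title as the group, e.g. "Epoch001", "Epoch002", so the
# image-history limit applies to each epoch group separately.
img = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)  # dummy image
Logger.current_logger().report_image(
    title="Epoch001", series="sample_0", iteration=0, image=img)
```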
Thank you, I will try
TimelyPenguin76 thank you for posting this. I just realized that I changed the wrong config: I changed the one on the server, but I needed to change the one inside the Docker container. Now everything works. Thanks for the help!
Thank you. I've changed clearml.conf, but the URLs still have the old IP. Do I need to restart ClearML or run any command to apply the config changes?