`task = Task.init(project_name=cfg.project.name, task_name=cfg.project.exp_name)` After some discussion, we suspect the problem may be that we use the config before initializing the task. Can that cause any problems?
Nevertheless, when I try to run my training code, which differs very little from the example, I can't copy and run it from the UI, and I don't even see the hyperparameters in the experiment results.
```python
import os

import hydra
from hydra import utils
from utils.class_utils import instantiate
from omegaconf import DictConfig, OmegaConf
from clearml import Task


@hydra.main(config_path="conf", config_name="default")
def app(cfg):
    run(cfg)


def run(cfg):
    # Task.init is called inside a nested function, after Hydra has built cfg
    task = Task.init(project_name=cfg.project.name, task_name=cfg.project.exp_name)
    logger = task.get_logger()
    logger.report_text("You can view your full hydra configuration under Configuration tab in the UI")
    print(OmegaConf.to_yaml(cfg))
    print('+' * 200)
    # some other hydra.utils.instantiate code
    trainer.train()


if __name__ == "__main__":
    app()
```
Docker has access to all 4 GPUs via the `--gpus all` flag, and we specify in the config which CUDA device(s) to run on. In PyTorch we can run on more than 2 GPUs.
Python 3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0] :: Anaconda, Inc. on linux
clearml.__version__
'1.0.5'
Ubuntu 20.04.1 LTS
If it is best practice to have one more Docker container with the ClearML client, I will be happy to set it up, but I see no particular benefit in splitting it out from the NVIDIA Docker container that runs the experiments.
We have a physical server in a server farm, which we configured with 4 GPUs, so we run everything on this hardware without renting cloud resources.
Martin, thank you very much for your time and dedication, I really appreciate it
My pleasure
Yes, I have the latest version (1.0.5) now, and it gives the same result in the UI as the previous version I used.
Hmm, are you saying the automatic Hydra connection doesn't work? Is it the folder structure?
When is Task.init called?
See example here:
https://github.com/allegroai/clearml/blob/master/examples/frameworks/hydra/hydra_example.py
As long as you import clearml in the main script, it should work. Regarding the NVIDIA container, it should not interfere with any running processes; the only issue is the memory limit. BTW, any reason not to spin up an agent on a dedicated machine? What is the GPU used for on the clearml server machine?
A couple of words about our Hydra config:
it is located in the root, next to the train.py file. But the default config points to an experiment folder with other configs, and this is what I need to specify on every run.
MortifiedDove27 did you update to the latest clearml python package?
Hmm, are you running the clearml-agent on this machine? (This is the orchestration module; it will spin up the Tasks and the Docker containers on the GPUs.)
Thanks MortifiedDove27! Let me see if I can reproduce it. If I understand the difference correctly, it's the Task.init in a nested function, is that it?
BTW, what's the Hydra version? And Python and OS?
One more interesting bug. After I changed my "train.py" according to hydra_example.py, I started getting errors at the end of the experiment:
--- Logging error ---
2021-08-17 13:33:28 ValueError: I/O operation on closed file.
2021-08-17 13:33:28 File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 200, in write
    self._terminal._original_write(message) # noqa
2021-08-17 13:33:28 File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 141, in _stdout__patched__write__
    return StdStreamPatch._stdout_proxy.write(*args, **kwargs)
2021-08-17 13:33:28 File "/opt/conda/lib/python3.8/logging/__init__.py", line 1084, in emit
    stream.write(msg + self.terminator)
2021-08-17 13:33:28 Traceback (most recent call last):
2021-08-17 13:33:28 Message: 'Waiting to finish uploads'
    Arguments: ()
2021-08-17 13:33:28 File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 3005, in __shutdown
    self.log.info('Waiting to finish uploads')
2021-08-17 13:33:28 File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 2915, in _at_exit
    self.__shutdown()
2021-08-17 13:33:28 Call stack:
```
cfg.pretty() is deprecated and will be removed in a future version.
Use OmegaConf.to_yaml(cfg)
--- Logging error ---
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/logging/__init__.py", line 1084, in emit
    stream.write(msg + self.terminator)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 141, in _stdout__patched__write__
    return StdStreamPatch._stdout_proxy.write(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 200, in write
    self._terminal._original_write(message) # noqa
ValueError: I/O operation on closed file.
Call stack:
  File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 2915, in _at_exit
    self.__shutdown()
  File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 3005, in __shutdown
    self.log.info('Waiting to finish uploads')
Message: 'Waiting to finish uploads'
Arguments: ()
```
orchestration module
When you previously mentioned cloning the Task in the UI and then running it, how do you actually run it?
Regarding the exception stack:
It's pointing to a stdout that was closed?! How could that be? Any chance you can provide a toy example for us to debug?
We do have a sys.stdout.close() call in our code, I forgot to mention it.
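The same `ValueError` can be reproduced outside ClearML. The snippet below is only a minimal sketch: it assumes an in-memory `io.StringIO` as a stand-in for `sys.stdout`, and `write_at_exit` is a hypothetical helper that mimics a shutdown hook writing one last log line. It is not ClearML's actual code.

```python
import io


def write_at_exit(stream, close_first):
    """Mimic a shutdown hook that writes one last message to `stream`.

    Hypothetical helper for illustration only.
    """
    if close_first:
        stream.close()  # same effect sys.stdout.close() has on the real stdout
    else:
        stream.flush()  # flushing leaves the stream usable
    try:
        stream.write("Waiting to finish uploads\n")
        return None
    except ValueError as exc:
        # Writing to a closed stream raises ValueError
        return str(exc)


closed_error = write_at_exit(io.StringIO(), close_first=True)
flushed_error = write_at_exit(io.StringIO(), close_first=False)
print("after close():", closed_error)
print("after flush():", flushed_error)
```

If the script only needs buffered output to reach the terminal, flushing instead of closing would avoid the late write failing. Whether that fits the training code is an assumption.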
I need to understand what happens when I press "Enqueue" in the web UI and set it to the default queue
The Task ID is pushed into the execution queue (from the UI / backend side, that is all). Then you have clearml-agent
running on your machine; the agent listens on one or more queues and pulls jobs from them.
It pulls the Task ID from the queue, sets up the environment according to the Task (i.e. either inside a docker container or in a new virtual-env), clones the code, applies uncommitted changes, installs the python packages, etc. Then it spins up the code, which uses the configuration from the UI (instead of logging the configuration to the UI, as it does when executed manually).
Makes sense?
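The flow described above can be sketched as a toy model. This is purely conceptual and does not use the ClearML API: `StoredTask`, `enqueue`, `agent_poll_once`, the status strings, and the task ID `abc123` are all hypothetical names made up for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StoredTask:
    """Hypothetical stand-in for a task record kept by the backend."""
    task_id: str
    config: dict
    status: str = "created"
    log: list = field(default_factory=list)


def enqueue(queue, task):
    # Pressing "Enqueue" in the UI: only the task ID goes into the queue.
    task.status = "queued"
    queue.append(task.task_id)


def agent_poll_once(queue, backend):
    """One iteration of a toy agent loop: pull a task ID and execute it."""
    if not queue:
        return None
    task = backend[queue.popleft()]
    task.status = "in_progress"
    task.log.append("set up environment (docker container or virtual-env)")
    task.log.append("clone code, apply uncommitted changes, install packages")
    # The run reads the configuration stored with the task (editable in the UI)
    task.log.append(f"run with stored config: {task.config}")
    task.status = "completed"
    return task


backend = {"abc123": StoredTask("abc123", {"experiment": "gm_fl_dcl"})}
default_queue = deque()
enqueue(default_queue, backend["abc123"])
finished = agent_poll_once(default_queue, backend)
print(finished.status)
for line in finished.log:
    print(line)
```

The point of the sketch is that the queue carries only an ID. Everything the agent needs to run (code reference, packages, configuration) is looked up from the stored task.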
So now I ran the example and I do see the Hydra tab. Is it the experiment arg that I used to run it? `python hydra_example.py experiment=gm_fl_dcl`
OK, let me check it later today and come back with the results of the example app.
Yes, everything runs on the same machine, in different Docker containers.
Hi AgitatedDove14 !
Thanks for your answers. Now I have a follow-up. I was able to successfully run the experiment, copy it in the UI, enqueue it to the default queue, and see it complete.
AgitatedDove14 orchestration module - what is this and where can I read more about it?
Previously I had a General tab under Hyper Parameters, but now, without this line, I don't have it,
and experiments are now stuck in "Running" mode even after the training loop has finished.
When you previously mentioned cloning the Task in the UI and then running it, how do you actually run it?
Very good question. I need to understand what happens when I press "Enqueue" in the web UI and set it to the default queue.
Here are the requirements from the repository in which I was able to run hydra_example.py and in which my custom train.py crashes.