Hi, I Am Trying To Run Experiment From Clearml Web Ui. I Did Experiment Copy, Enqueue, But In The Execution Log I See That It Runs Command

Answered

Hi, I am trying to run an experiment from the ClearML web UI. I copied the experiment and enqueued it, but in the execution log I see that it runs the command
`[.]$ /home/exx/.clearml/venvs-builds/3.8/bin/python -u train.py`
but I need to add `experiment=my_config` after `train.py`. Is there any way to do this from the UI?
Thank you

  				
Posted 3 years ago by MortifiedDove27


Answers 31

Thanks MortifiedDove27! Let me see if I can reproduce it. If I understand correctly, the difference is that Task.init is called in a nested function, is that it?
BTW, what is the Hydra version? And Python and OS?

  				
Posted 3 years ago by AgitatedDove14

Here are the requirements from the repository in which I was able to run hydra_example.py and in which my custom train.py crashes.

  				
Posted 3 years ago by MortifiedDove27

  				

Previously I had a General tab under Hyper Parameters, but now, without this line, I don't have it.

  				
Posted 3 years ago by MortifiedDove27

One more interesting bug: after I changed my train.py to follow hydra_example.py, I started getting errors at the end of the experiment:
--- Logging error ---
2021-08-17 13:33:28 ValueError: I/O operation on closed file.
2021-08-17 13:33:28 File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 200, in write self._terminal._original_write(message) # noqa
2021-08-17 13:33:28 File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 141, in _stdout__patched__write__ return StdStreamPatch._stdout_proxy.write(*args, **kwargs)
2021-08-17 13:33:28 File "/opt/conda/lib/python3.8/logging/__init__.py", line 1084, in emit stream.write(msg + self.terminator)
2021-08-17 13:33:28 Traceback (most recent call last):
2021-08-17 13:33:28 Message: 'Waiting to finish uploads' Arguments: ()
2021-08-17 13:33:28 File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 3005, in __shutdown self.log.info('Waiting to finish uploads')
2021-08-17 13:33:28 File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 2915, in _at_exit self.__shutdown()
2021-08-17 13:33:28 Call stack:

  				
Posted 3 years ago by MortifiedDove27

Hmm, are you running the clearml-agent on this machine? (This is the orchestration module; it spins up the Tasks and the Docker containers on the GPUs.)

  				
Posted 3 years ago by AgitatedDove14

I can only assume that task = Task.init(project_name=cfg.project.name, task_name=cfg.project.exp_name) breaks because it has to read the config, and depending on where I run it, it may not have access to the config. I will investigate this with my co-worker and let you know if we find a solution.

One more important thing: I have an NVIDIA-based Docker container running on the Ubuntu server (the same one that hosts the ClearML server), and I am afraid that tasks started from the command line and tasks started from the ClearML web UI run in different environments, and that this causes the issues. I don't know how to check the differences, though.

  				
Posted 3 years ago by MortifiedDove27

"When you previously mentioned cloning the Task in the UI and then running it, how do you actually run it?"

Very good question. I need to understand what happens when I press "Enqueue" in the web UI and send it to the default queue.

  				
Posted 3 years ago by MortifiedDove27

And experiments are now stuck in "Running" status even after the training loop has finished.

  				
Posted 3 years ago by MortifiedDove27

If it is best practice to have one more Docker container with the ClearML client, I will be happy to set it up, but I see no particular benefit in splitting it out from the NVIDIA Docker container that runs the experiments.

  				
Posted 3 years ago by MortifiedDove27

```
cfg.pretty() is deprecated and will be removed in a future version.
Use OmegaConf.to_yaml(cfg)

--- Logging error ---
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/logging/__init__.py", line 1084, in emit
    stream.write(msg + self.terminator)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 141, in _stdout__patched__write__
    return StdStreamPatch._stdout_proxy.write(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/logger.py", line 200, in write
    self._terminal._original_write(message) # noqa
ValueError: I/O operation on closed file.
Call stack:
  File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 2915, in _at_exit
    self.__shutdown()
  File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 3005, in __shutdown
    self.log.info('Waiting to finish uploads')
Message: 'Waiting to finish uploads'
Arguments: ()
```

  				
Posted 3 years ago by MortifiedDove27

MortifiedDove27, did you update to the latest clearml Python package?

  				
Posted 3 years ago by AgitatedDove14

The Docker container has access to all 4 GPUs via the --gpus all flag, and we specify in the config which CUDA device(s) to run on; in PyTorch we can run on more than 2 GPUs.

  				
Posted 3 years ago by MortifiedDove27

Thanks!

  				
Posted 3 years ago by AgitatedDove14

task = Task.init(project_name=cfg.project.name, task_name=cfg.project.exp_name) After some discussion, we suspect the problem is that we use the config before initializing the task. Can that cause any problems?

  				
Posted 3 years ago by MortifiedDove27

Hi AgitatedDove14!
Thanks for your answers. Now I have a follow-up. I was able to successfully run the experiment, copy it in the UI, enqueue it to the default queue, and see it complete.

  				
Posted 3 years ago by MortifiedDove27

AgitatedDove14, what is the orchestration module, and where can I read more about it?

  				
Posted 3 years ago by MortifiedDove27

"Martin, thank you very much for your time and dedication, I really appreciate it"

My pleasure 🙂

"Yes, I have latest 1.0.5 version now and it gives same result in UI as previous version that I used"

Hmm, are you saying the automatic Hydra connection doesn't work? Is it the folder structure?
When is Task.init called?
See the example here:
https://github.com/allegroai/clearml/blob/master/examples/frameworks/hydra/hydra_example.py

  				
Posted 3 years ago by AgitatedDove14

Python 3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0] :: Anaconda, Inc. on linux
clearml.__version__
'1.0.5'
Ubuntu 20.04.1 LTS

  				
Posted 3 years ago by MortifiedDove27

Ok, let me check it later today and come back with the results of the example app

  				
Posted 3 years ago by MortifiedDove27

So now I ran the example and I see the Hydra tab. Is this the experiment argument that I used to run it?
python hydra_example.py experiment=gm_fl_dcl

  				
Posted 3 years ago by MortifiedDove27

Martin, thank you very much for your time and dedication, I really appreciate it

  				
Posted 3 years ago by MortifiedDove27

Yes, everything runs on the same machine, in different Docker containers.

  				
Posted 3 years ago by MortifiedDove27

Nevertheless, when I try to run my training code, which differs very little from the example, I can't copy and run it from the UI, and I don't even see the hyperparameters in the experiment results.
```
import os
import hydra
from hydra import utils
from utils.class_utils import instantiate
from omegaconf import DictConfig, OmegaConf
from clearml import Task


@hydra.main(config_path="conf", config_name="default")
def app(cfg):
    run(cfg)


def run(cfg):
    # Task.init is called here, inside a nested function, using values from the Hydra config
    task = Task.init(project_name=cfg.project.name, task_name=cfg.project.exp_name)
    logger = task.get_logger()
    logger.report_text("You can view your full hydra configuration under Configuration tab in the UI")

    print(OmegaConf.to_yaml(cfg))
    print('+'*200)

    # some other hydra.utils.instantiate code

    trainer.train()


if __name__ == "__main__":
    app()
```

  				
Posted 3 years ago by MortifiedDove27

We do call sys.stdout.close() 🙂 I forgot to mention it.
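
A minimal sketch of why closing stdout would likely produce the "I/O operation on closed file" error reported in this thread, using only the standard library. An io.StringIO object stands in for sys.stdout (an assumption, so the snippet does not close your real terminal stream), and write_after_close is a hypothetical helper name. It only illustrates the Python behaviour; it is not how ClearML writes its logs internally.

```python
import io


def write_after_close() -> str:
    """Write to a stream after closing it and return the error message."""
    stream = io.StringIO()  # stand-in for sys.stdout (assumption)
    stream.write("training finished\n")
    stream.close()  # same effect as calling sys.stdout.close() in a script
    try:
        # Anything that still tries to write at exit time hits a closed stream.
        stream.write("Waiting to finish uploads\n")
    except ValueError as exc:
        return str(exc)
    return "no error"


print(write_after_close())
```

If this is what happens in the training script, removing the sys.stdout.close() call (or flushing instead of closing) would be the first thing to try.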

  				
Posted 3 years ago by MortifiedDove27

A couple of words about our Hydra config:
it is located in the root directory, next to the train.py file. But the default config points to the experiment folder with the other configs, and this is what I need to specify on every run.
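
To make the shape of that per-run argument concrete, here is a toy sketch that splits a command line such as train.py experiment=my_config into the script name and its key=value pairs. split_overrides is a hypothetical helper written for illustration; it is not Hydra's parser, and Hydra's real override grammar accepts more forms than plain key=value.

```python
def split_overrides(argv):
    """Separate a script name from simple key=value arguments."""
    script, *rest = argv
    overrides = {}
    for arg in rest:
        key, sep, value = arg.partition("=")
        if not sep:
            # Only plain key=value pairs are handled in this toy version.
            raise ValueError(f"not a key=value override: {arg!r}")
        overrides[key] = value
    return script, overrides


script, overrides = split_overrides(["train.py", "experiment=my_config"])
print(script, overrides)
```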

  				
Posted 3 years ago by MortifiedDove27

Yes, I have the latest version, 1.0.5, now, and it gives the same result in the UI as the previous version I used.

  				
Posted 3 years ago by MortifiedDove27

As long as you import clearml in the main script, it should work. Regarding the NVIDIA container, it should not interfere with any running processes; the only issue is the memory limit. BTW, is there any reason not to spin up an agent on a dedicated machine? What is the GPU used for on the ClearML server machine?

  				
Posted 3 years ago by AgitatedDove14

We have a physical server in a server farm that we configured with 4 GPUs, so we run everything on this hardware without renting cloud machines.

  				
Posted 3 years ago by MortifiedDove27

"I need to understand what happens when I press 'Enqueue' in the web UI and set it to the default queue"

The Task ID is pushed into the execution queue (on the UI/backend side, that is all that happens). Then you have clearml-agent running on your machine; the agent listens on one or more queues and pulls jobs from them.
It pulls the Task ID from the queue, sets up the environment according to the Task (i.e. either inside a Docker container or in a new virtual environment), clones the code, applies uncommitted changes, installs the Python packages, etc. Then it launches the code, which uses the configuration from the UI (instead of logging its configuration to the UI, as it does when executed manually).
Make sense?
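
The flow described above can be pictured with a toy model, sketched here with the standard library only. Everything in it (task_store, queue, enqueue, agent_poll_once, and the example values) is a hypothetical stand-in and not part of the ClearML API; the point is only that the queue carries a Task ID and the agent looks up the rest.

```python
from collections import deque

# Toy stand-ins; none of these names belong to the ClearML API.
task_store = {}   # task ID -> stored task definition (what the UI shows)
queue = deque()   # the execution queue holds task IDs only


def enqueue(task_id, definition):
    """Store the task definition and push only its ID onto the queue."""
    task_store[task_id] = definition
    queue.append(task_id)


def agent_poll_once():
    """Pull one task ID and list the steps an agent would take for it."""
    if not queue:
        return None
    task_id = queue.popleft()
    definition = task_store[task_id]
    return [
        f"pulled task {task_id}",
        f"set up environment: {definition['environment']}",
        f"cloned {definition['repo']} and applied uncommitted changes",
        "installed python packages",
        f"ran {definition['script']} with stored config {definition['config']}",
    ]


enqueue("abc123", {
    "environment": "virtualenv",
    "repo": "example-repo",
    "script": "train.py",
    "config": {"experiment": "my_config"},
})
steps = agent_poll_once()
for step in steps:
    print(step)
```

Note that the last step reads the configuration from the stored task rather than from a command line, which mirrors the explanation that an enqueued run uses whatever is saved in the UI.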

  				
Posted 3 years ago by AgitatedDove14
