Answered

Hi Everyone. I Have An Issue With The Simple Pipeline - It Runs Two Similar Nn Training Steps (Tf2.3, Windows10, Python 3.7) With Only Difference Is A Batch Size. I'M Running First Separately Each Step To Have Them In Clearml Project Page. Then I Run Pipe

Hi everyone. I have an issue with a simple pipeline: it runs two similar NN training steps (TF2.3, Windows 10, Python 3.7) whose only difference is the batch size. First I run each step separately so that they appear on the ClearML project page. Then I run the pipeline controller, which makes a clone of each step and runs smoothly. If I run the pipeline from the command line again, it works OK. However, if I clone and enqueue the pipeline, it starts, creates the clone of the first step as pending, and then nothing happens. The first step remains pending and doesn't start. Can anyone help with the issue? Here's the pipeline controller code:
```
from clearml import Task
from clearml.automation.controller import PipelineController

# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name='Tom', task_name='test pipeline',
                 task_type=Task.TaskTypes.controller, reuse_last_task_id=False)

pipe = PipelineController(default_execution_queue='default', add_pipeline_tags=False)
pipe.add_step(name='train_1st_nn_copy', base_task_project='Tom', base_task_name='train_1st_nn',
              parameter_override={'batch_size': 8})
pipe.add_step(name='train_2nd_nn_copy', parents=['train_1st_nn_copy', ],
              base_task_project='Tom', base_task_name='train_2nd_nn',
              parameter_override={'batch_size': 4})

# Starting the pipeline (in the background)
pipe.start()
# Wait until pipeline terminates
pipe.wait()
# cleanup everything
pipe.stop()

print('done')
```
If I abort the pipeline controller task, the pending "train_1st_nn" task executes OK.

  				
Posted 4 years ago by BattyLion34


Answers 31

AgitatedDove14 Yes, that's what I have - for me it's weird, too.

  				
Posted 4 years ago by BattyLion34

BattyLion34 Okay, I'll try to see if we can solve the multi-instance issue on Windows (because obviously it should be automatic)

  				
Posted 4 years ago by AgitatedDove14

BattyLion34

> if I simply clone nntraining stage and run it in default queue - everything goes fine.

When you compare the Task you cloned manually and the Task created by the pipeline, what's the difference?

  				
Posted 4 years ago by AgitatedDove14

OK, I ran it (I just used a period instead of a comma in the print statement; I mention it in case someone reading this runs the code). The output is attached to this message.

  				
Posted 4 years ago by BattyLion34

Hi BattyLion34
I might have a solution. To make sure the two agents are not sharing the "temp" folder, create two copies of ~/clearml.conf, let's call them:
~/clearml_service.conf
~/clearml_agent.conf
Then in each one select a different venvs_dir, see here:
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L90
for example:
~/.clearml/venvs-builds1
~/.clearml/venvs-builds2
Now start the two agents with:
The service agent:
clearml-agent --config-file ~/clearml_service.conf daemon ....
And the "regular" agent:
clearml-agent --config-file ~/clearml_agent.conf daemon ....
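
The copy-and-edit step can also be scripted. This is only a sketch, assuming the base config contains an uncommented `venvs_dir` entry like the one in the linked default clearml.conf. The helper names `with_venvs_dir` and `write_agent_configs` are made up for illustration and are not part of clearml-agent:

```python
import re
from pathlib import Path

# Matches an uncommented line such as `venvs_dir = ~/.clearml/venvs-builds`
# (the config format accepts either `=` or `:` as the separator).
_VENVS_DIR = re.compile(r"^(\s*venvs_dir\s*[:=]\s*).*$", flags=re.MULTILINE)


def with_venvs_dir(conf_text: str, new_dir: str) -> str:
    """Return conf_text with the value of venvs_dir replaced by new_dir."""
    # A function replacement avoids backslash-escape surprises with Windows paths.
    new_text, count = _VENVS_DIR.subn(lambda m: m.group(1) + new_dir, conf_text)
    if count == 0:
        raise ValueError("no uncommented venvs_dir entry found in the config text")
    return new_text


def write_agent_configs(base_conf: Path, targets: dict) -> None:
    """Write one copy of base_conf per target path, each with its own venvs_dir."""
    text = base_conf.read_text()
    for conf_path, venvs_dir in targets.items():
        Path(conf_path).write_text(with_venvs_dir(text, venvs_dir))


# Example call (paths taken from the post above; adjust to your machine):
# home = Path.home()
# write_agent_configs(home / "clearml.conf", {
#     home / "clearml_service.conf": "~/.clearml/venvs-builds1",
#     home / "clearml_agent.conf": "~/.clearml/venvs-builds2",
# })
```

Each agent is then started with its own `--config-file`, as described above, so the two never build a virtualenv in the same folder.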

  				
Posted 4 years ago by AgitatedDove14

AgitatedDove14 Yes, the difference in installed packages is large. The training stage that runs OK has all of the following:

  				
Posted 4 years ago by BattyLion34

No, there is no other agent running.

  				
Posted 4 years ago by BattyLion34

astunparse==1.6.3
attrs==20.3.0
botocore==1.19.63
cachetools==4.2.1
certifi==2020.12.5
chardet==4.0.0
cycler==0.10.0
Cython==0.29.21
furl==2.1.0
future==0.18.2
gast==0.3.3
google-auth==1.25.0
google-auth-oauthlib==0.4.2
google-pasta==0.2.0
grpcio==1.35.0
h5py==2.10.0
humanfriendly==9.1
idna==2.10
importlib-metadata==3.4.0
jmespath==0.10.0
jsonschema==3.2.0
Keras-Preprocessing==1.1.2
kiwisolver==1.3.1
Markdown==3.3.3
oauthlib==3.1.0
opt-einsum==3.3.0
orderedmultidict==1.0.1
pathlib2==2.3.5
pathtools==0.1.2
Pillow==8.1.0
protobuf==3.14.0
psutil==5.8.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
PyJWT==2.0.1
pyparsing==2.4.7
pyreadline==2.1
pyrsistent==0.17.3
python-dateutil==2.8.1
PyYAML==5.4.1
requests==2.25.1
requests-file==1.5.1
requests-oauthlib==1.3.0
rsa==4.7
s3transfer==0.3.4
scipy==1.6.0
six==1.15.0
tensorboard==2.2.2
tensorboard-plugin-wit==1.8.0
tensorflow-gpu-estimator==2.2.0
termcolor==1.1.0
threadpoolctl==2.1.0
typing-extensions==3.7.4.3
urllib3==1.26.3
watchdog==0.10.3
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.4.0

  				
Posted 4 years ago by BattyLion34

No, I mean actually compare using the UI, maybe the arguments are different or the "installed packages"

  				
Posted 4 years ago by AgitatedDove14

AgitatedDove14 Looks like that. First, I created a toy task running in the "services" queue (you didn't say so, but I guess you assumed it). I haven't found how to specify the queue in code ( Task.equeue(task, queue_name='services') returned an error), so I ran toy.py first in the "default" queue, aborted toy.py, and started nntraining in the "default" queue. Then I reset toy.py and enqueued it to the "services" queue. Toy.py failed shortly afterwards. I also reset both toy.py and nntraining and enqueued first toy.py (in the "services" queue) and then nntraining (in the "default" queue). In this case, nntraining failed. In both failed cases the error is the same:
Traceback (most recent call last):
  File "c:\users\super\anaconda3\envs\tf22\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\super\anaconda3\envs\tf22\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\users\super\anaconda3\envs\tf22\lib\site-packages\virtualenv.py", line 2633, in <module>
    main()
  File "c:\users\super\anaconda3\envs\tf22\lib\site-packages\virtualenv.py", line 869, in main
    symlink=options.symlink,
  File "c:\users\super\anaconda3\envs\tf22\lib\site-packages\virtualenv.py", line 1161, in create_environment
    install_python(home_dir, lib_dir, inc_dir, bin_dir, site_packages=site_packages, clear=clear, symlink=symlink)
  File "c:\users\super\anaconda3\envs\tf22\lib\site-packages\virtualenv.py", line 1531, in install_python
    shutil.copyfile(executable, py_executable)
  File "c:\users\super\anaconda3\envs\tf22\lib\shutil.py", line 121, in copyfile
    with open(dst, 'wb') as fdst:
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\Super\\.clearml\\venvs-builds\\3.7\\Scripts\\python.exe'
Using base prefix 'c:\\users\\super\\anaconda3\\envs\\tf22'
No LICENSE.txt / LICENSE found in source
New python executable in C:\Users\Super\.clearml\venvs-builds\3.7\Scripts\python.exe
clearml_agent: ERROR: Command '['python', '-m', 'virtualenv', 'C:\\Users\\Super\\.clearml\\venvs-builds\\3.7']' returned non-zero exit status 1.

Hence, the process that runs first blocks the process that runs second in another queue. The type of queue, either "default" or "services", doesn't play any role.

  				
Posted 4 years ago by BattyLion34

BattyLion34 let me see if I understand.
The same base_task_id, when cloned by the UI and enqueued on the same queue as the pipeline, will work, but when the pipeline runs the same Task it fails?!
Could it be that you enqueue them on different queues?

  				
Posted 4 years ago by AgitatedDove14

Thanks BattyLion34 I fixed the code snippet :)

  				
Posted 4 years ago by AgitatedDove14

BattyLion34
Maybe something inside the task is different?!
Could you run these lines and send me the result:
from clearml import Task
print(Task.get_task(task_id='failing task id').export_task())
print(Task.get_task(task_id='working task id').export_task())
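
Two printed exports are hard to compare by eye. Since `export_task()` returns a plain dictionary, a small recursive diff can list only the fields that differ. This is a rough sketch: `diff_dicts` and `MISSING` are hypothetical helper names, not part of ClearML:

```python
from typing import Any, Iterator, Tuple

# Placeholder reported when a key exists on one side only.
MISSING = "<missing>"


def diff_dicts(left: Any, right: Any, path: str = "") -> Iterator[Tuple[str, Any, Any]]:
    """Yield (dotted_path, left_value, right_value) for every value that differs."""
    if isinstance(left, dict) and isinstance(right, dict):
        # Walk the union of keys so that one-sided keys are reported too.
        for key in sorted(set(left) | set(right), key=str):
            child = f"{path}.{key}" if path else str(key)
            yield from diff_dicts(left.get(key, MISSING), right.get(key, MISSING), child)
    elif left != right:
        yield path, left, right


# Intended use with the two exports from the snippet above:
# failing = Task.get_task(task_id='failing task id').export_task()
# working = Task.get_task(task_id='working task id').export_task()
# for path, a, b in diff_dicts(failing, working):
#     print(path, '|', a, '|', b)
```

Fields such as the task id and timestamps will always differ; the interesting lines are the ones under the script, requirements and execution sections.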

  				
Posted 4 years ago by AgitatedDove14

> How can the first process corrupt the second

I think that something went wrong and both agents are using the same "temp" folder to set up the experiment.

> why doesn't this occur if I run pipeline from command line?

The services queue is creating new dockers with everything in them so they cannot step on each other's toes (so to speak)

> I run all the processes as administrator. However, I've tested running the pipeline from command line in non-administrator mode, it works fine.

Yes you are correct, no reason to run as Admin.

BattyLion34 Let me check regarding the "temp" folder, I think there is a way around it (and if there is a bug we will fix it regardless)

  				
Posted 4 years ago by AgitatedDove14

BattyLion34 I have a theory: I think that any Task on the "default" queue will fail if a Task is running on the "service" queue.
Could you create a toy Task that just prints "." and sleeps for 5 seconds and then prints again?
Then, while that Task is running, launch from the UI the Task that passed on the "default" queue. If my theory holds it should fail, and then we will be getting somewhere 🙂
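
For anyone repeating the experiment, the toy Task could look roughly like this. It is a sketch under assumptions: the project name 'Tom' is reused from the question, the task name 'toy' is made up, and the clearml import is optional so the script also runs where the package is not installed:

```python
import time

try:
    from clearml import Task  # only needed to register the run with ClearML
except ImportError:
    Task = None


def toy(sleep_seconds: float = 5.0, out=print) -> None:
    """Print a dot, sleep, then print a dot again."""
    out(".")
    time.sleep(sleep_seconds)
    out(".")


def main() -> None:
    if Task is not None:
        # Hypothetical names; any project/task name works for this test.
        Task.init(project_name="Tom", task_name="toy")
    toy()


# Call main() at the bottom of the script when running it for real.
```

Run it once locally so it shows up in the project, then clone and enqueue it from the UI on the queue you want to test.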

  				
Posted 4 years ago by AgitatedDove14

Exactly! To be more specific, the same base_task_id fails if the pipeline is cloned and started from the UI. I've checked the queues for the failed and completed tasks - they are the same (default, gpu-all).

  				
Posted 4 years ago by BattyLion34

That makes no sense to me?!
Are you absolutely sure the nntrain is executed on the same queue? (basically could it be that the nntraining is executed on a different queue in these two cases ?)

  				
Posted 4 years ago by AgitatedDove14

Completed task:

  				
Posted 4 years ago by BattyLion34

Well, I'm pretty sure that nntraining is executed in the same queue for these two cases:

  				
Posted 4 years ago by BattyLion34

Failed task:

  				
Posted 4 years ago by BattyLion34

AgitatedDove14 According to the logs (up to traceback message), the only difference between those two tasks is task id name

  				
Posted 4 years ago by BattyLion34

Could you send me the console log of both tasks, the failing one and the passing one?

  				
Posted 4 years ago by AgitatedDove14

Here's also the log of the failed pipeline - it may give a clue.

  				
Posted 4 years ago by BattyLion34

AgitatedDove14 It works!!! Thanks a lot!

  				
Posted 4 years ago by BattyLion34

  				

AgitatedDove14 How can the first process corrupt the second and why doesn't this occur if I run pipeline from command line? Just to be precise - I run all the processes as administrator. However, I've tested running the pipeline from command line in non-administrator mode, it works fine.

  				
Posted 4 years ago by BattyLion34

BattyLion34 is this consistent?
(Really, I can't see any difference: one time it is able to create the venv and another time it fails with a permission error)

  				
Posted 4 years ago by AgitatedDove14

These libraries are absent from the variant that fails. The only libraries in that variant (all of them are also present in the correctly working one) are:
absl_py==0.9.0
boto3==1.16.6
clearml==0.17.4
joblib==0.17.0
matplotlib==3.3.1
numpy==1.18.4
scikit_learn==0.23.2
tensorflow_gpu==2.2.0
watchdog==0.10.3
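
A comparison like this can be done mechanically by pasting the two "installed packages" lists into a short script. The sketch below is only an illustration with hypothetical helper names. It normalizes names (lowercase, with `-`, `_` and `.` treated alike) so that `absl_py` and `absl-py` count as the same package:

```python
import re


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of '-', '_' and '.' into a single '-'."""
    return re.sub(r"[-_.]+", "-", name.strip()).lower()


def parse_requirements(text: str) -> dict:
    """Map normalized package name -> pinned version for `name==version` lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not a simple pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[normalize(name)] = version.strip()
    return pins


def only_in_first(first: str, second: str) -> list:
    """Return the sorted package names pinned in `first` but not in `second`."""
    return sorted(set(parse_requirements(first)) - set(parse_requirements(second)))


# Example with a few lines from the lists in this thread:
# working = "astunparse==1.6.3\nwatchdog==0.10.3\n"
# failing = "absl_py==0.9.0\nwatchdog==0.10.3\n"
# print(only_in_first(working, failing))
```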

  				
Posted 4 years ago by BattyLion34

No, I have only two agents pulling from different queues:

  				
Posted 4 years ago by BattyLion34

YEY!

  				
Posted 4 years ago by AgitatedDove14
