How can the first process corrupt the second
I think that something went wrong and both Agents are using the same "temp" folder to setup the experiment.
why doesn't this occur if I run pipeline from command line?
The services queue is creating new dockers with everything in them so they cannot step on each others toes (so to speak)
I run all the processes as administrator. However, I've tested running the pipeline from command line in non-administrator mode, it works fine.
Yes you are correct, no reason to run as Admin.
BattyLion34 Let me check regrading the "temp" folder, I think there is around it (and if there is a bug we will fix it regardless)
AgitatedDove14 How can the first process corrupt the second and why doesn't this occur if I run pipeline from command line? Just to be precise - I run all the processes as administrator. However, I've tested running the pipeline from command line in non-administrator mode, it works fine.
AgitatedDove14 Yes, that's what I have - for me it's weird, too.
BattyLion34 I have a theory, I think that any Task on the "default" queue qill fail if a Task is running on the "service" queue.
Could you create a toy Task that just print "." and sleeps for 5 seconds and then prints again.
Then while that Task is running, from the UI launch the Task that passed on the "default" queue. If my theory holds it should fail, then we will be getting somewhere 🙂
BattyLion34
Maybe something inside the task is different?!
Could you run these lines and send me the result:from clearml import Task print(Task.get_task(task_id='failing task id').export_task()) print(Task.get_task(task_id='working task id').export_task())
These libraries are absent in the option, which fails. The only libraries of that option (all are present in correct-working option) are:
absl_py==0.9.0
boto3==1.16.6
clearml==0.17.4
joblib==0.17.0
matplotlib==3.3.1
numpy==1.18.4
scikit_learn==0.23.2
tensorflow_gpu==2.2.0
watchdog==0.10.3
BattyLion34 let me see if I understand.
The same base_task_id when cloned by the UI and enqueues on the same queue as the pipeline, will work but when the pipeline runs the same Task it fails?!
Could it be that you enqueue them on different queues ?
Ok, ran (just used point instead of comma in print statement - comment if someone reading this will run this code). Attached to this message.
Thanks BattyLion34 I fixed the code snippet :)
BattyLion34 is this consistent?
(Really I can't see eny difference, one time it is able to create the venv and another it is failing with permission error)
Here's also the log of failed pipeline - maybe it may give a clue.
No, I have only two agents pulling from different queue:
astunparse==1.6.3
attrs==20.3.0
botocore==1.19.63
cachetools==4.2.1
certifi==2020.12.5
chardet==4.0.0
cycler==0.10.0
Cython==0.29.21
furl==2.1.0
future==0.18.2
gast==0.3.3
google-auth==1.25.0
google-auth-oauthlib==0.4.2
google-pasta==0.2.0
grpcio==1.35.0
h5py==2.10.0
humanfriendly==9.1
idna==2.10
importlib-metadata==3.4.0
jmespath==0.10.0
jsonschema==3.2.0
Keras-Preprocessing==1.1.2
kiwisolver==1.3.1
Markdown==3.3.3
oauthlib==3.1.0
opt-einsum==3.3.0
orderedmultidict==1.0.1
pathlib2==2.3.5
pathtools==0.1.2
Pillow==8.1.0
protobuf==3.14.0
psutil==5.8.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
PyJWT==2.0.1
pyparsing==2.4.7
pyreadline==2.1
pyrsistent==0.17.3
python-dateutil==2.8.1
PyYAML==5.4.1
requests==2.25.1
requests-file==1.5.1
requests-oauthlib==1.3.0
rsa==4.7
s3transfer==0.3.4
scipy==1.6.0
six==1.15.0
tensorboard==2.2.2
tensorboard-plugin-wit==1.8.0
tensorflow-gpu-estimator==2.2.0
termcolor==1.1.0
threadpoolctl==2.1.0
typing-extensions==3.7.4.3
urllib3==1.26.3
watchdog==0.10.3
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.4.0
BattyLion34
if I simply clone nntraining stage and run it in default queue - everything goes fine.
When you compare the Task you clone manually and the Task created by the pipeline , what's the difference ?
Could you send me the cosnole log of both tasks, failing and passing one?
No, I mean actually compare using the UI, maybe the arguments are different or the "installed packages"
Do you have two agents pulling from the same queue ?
Maybe one of them is configured differently ?
Exactly! To be more specified - the same base_task_id fails, if the pipeline is cloned and started from UI. I've checked the queues for failed and completed tasks - they are the same (default, gpu-all).
That makes no sense to me?!
Are you absolutely sure the nntrain is executed on the same queue? (basically could it be that the nntraining is executed on a different queue in these two cases ?)
Hi BattyLion34
I might have a solution, in order to make sure the two agents are not sharing the "temp" folder:
create two copies of ~/clearml.conf , let's call them :
~/clearml_service.conf ~/clearml_agent.confThen in each one select a different venvs_dir
see here:
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L90
for example:
~/.clearml/venvs-builds1 ~/.clearml/venvs-builds2Now start the two agents with:
The service agent:clearml-agent --config-file ~/clearml_service.conf daemon ....
And the "regular" agent:clearml-agent --config-file ~/clearml_agent.conf daemon ....
AgitatedDove14 Looks like that. First, I've created a toy task running in "services" queue (you didn't tell that but I guess you assumed). I haven't found how to specify the queue to run in code ( Task.equeue(task, queue_name='services')
returned an error), so I ran toy.py first in "default" queue, aborted toy.py, started nntraining in "default" queue. Then I reset toy.py and enqueued it to "services" queue. Toy.py failed shortly. I've also reset both toy.py and nntraining and enqueued first toy.py (in "services" que) and then - nntraining (in "default" queue). In this case, nntraining failed. In both failed cases error is the same:Traceback (most recent call last): File "c:\users\super\anaconda3\envs\tf22\lib\runpy.py", line 193, in _run_module_as_main "__main__", mod_spec) File "c:\users\super\anaconda3\envs\tf22\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "c:\users\super\anaconda3\envs\tf22\lib\site-packages\virtualenv.py", line 2633, in <module> main() File "c:\users\super\anaconda3\envs\tf22\lib\site-packages\virtualenv.py", line 869, in main symlink=options.symlink, File "c:\users\super\anaconda3\envs\tf22\lib\site-packages\virtualenv.py", line 1161, in create_environment install_python(home_dir, lib_dir, inc_dir, bin_dir, site_packages=site_packages, clear=clear, symlink=symlink) File "c:\users\super\anaconda3\envs\tf22\lib\site-packages\virtualenv.py", line 1531, in install_python shutil.copyfile(executable, py_executable) File "c:\users\super\anaconda3\envs\tf22\lib\shutil.py", line 121, in copyfile with open(dst, 'wb') as fdst: PermissionError: [Errno 13] Permission denied: 'C:\\Users\\Super\\.clearml\\venvs-builds\\3.7\\Scripts\\python.exe' Using base prefix 'c:\\users\\super\\anaconda3\\envs\\tf22' No LICENSE.txt / LICENSE found in source New python executable in C:\Users\Super\.clearml\venvs-builds\3.7\Scripts\python.exe clearml_agent: ERROR: Command '['python', '-m', 'virtualenv', 'C:\\Users\\Super\\.clearml\\venvs-builds\\3.7']' returned non-zero exit status 1.
Hence, the process, which runs first blocks the process, which runs second in another queue. The type of queue - either "default" or "services" doesn't play any role.
BattyLion34 Okay, I'll try to see if we can solve the multi-instance issue on Windows (because obviously it should be automatic)
AgitatedDove14 According to the logs (up to traceback message), the only difference between those two tasks is task id name
AgitatedDove14 Yes, the difference in installed packages is large - the training stage, which runs ok has all the following: