Hi, I Am Trying To Clone An Experiment. Using The Server Gui, I Select 'Clone' And Then 'Enqueue'. In The Console Window, I See That Clearml Makes Sure The Environment Is Installed, And Then It Goes Into A 'Completed' Status Although The Experiment Did Not Run.

Answered

Hi, I am trying to clone an experiment. Using the server GUI, I select 'clone' and then 'enqueue'. In the console window, I see that clearml makes sure the environment is installed, and then it goes into a 'completed' status although the experiment did not run.

  				
Posted 2 years ago by RotundSquirrel78


Answers 28

Hi RotundSquirrel78
How did you end up with this command line?
/home/sigalr/.clearml/venvs-builds/3.8/code/unet_sindiff_1_level_2_resblk --dataset humanml --device 0 --arch unet --channel_mult 1 --num_res_blocks 2 --use_scale_shift_norm --use_checkpoint --num_steps 300000
The arguments passed are odd (there should be none; they are passed inside the execution), and I suspect this is the issue.

  				
Posted 2 years ago by AgitatedDove14

Bingo (I guess). My code is local, with multiple files. I will try to connect it to a git repo and let you know how it worked.
Does the agent support uncommitted changes in multiple files (on top of a git commit)?

  				
Posted 2 years ago by RotundSquirrel78

Who/What created the initial experiment?

I created the initial experiment from the command line, with either "python folder/script.py" or "python -m folder.script".
Both end up with the experiment not running. I am attaching an agent daemon log where the initial experiment was called with "python folder/script.py".

Why isn't the entry point just the python script?

The entry point is folder.script and not just the script because I need the 'current' folder while running the script to be project root, so importing other packages in the project will work properly.
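
The difference between the two launch styles can be checked with a small standalone sketch. It assumes a throwaway project with a hypothetical folder/script.py and only prints the name of the first sys.path entry; it does not involve ClearML at all.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# The script prints the name of the directory Python put first on sys.path.
SCRIPT = "import sys\nfrom pathlib import Path\nprint(Path(sys.path[0]).resolve().name)\n"


def run_in(cwd, args):
    """Run the current interpreter with `args` inside `cwd` and return its stdout."""
    result = subprocess.run(
        [sys.executable, *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()          # stands in for the project root
    pkg = root / "folder"               # hypothetical package name
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "script.py").write_text(SCRIPT)

    # "python folder/script.py": the script's own directory comes first.
    as_file = run_in(root, ["folder/script.py"])
    # "python -m folder.script": the current working directory comes first.
    as_module = run_in(root, ["-m", "folder.script"])

print("as file:  ", as_file)
print("as module:", as_module)
```

With the file form, imports of sibling packages under the project root are not resolved through the first sys.path entry, which is why the module form is convenient for this layout.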

  				
Posted 2 years ago by RotundSquirrel78

Are you saying that odd script entry point was created by calling Task.init? (To clarify, this is the problem.)
Btw, after you clone the experiment you can always manually edit both the entry point and the working dir, which, based on what you said, should be "script.py" and "folder".

  				
Posted 2 years ago by AgitatedDove14

Could you upload the log so I can have a look?

  				
Posted 2 years ago by TimelyMouse69

This seems to be the issue:
PYTHONPATH = '.'
How is that happening?
Can you try to run the agent with:
PYTHONPATH= clearml-agent daemon ....
(Notice that the PYTHONPATH= prefix clears the environment variable, which is evidently what makes the python commands fail.)
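
If the agent is started from a Python wrapper rather than a shell, the same idea can be sketched as below. The helper name is made up for illustration, and the queue name in the commented launch line is a placeholder. Note that the shell prefix sets the variable to an empty string while this sketch removes it; for the interpreter the effect should be the same.

```python
import os


def env_without_pythonpath(base_env=None):
    """Return a copy of the environment with PYTHONPATH removed."""
    env = dict(os.environ if base_env is None else base_env)
    env.pop("PYTHONPATH", None)  # no error if the variable was never set
    return env


# Small demonstration on a hand-written environment.
cleaned = env_without_pythonpath({"PYTHONPATH": ".", "HOME": "/home/user"})
print(cleaned)

# Hypothetical launch (queue name "default" is an assumption):
# import subprocess
# subprocess.run(["clearml-agent", "daemon", "--queue", "default"],
#                env=env_without_pythonpath())
```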

  				
Posted 2 years ago by AgitatedDove14

As written above, I did the right click clone, then I did right click enqueue.
The experiment reported 'running', and immediately after preparing the environment it reported 'completed', without actually running my code. Please look at the beginning of this thread for output logs and more details.

  				
Posted 2 years ago by RotundSquirrel78

AgitatedDove14 , thank you so much for your help.
I had a long video session today with the Israeli clearml engineers. There were plenty of things I had to do, and the two major ones were to define the environment variable CLEARML_AGENT_SKIP_PIP_VENV_INSTALL so it points to my conda environment python, and to call 'import clearml' from the top of my file (it was called from inside a method).
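
For anyone landing here later, a rough sketch of the first fix: build the environment for the agent process with CLEARML_AGENT_SKIP_PIP_VENV_INSTALL pointing at an existing interpreter. The helper name is hypothetical, and sys.executable merely stands in for the conda environment's python; the real path depends on your setup.

```python
import os
import sys


def env_with_existing_python(python_path, base_env=None):
    """Return an environment copy that points the agent at an existing interpreter."""
    if not os.path.isfile(python_path):
        raise FileNotFoundError(python_path)
    env = dict(os.environ if base_env is None else base_env)
    env["CLEARML_AGENT_SKIP_PIP_VENV_INSTALL"] = python_path
    return env


# sys.executable is only a stand-in for the conda environment's python binary.
agent_env = env_with_existing_python(sys.executable, base_env={})
print(agent_env)
```

The second fix needs no helper: the 'import clearml' line simply moves from inside the method to the top of the script file.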
So now I can clone 🎉

  				
Posted 2 years ago by RotundSquirrel78

TimelyMouse69, yes, it ran successfully the first time, before I cloned it.

  				
Posted 2 years ago by RotundSquirrel78

FileNotFoundError: [Errno 2] No such file or directory
Could it be the file you are trying to run is not in the repository?
Are you running inside a docker?
Any chance you can send the full log?

  				
Posted 2 years ago by AgitatedDove14

The only thing I need to do is clone my experiment. Can you help me make this happen?

  				
Posted 2 years ago by RotundSquirrel78

AgitatedDove14 , I noticed that if I run the initial experiment by "python -m folder_name.script_name" then the script path contains the whole list of arguments as you observed.
On the other hand, if I run the initial experiment by "python folder_name/script_name.py", then the script path contains only 'script_name.py'.
In both cases I cannot clone the experiment, with the same results as I reported in my initial message.

  				
Posted 2 years ago by RotundSquirrel78

Any chance your code needs more than the main script, but is not in a git repo? The agent supports either a single script file or a git repo with multiple files.

  				
Posted 2 years ago by AgitatedDove14

Could it be the file you are trying to run is not in the repository?

It is unclear what file is missing. The only hint is "KeyError: '.'" and I am not sure what that refers to. All my code files are in the repository. Maybe the problem is with some installed package file?

Are you running inside a docker?

No, I am running inside a conda environment.

Any chance you can send the full log?

What I sent is the full agent daemon log. If you are asking for the console output, then it is attached.

  				
Posted 2 years ago by RotundSquirrel78

woot woot, glad to hear that!

  				
Posted 2 years ago by AgitatedDove14

That's pretty weird. I don't see any clear indication that something is wrong; it simply doesn't seem to execute the rest. Did it run successfully the first time, before you cloned it?

  				
Posted 2 years ago by TimelyMouse69

As you suggested, I tried with a git repository. Got a completely different error. Attached is the log file. Any idea what's wrong?

  				
Posted 2 years ago by RotundSquirrel78

Yes, it does, but these files must be committed to begin with. Basically, think of it as the 'git diff' output being stored and then applied by the agent.
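
Why the files must be committed first can be seen with plain git, independent of ClearML: a plain 'git diff' covers changes to tracked files but says nothing about files git has never seen. The sketch below is only an analogy under that assumption, not the agent's actual code; the file names and the demo identity are made up, and git must be available on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command in `cwd` with a throwaway identity and return its stdout."""
    result = subprocess.run(
        ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.com", *args],
        cwd=cwd, capture_output=True, text=True, check=True,
    )
    return result.stdout


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    git("init", "-q", cwd=repo)
    (repo / "train.py").write_text("print('v1')\n")
    git("add", "train.py", cwd=repo)
    git("commit", "-q", "-m", "initial", cwd=repo)

    (repo / "train.py").write_text("print('v2')\n")    # tracked file, modified
    (repo / "helper.py").write_text("print('new')\n")  # never added or committed

    diff = git("diff", cwd=repo)

print(diff)
```

The printed diff mentions train.py only; helper.py is invisible to it, so a diff-based replay on another machine would not recreate that file.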

  				
Posted 2 years ago by AgitatedDove14

But the python command does not have such arguments (--script, --cwd). What am I missing?
Or, do you mean that those should be added to the Args list when cloning?

  				
Posted 2 years ago by RotundSquirrel78

Attached are the agent log and the task log

  				
Posted 2 years ago by RotundSquirrel78

Great. If this is what you do, how come you need to change the entry script in the UI?

  				
Posted 2 years ago by AgitatedDove14

If you want to change the Args, go to the Args section in the Configuration tab; when the Task is in draft mode you can edit them there.

  				
Posted 2 years ago by AgitatedDove14

As you said, you just need to clone. Right-click, then clone?

  				
Posted 2 years ago by AgitatedDove14

I see such arguments (--script, --cwd) in the command 'clearml-task', but I am not using it. What I do is run my script ('python folder/script.py') and create a task inside it, using Task.init().

  				
Posted 2 years ago by RotundSquirrel78

AgitatedDove14 , I did nothing to generate a command-line. Just cloned the experiment and enqueued it. Used the server GUI.

  				
Posted 2 years ago by RotundSquirrel78

I did nothing to generate a command-line. Just cloned the experiment and enqueued it. Used the server GUI.

Who/What created the initial experiment ?

I noticed that if I run the initial experiment by "python -m folder_name.script_name"

"-m module" as the script entry is used to launch entry points that are python modules (it is translated to "python -m script").
Why isn't the entry point just the python script?
The command-line arguments are passed through the Args section of the Configuration tab.

  				
Posted 2 years ago by AgitatedDove14

Oh, I see. What you need is to pass '--script script.py' as the entry point and '--cwd folder' as the working dir.

  				
Posted 2 years ago by AgitatedDove14

Yes, I create the experiment by calling Task.init.
As you suggested, in the experiment tab I define the script path and the working directory.
Again, the task only created the environment and after that reported 'completed' without running my code.
Attaching the log of the last run, with the setting of the script and the folder.

  				
Posted 2 years ago by RotundSquirrel78
