Using PipelineController, I have task B that depends on task A, i.e. A -> B. Task B reads artifacts that were saved by task A. I defined the dependency between them using the parents parameter.

But task B fails with a "file not found" error when trying to load something task A saved. I double-checked: the file does exist, and loading it myself works...
This happens randomly; some executions work, some don't. Any idea what is going on?
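
For reference, a minimal sketch of the kind of setup described here, assuming the steps are registered with `PipelineController.add_step`. The helper name `add_dependent_steps` and the step, project, and task names are placeholders for illustration, not taken from the actual pipeline:

```python
def add_dependent_steps(pipe, project, task_a_name, task_b_name):
    """Register step A, then a step B that lists A as its parent.

    `pipe` is assumed to be a clearml PipelineController (any object with a
    compatible add_step method works). All names here are placeholders.
    """
    pipe.add_step(
        name='task_a',
        base_task_project=project,
        base_task_name=task_a_name,
    )
    pipe.add_step(
        name='task_b',
        parents=['task_a'],  # B is only scheduled after A has completed
        base_task_project=project,
        base_task_name=task_b_name,
    )
    return ['task_a', 'task_b']
```

With `parents` set this way, B should not start before A finishes, so a missing file in B is unlikely to be a plain ordering problem.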

  				
Posted 3 years ago by WackyRabbit7


Answers 30

WackyRabbit7, I am noticing that the files are saved locally, is there any chance that the files are over-written during the run or get deleted at some point and then replaced?

Also, is there a reason the files are being saved locally and not at the fileserver?

I couldn't reproduce it on my end, but in my case it always saves the files to the fileserver, so I'm curious what's making it save locally in your case.

  				
Posted 3 years ago by CostlyOstrich36

WackyRabbit7, can you try uploading the artifact with wait_on_upload=True?
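
A minimal sketch of what that suggestion could look like, assuming the models are uploaded in a loop through `Task.upload_artifact`. The helper `upload_models` and its `best_index` argument are hypothetical names used only for illustration:

```python
def upload_models(task, models, best_index):
    """Upload each model as an artifact, blocking until every upload completes.

    `task` is assumed to be a clearml Task; `models` maps an iteration index
    to a picklable object. `best_index` is a hypothetical argument marking
    which model gets the 'best_iter_' prefix.
    """
    uploaded = []
    for index, model in models.items():
        prefix = 'best_iter_' if index == best_index else 'iter_'
        name = prefix + str(index)
        # wait_on_upload=True makes the call return only after the upload is done
        task.upload_artifact(name=name, artifact_object=model, wait_on_upload=True)
        uploaded.append(name)
    return uploaded
```

The idea is to rule out the case where task A is marked as finished while an artifact upload is still running in the background.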

  				
Posted 3 years ago by TimelyPenguin76 (Administrator)

Any news on this? This is kind of creepy: it's something so basic, and I can't trust my prediction pipeline because it sometimes fails randomly for no reason.

  				
Posted 3 years ago by WackyRabbit7

WackyRabbit7 aight, thx for the info I'll take a look 🙂

  				
Posted 3 years ago by CostlyOstrich36

this is another execution:

  				
Posted 3 years ago by WackyRabbit7

Saving part from task A:

pipeline = trials.trials[index]['result']['pipeline']
output_prefix = 'best_iter_' if i == 0 else 'iter_'
task.upload_artifact(name=output_prefix + str(index), artifact_object=pipeline)

  				
Posted 3 years ago by WackyRabbit7

Did not solve this

  				
Posted 3 years ago by WackyRabbit7

Version 1.1.1

Snippet of which part exactly?

Yeah, logs saying "file not found", here is an example

  				
Posted 3 years ago by WackyRabbit7

CostlyOstrich36

  				
Posted 3 years ago by WackyRabbit7

yep, just sec

  				
Posted 3 years ago by WackyRabbit7

task here is a ClearML task object

  				
Posted 3 years ago by WackyRabbit7

⬆ CostlyOstrich36

  				
Posted 3 years ago by WackyRabbit7

Loading part from task B:
`import pickle

import clearml


def get_models_from_task(task: clearml.Task, model_artifact_substring: str = 'iter_') -> dict:
    """
    Extract all models saved as artifacts with the specified substring

    :param task: Task to fetch from
    :param model_artifact_substring: Substring for recognizing models among artifacts
    :return: Mapping between iter number and model instance
    """

    # Extract models from task (models are named iter_XXX or best_iter_XXX, where XXX is the iteration number)
    model_paths = {int(k.split('_')[-1].split('.')[0]): v.get_local_copy() for k, v in task.artifacts.items() if
                   model_artifact_substring in k}

    # Read from pickle
    models = dict()
    for iter_num, pickle_path in model_paths.items():
        with open(pickle_path, 'rb') as f:
            models[iter_num] = pickle.load(f)

    return models`
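
As a stopgap while the root cause is unknown, the loading step could be made more defensive. The sketch below is not a fix and not something proposed in this thread: it simply asks for the local copy again if the returned path is missing. The helper name `load_pickled_artifact` and its `retries` and `delay_sec` parameters are hypothetical, and it assumes that a repeated `get_local_copy()` call fetches the file again when the cached copy is gone:

```python
import pickle
import time


def load_pickled_artifact(artifact, retries=3, delay_sec=1.0):
    """Fetch an artifact's local copy and unpickle it, retrying if the file is missing.

    `artifact` is assumed to expose get_local_copy(), as the values of
    task.artifacts do. Raises FileNotFoundError once all attempts fail.
    """
    last_path = None
    for attempt in range(retries):
        last_path = artifact.get_local_copy()
        if last_path is not None:
            try:
                with open(last_path, 'rb') as f:
                    return pickle.load(f)
            except FileNotFoundError:
                pass  # the path was returned but the file is not there (anymore)
        if attempt < retries - 1:
            time.sleep(delay_sec)
    raise FileNotFoundError(
        'No readable local copy after {} attempts, last path: {}'.format(retries, last_path)
    )
```

If a retry succeeds where the first attempt failed, that would suggest the local copy disappears between `get_local_copy()` and `open()`, which narrows down where to look.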

  				
Posted 3 years ago by WackyRabbit7

💪

  				
Posted 3 years ago by WackyRabbit7

If it works on two computers and only one computer is having problems, then I'd suspect an issue with that computer itself. Maybe permissions or network issues.

  				
Posted 3 years ago by CostlyOstrich36

WackyRabbit7 Hi!

Which version of clearml are you using? Also, can you give a snippet example?
Are you seeing anything in the logs regarding this?

  				
Posted 3 years ago by CostlyOstrich36

Another thing I just noticed: it happens on my personal computer, but when I execute the same pipeline from the exact same commit with the exact same data on another host, it works without these problems.

  				
Posted 3 years ago by WackyRabbit7

For example, this is one execution:

  				
Posted 3 years ago by WackyRabbit7

Please update me 🙂

  				
Posted 3 years ago by WackyRabbit7

I will

  				
Posted 3 years ago by WackyRabbit7

a third one?

  				
Posted 3 years ago by WackyRabbit7

I couldn't find any logic to which tasks fail and why... all the lines are exactly the same, only the parameters differ.

  				
Posted 3 years ago by WackyRabbit7

I double checked that the task object is the correct one

  				
Posted 3 years ago by WackyRabbit7

WackyRabbit7 , can you try on another computer?

  				
Posted 3 years ago by CostlyOstrich36

Moreover, in each pipeline I have 10 different settings of task A -> task B (and then task C), and in each run 1-2 of them fail randomly.

  				
Posted 3 years ago by WackyRabbit7

An example of the part where you save the files and the part where you load them. I'm assuming that all files are saved locally?

  				
Posted 3 years ago by CostlyOstrich36

"I am noticing that the files are saved locally, is there any chance that the files are over-written during the run or get deleted at some point and then replaced?"
Yes, they are local. I don't think there is a possibility they are getting overwritten, but that depends on how clearml names them. I showed you the code that saves the artifacts, but this code runs multiple times from a given template with different values: essentially it creates the same task about 10 times with different parameters. At the end, in the ClearML UI, I see 10 different executions. I want to believe that each of these 10 gets its own space on the disk; if that is true, they shouldn't be overwritten or deleted.

"Also, is there a reason the files are being saved locally and not at the fileserver?"
Our deployment is local, we execute and have the server on the same machine.
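
One way to test the "separate space on disk" assumption is to collect the local paths that each task's artifacts resolve to and look for paths shared by more than one task. This is only a diagnostic sketch: the helper name `find_shared_cache_paths` is hypothetical, and it assumes each task object exposes `id` and an `artifacts` mapping whose values have `get_local_copy()`:

```python
from collections import defaultdict


def find_shared_cache_paths(tasks, name_substring='iter_'):
    """Return {local_path: [task ids]} for paths that more than one task resolves to.

    Only artifacts whose name contains `name_substring` are considered.
    An empty result means no two tasks were handed the same local file.
    """
    owners = defaultdict(set)
    for task in tasks:
        for name, artifact in task.artifacts.items():
            if name_substring not in name:
                continue
            path = artifact.get_local_copy()
            if path is not None:
                owners[path].add(task.id)
    return {path: sorted(ids) for path, ids in owners.items() if len(ids) > 1}
```

A non-empty result would mean that several of the 10 executions read from the same local file, which would make overwriting or deletion by a sibling task plausible.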

  				
Posted 3 years ago by WackyRabbit7

WackyRabbit7 Hey, sorry for the delay 🙂 Hopefully I'll have an answer in a couple of hours

  				
Posted 3 years ago by CostlyOstrich36

2021-10-11 10:07:19 ClearML results page:
2021-10-11 10:07:20 Traceback (most recent call last):
  File "tasks/hpo_n_best_evaluation.py", line 256, in <module>
    main(args, task)
  File "tasks/hpo_n_best_evaluation.py", line 164, in main
    trained_models = get_models_from_task(task=hpo_task)
  File "tasks/hpo_n_best_evaluation.py", line 72, in get_models_from_task
    with open(pickle_path, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/elior/.clearml/cache/storage_manager/global/636eedc960be00b7c2ec61657ae6c200.best_iter_184.pkl'

  				
Posted 3 years ago by WackyRabbit7

Thanks

  				
Posted 3 years ago by WackyRabbit7
