
Using PipelineController, I have a task B that depends on task A, i.e. A -> B. Task B reads artifacts that were saved by task A. I defined the dependency between them using the parents parameter.

But task B fails with a "file not found" error when trying to load something task A saved. I double-checked and the file does exist; loading it myself works...
This happens randomly: some executions work, some don't. Any idea what is going on?

  
  
Posted 2 years ago

Answers 30


💪

  
  
Posted 2 years ago

WackyRabbit7, I am noticing that the files are saved locally. Is there any chance that the files are overwritten during the run, or get deleted at some point and then replaced?

Also, is there a reason the files are being saved locally and not at the fileserver?

I couldn't manage to reproduce it on my end. But in my case it always saves the files to the fileserver, so I'm curious what's making it save locally in your case.

  
  
Posted 2 years ago

WackyRabbit7 Hey, sorry for the delay 🙂 Hopefully I'll have an answer in a couple of hours

  
  
Posted 2 years ago

Another thing I noticed: right now it happens on my personal computer. When I execute the same pipeline from the exact same commit with the exact same data on another host, it works without these problems.

  
  
Posted 2 years ago

WackyRabbit7, can you try on another computer?

  
  
Posted 2 years ago

Did not solve this

  
  
Posted 2 years ago

WackyRabbit7, can you try uploading the artifact with wait_on_upload=True?
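A minimal sketch of that suggestion, assuming `task` is the ClearML task object of task A and that the installed clearml version accepts `wait_on_upload` on `Task.upload_artifact`. The helper name `upload_model_blocking` is hypothetical and only here for illustration.

```python
def upload_model_blocking(task, name, model):
    """Upload `model` as an artifact and return only once the upload is done.

    `task` is expected to be a clearml.Task; any object exposing the same
    `upload_artifact` method works, which keeps this sketch easy to test.
    """
    # wait_on_upload=True asks ClearML to complete the upload before
    # returning, instead of leaving it to a background thread.
    ok = task.upload_artifact(
        name=name,
        artifact_object=model,
        wait_on_upload=True,
    )
    if not ok:
        # Hypothetical policy: fail loudly rather than let task B discover it later.
        raise RuntimeError(f"artifact {name!r} was not uploaded")
    return ok
```

In task A this would replace the direct `task.upload_artifact(...)` call, for example `upload_model_blocking(task, output_prefix + str(index), pipeline)`.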

  
  
Posted 2 years ago

CostlyOstrich36

  
  
Posted 2 years ago

I will

  
  
Posted 2 years ago

a third one?

  
  
Posted 2 years ago

⬆ CostlyOstrich36

  
  
Posted 2 years ago

Loading part from task B:
```python
import pickle

import clearml


def get_models_from_task(task: clearml.Task, model_artifact_substring: str = 'iter_') -> dict:
    """
    Extract all models saved as artifacts with the specified substring

    :param task: Task to fetch from
    :param model_artifact_substring: Substring for recognizing models among artifacts
    :return: Mapping between iter number and model instance
    """

    # Download every matching artifact (names end in _XXX, where XXX is the iteration number)
    model_paths = {int(k.split('_')[-1].split('.')[0]): v.get_local_copy() for k, v in task.artifacts.items() if
                   model_artifact_substring in k}

    # Read from pickle
    models = dict()
    for iter_num, pickle_path in model_paths.items():
        with open(pickle_path, 'rb') as f:
            models[iter_num] = pickle.load(f)

    return models
```
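The loader above downloads all artifacts first and only opens the files afterwards, so a cached copy has time to disappear in between. A hedged sketch of a more defensive variant follows: it unpickles each artifact right after fetching it and asks for the copy again if the file is gone. The helper name `load_pickled_artifact` and the retry policy are assumptions for illustration, not ClearML behaviour, and this does not explain why the file goes missing.

```python
import pickle
import time


def load_pickled_artifact(artifact, retries=3, delay_sec=1.0):
    """Fetch one artifact and unpickle it immediately.

    `artifact` is expected to expose `get_local_copy()`, like the values
    of `task.artifacts`.
    """
    last_path = None
    for attempt in range(retries):
        last_path = artifact.get_local_copy()
        if last_path:
            try:
                with open(last_path, 'rb') as f:
                    return pickle.load(f)
            except FileNotFoundError:
                pass  # the local copy vanished; request it again
        if attempt < retries - 1:
            time.sleep(delay_sec)
    raise FileNotFoundError(
        f"artifact copy missing after {retries} attempts: {last_path}"
    )
```

Inside `get_models_from_task` this could be used as `models[iter_num] = load_pickled_artifact(v)` while iterating over `task.artifacts.items()`.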
  
  
Posted 2 years ago

Saving part from task A:

pipeline = trials.trials[index]['result']['pipeline']
output_prefix = 'best_iter_' if i == 0 else 'iter_'
task.upload_artifact(name=output_prefix + str(index), artifact_object=pipeline)

  
  
Posted 2 years ago

yep, just sec

  
  
Posted 2 years ago

Couldn't find any logic to which tasks fail and why... all the lines are exactly the same, only with different parameters

  
  
Posted 2 years ago

Any news on this? This is kind of creepy. It's something so basic, and I can't trust my prediction pipeline because sometimes it fails randomly for no reason

  
  
Posted 2 years ago

Please update me 🙂

  
  
Posted 2 years ago

If it works on two computers and one computer is having problems, then I'd suspect some issue with the computer itself. Maybe permissions or network issues

  
  
Posted 2 years ago

this is another execution:

  
  
Posted 2 years ago

WackyRabbit7 Hi!

Which version of clearml are you using? Also, can you give a snippet example?
Are you seeing anything in the logs regarding this?

  
  
Posted 2 years ago

Version 1.1.1

Snippet of which part exactly?

Yeah, logs saying "file not found", here is an example

  
  
Posted 2 years ago

Moreover, in each pipeline I have 10 different settings of task A -> task B (and then task C); in each run, 1-2 of them fail randomly

  
  
Posted 2 years ago

task here is a ClearML task object

  
  
Posted 2 years ago

For example, this is one execution:

  
  
Posted 2 years ago

An example of the part where you save the files and the part where you load them. I'm assuming that all files are saved locally?

  
  
Posted 2 years ago

Thanks

  
  
Posted 2 years ago

2021-10-11 10:07:19 ClearML results page:
2021-10-11 10:07:20 Traceback (most recent call last):
  File "tasks/hpo_n_best_evaluation.py", line 256, in <module>
    main(args, task)
  File "tasks/hpo_n_best_evaluation.py", line 164, in main
    trained_models = get_models_from_task(task=hpo_task)
  File "tasks/hpo_n_best_evaluation.py", line 72, in get_models_from_task
    with open(pickle_path, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/elior/.clearml/cache/storage_manager/global/636eedc960be00b7c2ec61657ae6c200.best_iter_184.pkl'

  
  
Posted 2 years ago

I double checked that the task object is the correct one

  
  
Posted 2 years ago

WackyRabbit7 aight, thx for the info I'll take a look 🙂

  
  
Posted 2 years ago

"I am noticing that the files are saved locally, is there any chance that the files are over-written during the run or get deleted at some point and then replaced?"
Yes, they are local. I don't think there is a possibility they are getting overwritten... but that depends on how clearml names them. I showed you the code that saves the artifacts, but this code runs multiple times from a given template with different values; essentially it creates the same task about 10 times with different parameters. In the end, I see 10 different executions in the ClearML UI. I want to believe that each of these 10 gets a different space on the disk; if that is true, they shouldn't be overwritten or deleted.

"Also, is there a reason the files are being saved locally and not at the fileserver?"
Our deployment is local: we execute and have the server on the same machine.
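The path in the traceback sits under `.clearml/cache/storage_manager/global`, which suggests that downloaded copies land in a cache directory shared by all executions on that machine rather than in a per-task space. If that is the case, one possible workaround is to copy each downloaded file into a directory owned by the current execution before using it. This is a sketch under that assumption; `private_copy` is a hypothetical helper, and a small window between download and copy remains.

```python
import os
import shutil
import tempfile


def private_copy(cached_path, workdir=None):
    """Copy a downloaded artifact out of the shared cache; return the new path."""
    # A fresh directory per call, so parallel executions never share a file.
    target_dir = tempfile.mkdtemp(prefix='artifacts_', dir=workdir)
    target = os.path.join(target_dir, os.path.basename(cached_path))
    shutil.copy2(cached_path, target)
    return target
```

It would be called right after the download, e.g. `path = private_copy(v.get_local_copy())`, so later reads no longer depend on the cached copy staying in place.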

  
  
Posted 2 years ago