If it works on two computers and only one computer is having problems, then I'd suspect an issue with that computer itself. Maybe permissions or network issues.
WackyRabbit7, I am noticing that the files are saved locally. Is there any chance that the files are overwritten during the run, or get deleted at some point and then replaced?
Also, is there a reason the files are being saved locally and not at the fileserver?
I couldn't manage to reproduce it on my end. But in my case it always saves the files to the fileserver, so I'm curious what's making it save locally in your case.
I am noticing that the files are saved locally, is there any chance that the files are over-written during the run or get deleted at some point and then replaced?
Yes they are local - I don't think there is a possibility they are getting overwritten... But that depends on how clearml names them. I showed you the code that saves the artifacts, but this code runs multiple times from a given template with different values - essentially it creates like 10 times the same task with different parameters... At the end, in the ClearML UI I see 10 different executions. I want to believe that for each one of these 10, there is a different space on the disk - if that is true, they shouldn't be overwritten or deleted
Also, is there a reason the files are being saved locally and not at the fileserver?
Our deployment is local, we execute and have the server on the same machine
An example of the part where you save the files and the part where you load them. I'm assuming that all files are saved locally?
WackyRabbit7 Hi!
Which version of clearml are you using? Also, can you give a snippet example?
Are you seeing anything in the logs regarding this?
2021-10-11 10:07:19 ClearML results page:
2021-10-11 10:07:20 Traceback (most recent call last):
  File "tasks/hpo_n_best_evaluation.py", line 256, in <module>
    main(args, task)
  File "tasks/hpo_n_best_evaluation.py", line 164, in main
    trained_models = get_models_from_task(task=hpo_task)
  File "tasks/hpo_n_best_evaluation.py", line 72, in get_models_from_task
    with open(pickle_path, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/elior/.clearml/cache/storage_manager/global/636eedc960be00b7c2ec61657ae6c200.best_iter_184.pkl'
I double-checked that the `task` object is the correct one
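A minimal diagnostic sketch for the `FileNotFoundError` above, using only the standard library. It reports whether the cached path exists and which other files in the same cache directory end with the same artifact file name. The function name `describe_missing_cache_file` is hypothetical, and the assumption that cached names look like `<prefix>.<artifact file name>` is taken only from the path shown in the traceback.

```python
import os


def describe_missing_cache_file(path):
    """Collect basic facts about a cached file path that failed to open.

    Hypothetical helper, not part of ClearML.
    """
    directory = os.path.dirname(path)
    filename = os.path.basename(path)
    # Assumption: the name is "<prefix>.<artifact file name>", as in the traceback.
    suffix = filename.split('.', 1)[-1]

    report = {
        'exists': os.path.exists(path),
        'dir_exists': os.path.isdir(directory),
        'file_count': 0,
        'same_suffix': [],
    }
    if report['dir_exists']:
        entries = sorted(os.listdir(directory))
        report['file_count'] = len(entries)
        # Other cached files that carry the same artifact file name.
        report['same_suffix'] = [e for e in entries if e.endswith(suffix)]
    return report
```

Calling it from an `except FileNotFoundError` block right after the failing `open(...)` and logging the result would show whether the cache directory still holds other copies of the same artifact name at the moment of failure.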
Loading part from task B:
` def get_models_from_task(task: clearml.Task, model_artifact_substring: str = 'iter_') -> dict:
    """
    Extract all models saved as artifacts with the specified substring
    :param task: Task to fetch from
    :param model_artifact_substring: Substring for recognizing models among artifacts
    :return: Mapping between iter number and model instance
    """
    # Extract models from task (artifacts are named iter_XXX or best_iter_XXX, where XXX is the iteration number)
    model_paths = {int(k.split('_')[-1].split('.')[0]): v.get_local_copy() for k, v in task.artifacts.items() if
                   model_artifact_substring in k}
    # Read from pickle
    models = dict()
    for iter_num, pickle_path in model_paths.items():
        with open(pickle_path, 'rb') as f:
            models[iter_num] = pickle.load(f)
    return models `
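One possible variation on the loader above, as a sketch only: instead of collecting every local path first and opening the files afterwards, fetch and unpickle each artifact in the same step, and fetch again if the file has disappeared in between. This does not explain why the file goes missing; it only narrows the window between fetching and reading. The function name `load_models_eagerly` and the `fetchers` mapping are hypothetical; with ClearML, each fetcher could be the artifact's `get_local_copy` method used in the snippet above.

```python
import pickle
import time


def load_models_eagerly(fetchers, retries=2, delay=0.5):
    """Load pickled models one by one, re-fetching if the local file is missing.

    fetchers: mapping of iteration number -> zero-argument callable that
    returns a local file path (hypothetical stand-in for get_local_copy).
    """
    models = {}
    for iter_num, fetch in fetchers.items():
        last_error = None
        for _attempt in range(retries + 1):
            path = fetch()  # ask for a fresh local copy on every attempt
            try:
                with open(path, 'rb') as f:
                    models[iter_num] = pickle.load(f)
                break  # loaded successfully, move on to the next artifact
            except FileNotFoundError as err:
                last_error = err
                time.sleep(delay)
        else:
            # Every attempt failed: surface which iteration could not be read.
            raise FileNotFoundError(
                f"Could not read model for iteration {iter_num} "
                f"after {retries + 1} attempts"
            ) from last_error
    return models
```

Inside `get_models_from_task`, the mapping could be built with the same dictionary comprehension as before, storing `v.get_local_copy` (without calling it) as the value.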
Couldn't find any logic on which tasks fail and why... all the lines are exactly the same, only different parameters
WackyRabbit7 aight, thx for the info I'll take a look 🙂
Saving part from task A:
pipeline = trials.trials[index]['result']['pipeline']
output_prefix = 'best_iter_' if i == 0 else 'iter_'
task.upload_artifact(name=output_prefix + str(index), artifact_object=pipeline)
Any news on this? This is kind of creepy. It's something so basic, and I can't trust my prediction pipeline because sometimes it fails randomly for no apparent reason
Moreover, in each pipeline I have 10 different settings of task A -> task B (and then task C), and in each run 1-2 of them fail randomly
WackyRabbit7 can you try uploading the artifact with `wait_on_upload=True`?
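A minimal sketch of what that suggestion could look like when applied to the saving snippet from task A. The wrapper name `upload_pipeline_artifact` and the `is_best` flag (standing in for the `i == 0` check) are hypothetical; `task` is assumed to be any object exposing `upload_artifact`, such as a `clearml.Task`.

```python
def upload_pipeline_artifact(task, index, pipeline, is_best=False):
    """Upload one pipeline as an artifact and block until the upload finishes.

    Hypothetical wrapper around the saving snippet from task A.
    """
    output_prefix = 'best_iter_' if is_best else 'iter_'
    name = output_prefix + str(index)
    # wait_on_upload=True makes the call return only after the upload is done,
    # instead of letting it complete in the background.
    task.upload_artifact(
        name=name,
        artifact_object=pipeline,
        wait_on_upload=True,
    )
    return name
```

The idea being tested is whether task B sometimes starts reading before task A's background upload has finished; a blocking upload would rule that out.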
Another thing I noticed just now: it happens on my personal computer, but when I execute the same pipeline from the exact same commit with the exact same data on another host, it works without these problems
WackyRabbit7 , can you try on another computer?
WackyRabbit7 Hey, sorry for the delay 🙂 Hopefully I'll have an answer in a couple of hours
Version 1.1.1
Snippet of which part exactly?
Yeah, logs saying "file not found", here is an example