Answered
Task fails to install requirements when cloned and run on an agent

Hey guys, sorry for the rapid-fire questions in the past few days. I have another issue though. I initially ran a task directly from a repo. It successfully installed the requirements from the requirements file in the repo and ran the task without any issue. Now when I try to clone it and run it on an agent, it fails to install the requirements.

  
  
Posted 2 years ago

Answers 29


Another issue I'm having: I ran a task using clearml-task and did it using a repo. It runs fine; when I clone said task and run it on the same queue again, however, it throws an error from the code. I can't seem to figure out why it's happening.

  
  
Posted 2 years ago

up to date with https://fawad_nizamani@bitbucket.org/fawad_nizamani/custom_yolov5
Traceback (most recent call last):
File "train.py", line 682, in <module>
main(opt)
File "train.py", line 525, in main
assert os.path.isfile(ckpt), 'ERROR: --resume checkpoint does not exist'
AssertionError: ERROR: --resume checkpoint does not exist

  
  
Posted 2 years ago

I just made a custom repo from the ultralytics yolov5 repo, where I get data and model using data id and model id.

  
  
Posted 2 years ago

When I pass the repo in clearml-task with the parameters, it runs fine and finishes. Basically, when I clone and attempt the task again, I get the above assert error, and I don't know why.

  
  
Posted 2 years ago

Now when I try to clone it, and run it on an agent, it fails to install the requirements.

Can you share the log? What fails exactly?

  
  
Posted 2 years ago

I shared the error above. I'm simply trying to make YOLOv5 by Ultralytics part of my pipeline.

  
  
Posted 2 years ago

I basically forked it for myself and made it accept ClearML dataset and model IDs.

  
  
Posted 2 years ago

So when I run the task using clearml-task --repo and create a task, it runs fine. It runs into the above error when I clone the task or reset it.

  
  
Posted 2 years ago

It runs into the above error when I clone the task or reset it.

from here:

AssertionError: ERROR: --resume checkpoint does not exist

I assume the "internal" code state changed, and now it is looking for a file that does not exist, how would your code state change, in other words why would it be looking for the file only when cloning? could it be you put the state on the Task, then you clone it (i.e. clone the exact same dict, and now the newly cloned Task "thinks" it resuming ?!)

  
  
Posted 2 years ago

The situation is such that I needed a continuous training pipeline to train a detector, the detector being Ultralytics Yolo V5.

To me, it made sense that I would have a training task. The whole training code seemed complex to me, so I just modified it a bit to fit my needs of getting the dataset and model from ClearML. Nothing more.

I then created a task using clearml-task and pointed it towards the repo I had created. The task runs fine.

I am unsure about the details of the training code itself, as I'm not very well versed in PyTorch. I'm also troubled by the fact that it always runs fine when I create a task from the repo using clearml-task; however, when I clone or reset said task after completion and then enqueue it again, I get the above error.

  
  
Posted 2 years ago

however when I clone or reset said task after completion and then enqueue it again, I get the above error.

This part is somewhat confusing... There is no magic happening behind the scenes; cloning a Task and creating it are basically the same. Do you have a reference to the YOLOv5 code base itself? Maybe I can figure out what the issue is.
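
To illustrate that there is no magic involved, here is a minimal sketch of cloning and enqueuing programmatically (project, task, and queue names are hypothetical):

from clearml import Task

# fetch the original task, clone it, and push the clone onto an execution queue
template = Task.get_task(project_name='examples', task_name='yolov5 training')  # hypothetical names
cloned = Task.clone(source_task=template, name='yolov5 training (clone)')
Task.enqueue(cloned, queue_name='default')  # the same thing the UI Clone + Enqueue buttons do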

  
  
Posted 2 years ago

This is the original repo which I've slightly modified.

  
  
Posted 2 years ago

I've basically just added dataset id and model id parameters in the args.

  
  
Posted 2 years ago

I download the dataset and model and load them before training again.
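
For context, a minimal sketch of what that fetch could look like (the IDs are placeholders, and the exact model call is my assumption based on the ClearML SDK):

from clearml import Dataset, InputModel

# hypothetical IDs passed in through the added --dataset-id / --model-id arguments
dataset_path = Dataset.get(dataset_id='<dataset id>').get_local_copy()  # local folder with the data
weights_path = InputModel(model_id='<model id>').get_local_copy()       # local copy of the model weights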

  
  
Posted 2 years ago

Okay, let me see if I can think of something...
Basically crashing on the assertion here?
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L495
Could it be you are passing "Args/resume" True, but not specifying the checkpoint?
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L452
I think I know what's going on:
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L452
does not specify a type; specifically, it can take either a boolean or a string pointing to a path.
What's going on is that the cloned Task passes "False" back as a string, and the automagic does not cast it to a boolean because the argument is defined with nargs="?" and no type, so it is left as a string. The code then sees a truthy value, treats the string as a checkpoint path, and the assertion fails because "False" is not a file.
Does that make sense?
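
For illustration, a minimal standalone sketch (not the actual ClearML plumbing; the injected value is simulated by passing it on the command line) of how a stored "False" slips through an argument defined with nargs='?' and no type:

import argparse
import os

parser = argparse.ArgumentParser()
# same style of definition as YOLOv5's --resume: nargs='?', const=True, default=False, and no type
parser.add_argument('--resume', nargs='?', const=True, default=False,
                    help='resume most recent training')

# a cloned task hands the stored value back as text, so it arrives as the string "False"
opt = parser.parse_args(['--resume', 'False'])
print(type(opt.resume), opt.resume)   # <class 'str'> False

# simplified version of the resume check in train.py
if opt.resume:                        # a non-empty string is truthy
    ckpt = opt.resume if isinstance(opt.resume, str) else 'runs/train/exp/weights/last.pt'
    # raises AssertionError: ERROR: --resume checkpoint does not exist, since "False" is not a file
    assert os.path.isfile(ckpt), 'ERROR: --resume checkpoint does not exist'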

  
  
Posted 2 years ago

From what I recall, I think resume was set to False, both originally and in the cloned task.

  
  
Posted 2 years ago

You're suggesting that the False is considered a string and not a bool? Am I understanding it correctly? Also, in that case, wouldn't this problem also occur when I originally create the task using clearml-task?

Or am I not understanding it clearly?

  
  
Posted 2 years ago

Anyway, in the resume argument there is default=False but also const=True. What's up with that? Or is const a separate parameter?
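
As a side note, a quick sketch of how argparse treats const versus default when nargs='?' is used:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--resume', nargs='?', const=True, default=False)

print(parser.parse_args([]).resume)                       # False    (flag absent -> default is used)
print(parser.parse_args(['--resume']).resume)             # True     (flag given without a value -> const is used)
print(parser.parse_args(['--resume', 'last.pt']).resume)  # last.pt  (flag given with a value -> that value)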

  
  
Posted 2 years ago

Oh oh oh. Wait a second. I think I get what you're saying. When I originally create the task with clearml-task, since I'm not passing the argument myself, it just uses the default value False.

  
  
Posted 2 years ago

However, cloning it takes the value from the ClearML args, which somehow converts it to a string?

  
  
Posted 2 years ago

You're suggesting that the False is considered a string and not a bool?

The clearml-server always stores the values as strings (serializing them); the casting back happens when the values are passed to the code at runtime. The issue here is that there is actually no way to tell the argparser that this is a boolean (basically any value that is passed will be treated as a string). What I think we should do is fix the casting function so that if the value is exactly the same as the default, we use the default value (i.e. the boolean). Does that make sense to you?
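
In the meantime, one possible workaround (a sketch, not something from this thread, and it assumes the automagic applies the declared type when feeding values back, as implied above) is to give the argument an explicit type so that a returned "False"/"True" string becomes a real boolean again:

import argparse

def str2bool_or_path(value):
    # turn the strings a cloned task may send back ("True"/"False") into real booleans,
    # while still allowing a checkpoint path to be passed through as-is
    if isinstance(value, bool):
        return value
    if value.lower() in ('true', '1', 'yes'):
        return True
    if value.lower() in ('false', '0', 'no', ''):
        return False
    return value  # anything else is treated as a checkpoint path

parser = argparse.ArgumentParser()
parser.add_argument('--resume', nargs='?', const=True, default=False, type=str2bool_or_path)

print(parser.parse_args(['--resume', 'False']).resume)  # False (a real bool, so the resume branch is skipped)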

  
  
Posted 2 years ago

I think I understand. Still, I've possibly pinned down the issue to something else. I'm not sure if I'll be able to fix it on my own though.

  
  
Posted 2 years ago

Anyway, in the docs, there is a function called task.register_artifact()

  
  
Posted 2 years ago

Takes in a name and an artifact object.

  
  
Posted 2 years ago

I'm dumping a dict to JSON; how can I register that dict as an artifact?

  
  
Posted 2 years ago

Anyway, in the docs, there is a function called task.register_artifact()

Yes, this is rather deprecated... The idea is that it will monitor an object and auto-sync it (i.e. serialize and upload it).
That said, it is just so much easier to do task.upload_artifact, and you can always update/overwrite if you pass the same name, so I cannot see the actual use case. Does that make sense? What are you using it for?
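
For the dict-to-JSON case above, a minimal sketch (project, task, and artifact names are hypothetical): upload_artifact accepts the dict directly and serializes it for you:

from clearml import Task

task = Task.init(project_name='examples', task_name='artifact demo')  # hypothetical names

results = {'epoch': 10, 'mAP': 0.62}  # the dict you would otherwise dump to JSON yourself
task.upload_artifact(name='results', artifact_object=results)  # stored and serialized automatically

# calling upload_artifact again with name='results' updates/overwrites the stored artifact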

  
  
Posted 2 years ago

https://clearml.slack.com/archives/CTK20V944/p1641282791308500?thread_ts=1640767505.214000&cid=CTK20V944

I actually just asked about this in another thread; here's the link. It's about the usage of upload_artifact.

  
  
Posted 2 years ago

My use case is that the PyTorch code saves additional info, like the state dict, when saving the model. I'd like to save that information as an artifact as well, so that I can load it later.
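
For that use case, a sketch of saving the extra checkpoint info as an artifact and reading it back from the finished task later (the task ID and dict contents are placeholders):

from clearml import Task

task = Task.init(project_name='examples', task_name='train detector')  # hypothetical names

extra_info = {'epoch': 99, 'best_fitness': 0.71}  # whatever the checkpoint holds besides the weights
task.upload_artifact(name='checkpoint_info', artifact_object=extra_info)

# later, from another script or task:
prev_task = Task.get_task(task_id='<training task id>')    # placeholder ID
restored = prev_task.artifacts['checkpoint_info'].get()    # deserialized back into a dict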

  
  
Posted 2 years ago