My use case is that the code, which uses PyTorch, saves additional info such as the state dict when saving the model. I'd like to save that information as an artifact as well so that I can load it later.
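A minimal sketch of that use case, assuming the checkpoint (state dict and any extra info) has already been written to disk, for example with torch.save, and that `task` is the object returned by clearml's `Task.init` or `Task.current_task`. The helper name `upload_checkpoint` and the artifact name "checkpoint" are hypothetical.

```python
from pathlib import Path


def upload_checkpoint(task, checkpoint_path, name="checkpoint"):
    """Upload an existing checkpoint file as a named artifact of `task`.

    `task` is assumed to expose ClearML's `upload_artifact(name, artifact_object)`.
    """
    path = Path(checkpoint_path)
    if not path.is_file():
        # Fail early instead of registering an artifact that points nowhere.
        raise FileNotFoundError(f"checkpoint not found: {path}")
    return task.upload_artifact(name=name, artifact_object=str(path))


# Usage (needs a configured ClearML setup, so it is left commented out):
# from clearml import Task
# task = Task.current_task()
# upload_checkpoint(task, "runs/train/exp/weights/last.pt")  # hypothetical path
#
# Later, from another script:
# local_file = Task.get_task(task_id="<task id>").artifacts["checkpoint"].get_local_copy()
```

Passing the task in as a parameter keeps the helper independent of how the task was created.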
I actually just asked about this in another thread. Here's the link. Asking about the usage of the upload_artifact
I just made a custom repo from the ultralytics yolov5 repo, where I get data and model using data id and model id.
This is the original repo which I've slightly modified.
You're suggesting that the false is considered a string and not a bool?
The clearml-server always stores the values as strings (serializing them); the casting is done when they are passed back to the code at runtime. The issue here is that there is actually no "way" to tell the argparser this is a boolean (basically any value that is passed is treated as a string). What I think we should do is fix the casting function so that if the stored value is exactly the same as the default, we use the default value (i.e. a boolean). Does that make sense to you?
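To make the proposed fix concrete, here is a rough sketch of the idea only, not ClearML's actual casting code. The function name `restore_value` is hypothetical; it compares the stored string with the string form of the argparse default and, on an exact match, hands back the default itself so its original type survives.

```python
def restore_value(stored, default):
    """Return `default` when `stored` is just its string form, else `stored`.

    Hypothetical illustration of the fix described above; values that do not
    match the default exactly (for example a checkpoint path) stay strings.
    """
    if isinstance(stored, str) and stored == str(default):
        return default
    return stored


print(repr(restore_value("False", False)))         # the bool default is restored
print(repr(restore_value("runs/last.pt", False)))  # a path is left untouched
```

Note that under this rule the string "True" with a default of False would still come back as a string, so a real fix may need more than an exact-match check.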
Okay let me see if I can think of something...
Basically crashing on the assertion here ?
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L495
Could it be that you are passing "Args/resume" True, but not specifying the checkpoint?
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L452
I think I know what's going on:
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L452
does not specify a type: specifically it can take boolean or string pointing to a Path.
What's going on is that it gets "False" as a string from the cloned Task, then the automagic does not cast it to boolean because the definition of the arg is nargs="?" with no type, so it leaves it as a string. The code then checks whether this is a string, treats it as a checkpoint path, and the assertion fails.
Does that make sense ?
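The explanation above can be reproduced with plain argparse. This is a sketch under assumptions: the flag is declared with nargs="?", const=True, default=False and no type, as described in this thread, and the cloned Task is simulated by passing the text "False" on the command line. `looks_like_checkpoint_path` is a hypothetical stand-in for the check in train.py, not the actual YOLOv5 code.

```python
import argparse

parser = argparse.ArgumentParser()
# Same shape as the flag discussed above: optional value, no `type=`.
parser.add_argument("--resume", nargs="?", const=True, default=False)

# Fresh run, flag not given: the Python default (a real bool) is used.
fresh = parser.parse_args([])

# Simulated cloned run: the stored value comes back as the text "False".
cloned = parser.parse_args(["--resume", "False"])


def looks_like_checkpoint_path(value):
    """Hypothetical stand-in for a check that treats any string as a path."""
    return isinstance(value, str)


print(type(fresh.resume).__name__, looks_like_checkpoint_path(fresh.resume))
print(type(cloned.resume).__name__, looks_like_checkpoint_path(cloned.resume))
```

The first run carries a bool and skips the path branch; the second carries the string "False", which is also truthy, so any code that goes on to look for that "file" will not find it.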
I think I understand. Still, I've possibly pinned down the issue to something else. I'm not sure if I'll be able to fix it on my own though.
Now when I try to clone it, and run it on an agent, it fails to install the requirements.
Can you share the log? what fails exactly?
The situation is such that I needed a continuous training pipeline to train a detector, the detector being Ultralytics Yolo V5.
To me, it made sense that I would have a training task. The whole training code seemed complex to me, so I modified it only slightly so that it gets the dataset and model from ClearML. Nothing more.
I then created a task using clearml-task and pointed it towards the repo I had created. The task runs fine.
I am unsure of the details of the training code itself, as I'm not very well versed in PyTorch. I'm also troubled by the fact that it always runs fine when I create a task from the repo using clearml-task; however, when I clone or reset said task after completion and then enqueue it again, I get the above error.
It runs into the above error when I clone the task or reset it.
from here:
AssertionError: ERROR: --resume checkpoint does not exist
I assume the "internal" code state changed, and now it is looking for a file that does not exist. How would your code state change? In other words, why would it be looking for the file only when cloning? Could it be that you put the state on the Task, then you clone it (i.e. clone the exact same dict), and now the newly cloned Task "thinks" it is resuming?!
Anyway, in the resume argument there is a default=False but also const=True. What's up with that? Or is const a separate parameter?
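On the const versus default question: in argparse they are separate parameters. With nargs="?", default is used when the flag is absent, const is used when the flag is given without a value, and an explicit value is taken as-is. A small sketch, assuming the flag is declared as described in this thread; `parse_resume` and the checkpoint path are hypothetical.

```python
import argparse


def parse_resume(argv):
    """Parse `argv` with a --resume flag shaped like the one discussed above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--resume", nargs="?", const=True, default=False)
    return parser.parse_args(argv).resume


print(repr(parse_resume([])))                                # flag absent: default
print(repr(parse_resume(["--resume"])))                      # bare flag: const
print(repr(parse_resume(["--resume", "runs/exp/last.pt"])))  # explicit value: a string
```

Because no type is given, the explicit value always arrives as a string, whatever text it contains.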
However cloning it uses it from the clearml args, which somehow converts it to string?
I download the dataset and model and load them before training again.
I've basically just added dataset id and model id parameters in the args.
From what I recall, I think resume was set to false, originally and in the cloned task.
You're suggesting that the false is considered a string and not a bool? Am I understanding it correctly? Also, in that case, wouldn't this problem also occur when I originally create the task using clearml-task?
Or am I not understanding it clearly.
When I pass the repo to clearml-task with the parameters, it runs fine and finishes. But when I clone the task and attempt it again, I get the above assert error, and I don't know why.
however when I clone or reset said task after completion and then enqueue it again, I get the above error.
This part is somewhat confusing... There is no magic happening behind the scenes, cloning a Task and creating it, is basically the same ... Do you have a reference to the YOLOv5 code base itself, maybe I can figure out what's the issue?
up to date with https://fawad_nizamani@bitbucket.org/fawad_nizamani/custom_yolov5 ✅
Traceback (most recent call last):
File "train.py", line 682, in <module>
main(opt)
File "train.py", line 525, in main
assert os.path.isfile(ckpt), 'ERROR: --resume checkpoint does not exist'
AssertionError: ERROR: --resume checkpoint does not exist
Another issue I'm having is that I ran a task using clearml-task from a repo. It runs fine; however, when I clone said task and run it on the same queue again, it throws an error from the code. I can't seem to figure out why it's happening.
Oh oh oh. Wait a second. I think I get what you're saying. When I originally create the task with clearml-task, I'm not passing the argument myself, so it just uses the value False.
So when I run the task using clearml-task --repo and create a task, it runs fine. It runs into the above error when I clone the task or reset it.
I shared the error above. I'm simply trying to make the yolov5 by ultralytics part of my pipeline.
For that, I basically forked it for myself and made it accept ClearML dataset and model IDs to use.
Anyway, in the docs, there is a function called task.register_artifact()
Yes, this is rather deprecated... The idea is that it will monitor an object and auto sync it (i.e. serialize and upload).
That said, it is just so much easier to do task.upload_artifact,
and you can always update/overwrite it by passing the same name, so I cannot see the actual use case. Does that make sense? What are you using it for?
I'm dumping a dict to JSON. How can I register that dict as an artifact?
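One way this could look, as a hedged sketch rather than the recommended recipe: task.upload_artifact accepts a dict directly as artifact_object, or the path of a JSON file you wrote yourself. The helper `upload_dict_artifact` and the artifact name are hypothetical, and `task` is assumed to be a ClearML Task obtained elsewhere.

```python
import json
from pathlib import Path


def upload_dict_artifact(task, name, data, json_path=None):
    """Upload `data` as an artifact, either directly or via a JSON file.

    `task` is assumed to expose ClearML's `upload_artifact(name, artifact_object)`.
    """
    if json_path is None:
        # Hand the dict over as-is and let the task serialize it.
        return task.upload_artifact(name=name, artifact_object=data)
    # Otherwise dump it ourselves and upload the resulting file.
    path = Path(json_path)
    path.write_text(json.dumps(data, indent=2))
    return task.upload_artifact(name=name, artifact_object=str(path))


# Usage (needs a configured ClearML setup, so it is left commented out):
# from clearml import Task
# task = Task.current_task()
# upload_dict_artifact(task, "run_info", {"epochs": 3})
# Uploading again under the same name overwrites the previous artifact.
```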