This is the original repo which I've slightly modified.
I download the dataset and model, and load them before training again.
Now when I try to clone it and run it on an agent, it fails to install the requirements.
Can you share the log? What fails exactly?
You're suggesting that False is treated as a string and not a bool? Am I understanding that correctly? Also, in that case, wouldn't this problem also occur when I originally create the task using clearml-task?
Or am I not understanding it correctly?
I shared the error above. I'm simply trying to make the Ultralytics YOLOv5 repo part of my pipeline.
Okay let me see if I can think of something...
Basically crashing on the assertion here?
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L495
Could it be you are passing "Args/resume" as True, but not specifying the checkpoint?
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L452
I think I know what's going on:
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L452
does not specify a type: it can take either a boolean or a string pointing to a checkpoint path.
What's going on is that the cloned Task passes back "False" as a string, and the automagic does not cast it to a boolean because the argument is defined with nargs='?' and no type, so it stays a string. The code then sees a truthy string, treats it as a resume checkpoint path, and the assertion fails because that file does not exist.
Does that make sense?
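To illustrate, here is a minimal repro of the failure path as I understand it (paraphrasing what main() does around that assert, so the exact lines are an approximation):

import os

opt_resume = "False"   # comes back from the cloned Task as a string, not the boolean False
if opt_resume:         # a non-empty string is truthy, so the resume branch is entered
    # a string value is treated as an explicit checkpoint path (the "latest run" fallback is a placeholder here)
    ckpt = opt_resume if isinstance(opt_resume, str) else "runs/train/exp/weights/last.pt"
    assert os.path.isfile(ckpt), 'ERROR: --resume checkpoint does not exist'  # isfile("False") is False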
I just made a custom repo from the Ultralytics YOLOv5 repo, where I get the data and model using a dataset ID and model ID.
up to date with https://fawad_nizamani@bitbucket.org/fawad_nizamani/custom_yolov5 ✅
Traceback (most recent call last):
  File "train.py", line 682, in <module>
    main(opt)
  File "train.py", line 525, in main
    assert os.path.isfile(ckpt), 'ERROR: --resume checkpoint does not exist'
AssertionError: ERROR: --resume checkpoint does not exist
It runs into the above error when I clone the task or reset it.
from here:
AssertionError: ERROR: --resume checkpoint does not exist
I assume the "internal" code state changed, and now it is looking for a file that does not exist. How would your code state change? In other words, why would it be looking for the file only when cloning? Could it be you put the state on the Task, then you clone it (i.e. clone the exact same dict), and now the newly cloned Task "thinks" it is resuming?!
When I pass the repo to clearml-task with the parameters, it runs fine and finishes. But when I clone the task and run it again, I get the above assert error, and I don't know why.
So when I create the task using clearml-task --repo, it runs fine. It only runs into the above error when I clone or reset the task.
I basically forked it for myself and made it accept ClearML dataset and model IDs.
I've basically just added dataset ID and model ID parameters to the args.
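Something along these lines (a simplified sketch, not the exact code in my fork; the argument names are placeholders):

import argparse
from clearml import Dataset, InputModel

parser = argparse.ArgumentParser()
parser.add_argument('--dataset-id', type=str, default='', help='ClearML dataset ID to train on')
parser.add_argument('--model-id', type=str, default='', help='ClearML model ID to start from')
opt = parser.parse_args()

if opt.dataset_id:
    data_dir = Dataset.get(dataset_id=opt.dataset_id).get_local_copy()  # fetch a local copy of the dataset
if opt.model_id:
    weights = InputModel(model_id=opt.model_id).get_weights()           # local path to the pretrained weights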
Another issue I'm having: I ran a task using clearml-task with a repo. It runs fine, but when I clone said task and run it on the same queue again, it throws an error from the code. I can't figure out why it's happening.
I actually just asked about this in another thread, here's the link. I was asking about the usage of upload_artifact.
Anyway, in the docs, there is a function called task.register_artifact()
Yes, this is rather deprecated... The idea is that it will monitor an object and auto-sync it (i.e. serialize and upload it).
That said, it is just so much easier to do task.upload_artifact,
and you can always update/overwrite by passing the same name, so I cannot see the actual use case. Does that make sense? What are you using it for?
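For example, a minimal sketch (assuming you already have a Task; the artifact name here is made up):

from clearml import Task

task = Task.init(project_name='examples', task_name='artifact demo')  # or Task.current_task() inside a running script
checkpoint_info = {'epoch': 10, 'optimizer': 'SGD'}
task.upload_artifact(name='checkpoint_info', artifact_object=checkpoint_info)  # first upload
checkpoint_info['epoch'] = 20
task.upload_artifact(name='checkpoint_info', artifact_object=checkpoint_info)  # same name, overwrites the previous version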
Anyway, in the resume argument there is default=False but also const=True. What's up with that? Or is const a separate parameter?
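const only applies when the flag is given without a value, which is what nargs='?' enables. A quick standalone illustration, not taken from the repo:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--resume', nargs='?', const=True, default=False)

print(parser.parse_args([]).resume)                       # flag absent         -> False (default)
print(parser.parse_args(['--resume']).resume)             # flag with no value  -> True  (const)
print(parser.parse_args(['--resume', 'last.pt']).resume)  # flag with a value   -> 'last.pt' (a string)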
My use case is that the code using PyTorch saves additional info, like the state dict, when saving the model. I'd like to save that information as an artifact as well, so that I can load it later.
I think I understand. Still, I've possibly pinned down the issue to something else. I'm not sure I'll be able to fix it on my own though.
However, cloning it takes the value from the ClearML args, which somehow converts it to a string?
I'm dumping a dict to JSON; how can I register that dict as an artifact?
You're suggesting that False is considered a string and not a bool?
The clearml-server always stores the values as strings (serializing them); the casting is done when they are passed back to the code at runtime. The issue here is there is actually no "way" to tell the argparser this is a boolean (basically any value that is passed is treated as a string). What I think we should do is fix the casting function so that if the stored string is exactly the same as the default value, we use the default value itself (i.e. the boolean). Does that make sense to you?
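Conceptually, something like this (just pseudocode of the idea, not the actual SDK code):

def cast_back(value_from_server, default):
    # if the stored string is exactly the string form of the default value,
    # return the default itself so its original type (e.g. the boolean False) is preserved
    if isinstance(value_from_server, str) and value_from_server == str(default):
        return default
    return value_from_server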
From what I recall, I think resume was set to False, both originally and in the cloned task.
Oh oh oh. Wait a second. I think I get what you're saying. When I'm originally creating the task with clearml-task, since I'm not passing the argument myself, it just uses the default value False.
However, when I clone or reset said task after completion and then enqueue it again, I get the above error.
This part is somewhat confusing... There is no magic happening behind the scenes; cloning a Task and creating it are basically the same... Do you have a reference to the YOLOv5 code base itself? Maybe I can figure out what the issue is.
The situation is that I needed a continuous training pipeline for a detector, the detector being Ultralytics YOLOv5.
To me, it made sense to have a training task. The whole training code seemed complex, so I modified it just a bit so that it gets the dataset and model from ClearML. Nothing more.
I then created a task using clearml-task and pointed it towards the repo I had created. The task runs fine.
I am unsure about the details of the training code itself, as I'm not very well versed in PyTorch. I'm also troubled by the fact that it always runs fine when I create a task from the repo using clearml-task, yet when I clone or reset said task after completion and then enqueue it again, I get the above error.