I ran training code from a GitHub repo. It saves checkpoints every 2000 iterations. The only problem is that I'm training for 3200 epochs and there are more than 37000 iterations in each epoch, so the checkpoints just added up. I've stopped the training for now. I need to delete all of those checkpoints before I start training again.
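If it helps, a minimal sketch of bulk-deleting the accumulated checkpoint files locally. The directory layout and the `*.pt` filename pattern are assumptions, not something from the repo:

```python
from pathlib import Path

def delete_checkpoints(ckpt_dir, pattern="*.pt"):
    """Delete every file matching `pattern` under `ckpt_dir`; return how many were removed."""
    removed = 0
    for ckpt in Path(ckpt_dir).glob(pattern):
        ckpt.unlink()  # remove the checkpoint file from local storage
        removed += 1
    return removed
```

Note this only clears local storage; anything already registered server-side would need to be removed through ClearML as well.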
I plan to append each checkpoint to a list; when len(list) > N, I'll pop the one with the highest loss and delete that file from ClearML and storage. That's how I plan to handle it.
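A minimal sketch of that rotation logic. The function name and the (loss, path) tuple layout are my own; actually deleting the evicted file from storage and from ClearML is left to the caller:

```python
def register_checkpoint(checkpoints, path, loss, n_keep=5):
    """Track (loss, path) pairs, keeping at most n_keep checkpoints.
    Returns the path to delete (the highest-loss one), or None if nothing was evicted."""
    checkpoints.append((loss, path))
    if len(checkpoints) > n_keep:
        worst = max(checkpoints, key=lambda c: c[0])  # highest loss = worst checkpoint
        checkpoints.remove(worst)
        return worst[1]  # caller deletes this file from storage / ClearML
    return None
```

Called once per saved checkpoint, this keeps only the N lowest-loss checkpoints alive at any time.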
Basically, if I pass an arg that's a bool with a default value of False, it runs fine originally, since it just accepts the default value.
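For context, the usual argparse gotcha here is that `type=bool` just calls `bool()` on the string, and any non-empty string is truthy, so the default only looks fine until someone actually passes a value; `action="store_true"` avoids that. A minimal sketch:

```python
import argparse

parser = argparse.ArgumentParser()
# Pitfall: type=bool runs bool() on the raw string, and bool("False") is True.
parser.add_argument("--resume-buggy", type=bool, default=False)
# Safer: the flag is False by default and becomes True only when passed.
parser.add_argument("--resume", action="store_true")

args = parser.parse_args(["--resume-buggy", "False"])
print(args.resume_buggy)  # True, even though the user typed "False"
print(args.resume)        # False (flag not passed)
```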
So right now, I'm creating an OutputModel and passing the current task in the constructor. Then I just save the TensorFlow Keras model. When I look at the model artifact's details in the ClearML UI, it's been saved the usual way, and none of the tags I added in the OutputModel constructor are there. From this it seems to me that ClearML is auto-logging the model, and that the model isn't connected to the OutputModel object I created.
You're saying that the model should get connected if I call up...
I initially wasn't able to get the value this way.
It was working fine for a while but then it just failed.
Basically, I'm trying to figure out how much of the tracking and record keeping ClearML does for me, and what things I need to keep track of manually in a database.
My use case is basically: if I now want to access this dataset from somewhere else, shouldn't I be able to do so using its ID?
'dataset' is the name of my Dataset Object
Also what's the difference between Finalize vs Publish?
I'm not in the best position to answer these questions right now.
This works, thanks. Do you have a link where I can also see the parameters of the Dataset class, or was it just in the Git source?
This is the original repo which I've slightly modified.
I shared the error above. I'm simply trying to make Ultralytics' YOLOv5 part of my pipeline.
From what I recall, resume was set to false both originally and in the cloned task.
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb2191dcaf0>: Failed to establish a new connection: [Errno 111] Connection refused')': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb2191e10a0>: Failed to establish a new connection: ...
I've basically just added dataset ID and model ID parameters in the args.
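A minimal sketch of what adding those two args might look like (the argument names and defaults are assumptions, not the actual ones from my repo):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical arg names: IDs used to fetch the dataset and pretrained model from ClearML.
parser.add_argument("--dataset-id", type=str, default=None)
parser.add_argument("--model-id", type=str, default=None)

args = parser.parse_args(["--dataset-id", "abc123"])
```

The script can then call the ClearML SDK with `args.dataset_id` / `args.model_id` to pull the right assets.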
dataset = Dataset.create(data_name, project_name)
print('Dataset Created, Adding Files...')
dataset.add_files(data_dir)
print('Files added successfully, Uploading Files...')
dataset.upload(output_url=upload_dir, show_progress=True)
The situation is such that I needed a continuous training pipeline to train a detector, the detector being Ultralytics YOLOv5.
To me, it made sense that I would have a training task. The whole training code seemed complex to me, so I only modified it a bit to fit my needs: getting the dataset and model from ClearML. Nothing more.
I then created a task using clearml-task and pointed it at the repo I had created. The task runs fine.
I am unsure at the details of the training code...
So when I run the task using clearml-task --repo and create a task, it runs fine. It runs into the above error when I clone the task or reset it.
It keeps retrying and failing when I use Dataset.get.
That is true. If I'm understanding correctly, by configuration parameters you mean using argparse, right?