If you can identify a pattern in the YOLOv8 output files you can probably also filter them out 🙂
I need to add a callback for it to filter out anything with a .pt extension
If anyone knows a better way, would love to hear about it 🙂
Hey 🙂 I had a similar issue today and found this solution:
In my case this codebase was using a .pt filetype, which was being picked up and logged as a model even though it was not.
import os
from clearml import Task
from clearml.binding.frameworks import WeightsFileHandler
task = Task.init(
    project_name="task_project",
    task_name="task_name",
    task_type=Task.TaskTypes.training,
)

def filter_out_pt_files(operation_type, model_info):
    is_pt_file = os.path.splitext...
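The snippet above is cut off, so here is a minimal sketch of how such a callback could be completed. It is not the original code. It assumes the model info object exposes the file path as local_model_path, and that returning None from a pre-callback tells ClearML to skip the file. The SimpleNamespace objects are hypothetical stand-ins so the sketch runs without ClearML installed.

```python
import os
from types import SimpleNamespace


def filter_out_pt_files(operation_type, model_info):
    """Return None for .pt files (skip them), otherwise pass model_info through."""
    # Assumption: the path of the weights file lives on `local_model_path`.
    path = str(getattr(model_info, "local_model_path", "") or "")
    is_pt_file = os.path.splitext(path)[1].lower() == ".pt"
    return None if is_pt_file else model_info


# With ClearML installed, registration would look roughly like this:
# from clearml.binding.frameworks import WeightsFileHandler
# WeightsFileHandler.add_pre_callback(filter_out_pt_files)

# Hypothetical stand-ins for the objects ClearML would pass to the callback.
pt_info = SimpleNamespace(local_model_path="runs/weights/best.pt")
ckpt_info = SimpleNamespace(local_model_path="checkpoints/epoch3.ckpt")

skipped = filter_out_pt_files("save", pt_info)
kept = filter_out_pt_files("save", ckpt_info)
```

Here skipped is None and kept is the untouched ckpt_info object, so only the .pt file would be dropped from model logging.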
As PyTorch Lightning is a framework built on top of PyTorch, it should work the same with ClearML, if not better
One option might be to delete the local copy of the dataset and try to re-download it. Perhaps something has gone wrong with the local copy?
Currently running it on a t3.xlarge, which has 4 CPUs, 16GB RAM and a 300GB SSD
For an update 🙂
I think we identified that when moving from a training dataset to a fine-tuning dataset (which was 1/1000th the size), our training script was still set to upload every epoch. With epochs that much shorter, this seems to have resulted in a torrent of metrics being uploaded.
Since modifying this to be less frequent, we have seen the index latency drop dramatically
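A rough sketch of what "less frequent" could look like, assuming a simple every-N-epochs rule (the interval we actually used is not stated here). RecordingLogger is a hypothetical stand-in that mirrors the report_scalar(title, series, value, iteration) shape of the ClearML logger, so the example runs without a server. The loss value is a placeholder.

```python
def should_report(epoch, report_every):
    """True on every `report_every`-th epoch, counting epochs from 1."""
    return (epoch + 1) % report_every == 0


class RecordingLogger:
    """Hypothetical stand-in: records calls instead of uploading metrics."""

    def __init__(self):
        self.calls = []

    def report_scalar(self, title, series, value, iteration):
        self.calls.append((title, series, value, iteration))


def train(num_epochs, report_every, logger):
    for epoch in range(num_epochs):
        loss = 1.0 / (epoch + 1)  # placeholder for the real training step
        if should_report(epoch, report_every):
            logger.report_scalar(
                title="loss", series="train", value=loss, iteration=epoch
            )


logger = RecordingLogger()
train(num_epochs=100, report_every=10, logger=logger)
```

With these numbers the logger is called 10 times instead of 100. With ClearML you would pass task.get_logger() in place of the stand-in.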
Might be worth running the command again with the --verbose flag. It will likely give more details on what is causing the failure
What does it look like when you instantiate the output_model object?
If you added a print there like:
def filter_out_pt_files(operation_type, model_info):
    print(model_info.__dict__)
    return model_info
You can see what is being picked up. If there is a common path etc. you can filter that out
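Once the print shows a common path, a filter on it could look something like the sketch below. The directory name yolo_cache is made up for illustration. The sketch also assumes the path is available on local_model_path and that returning None skips the file. The SimpleNamespace objects are stand-ins so it runs without ClearML.

```python
from pathlib import PurePosixPath
from types import SimpleNamespace

# Hypothetical directory name; replace with whatever the print output shows.
EXCLUDED_DIR = "yolo_cache"


def filter_by_common_path(operation_type, model_info):
    """Skip any file that sits under EXCLUDED_DIR, keep everything else."""
    # Assumption: the file path is exposed as `local_model_path`.
    path = str(getattr(model_info, "local_model_path", "") or "")
    if EXCLUDED_DIR in PurePosixPath(path).parts:
        return None
    return model_info


cached = SimpleNamespace(local_model_path="/home/user/yolo_cache/yolov8n.pt")
trained = SimpleNamespace(local_model_path="/home/user/runs/train/best.pt")

cached_result = filter_by_common_path("load", cached)
trained_result = filter_by_common_path("save", trained)
```

Matching on a whole path component rather than a substring avoids accidentally dropping files whose names merely contain the directory name.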
Also interested in how this is being approached 🙂 What you mentioned is exactly what I am doing
Hope you can get something to work 🤞
Also, the error you are showing is inside calculate_metrics.py. Is that a ClearML library or something custom?
That looks good to me, not sure
Looks like it's a /mnt path, which might mean it's a drive or something similar that was connected and may not be any more?
For something quick, create a new folder to put your dataset in:
mkdir ./test_dataset_location
Then you can run your command with:
CLEARML_CACHE_DIR='./test_dataset_location' clearml-data ... <your command here>
It will try to download into that folder
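If you would rather do the same thing from a script, here is a small sketch. The helper name build_env_with_cache_dir is made up, and the clearml-data command is left as a comment for you to fill in. Unlike the shell version, it resolves the folder to an absolute path.

```python
import os
from pathlib import Path


def build_env_with_cache_dir(cache_dir, base_env=None):
    """Create cache_dir if needed and return an env dict pointing ClearML at it."""
    cache_path = Path(cache_dir).resolve()
    cache_path.mkdir(parents=True, exist_ok=True)
    env = dict(os.environ if base_env is None else base_env)
    env["CLEARML_CACHE_DIR"] = str(cache_path)
    return env


env = build_env_with_cache_dir("./test_dataset_location")

# Then pass `env` to your own command, for example:
# import subprocess
# subprocess.run(["clearml-data", "<your command here>"], env=env, check=True)
```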
It happens, happy training 🚀