That looks good to me, not sure
Hey 🙂 I had a similar issue today and found this solution:
In my case the codebase was saving a .pt file which was being picked up and logged as a model even though it was not one.
import os

from clearml import Task
from clearml.binding.frameworks import WeightsFileHandler

task = Task.init(
    project_name="task_project",
    task_name="task_name",
    task_type=Task.TaskTypes.training,
)

def filter_out_pt_files(operation_type, model_info):
    # skip anything whose local path ends in .pt, keep everything else
    is_pt_file = os.path.splitext(model_info.local_model_path)[1] == ".pt"
    return None if is_pt_file else model_info
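The callback still needs to be registered with WeightsFileHandler so it runs before anything gets logged; assuming the standard pre-callback hook (which is what that import is for), that part is a one-liner:

# register the filter so it runs before ClearML logs a weights file
WeightsFileHandler.add_pre_callback(filter_out_pt_files)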
What does it look like when you instantiate the output_model object?
For an update 🙂
I think we identified that when moving from a training dataset to a fine-tuning dataset (which was 1/1000th the size), our training script was still set to upload metrics every epoch. Seems like this resulted in a torrent of metrics being uploaded.
Since modifying this to be less frequent, we have seen the index latency drop dramatically
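Roughly speaking, the change was along these lines (a simplified sketch, not our actual training loop; run_one_epoch, the names, and the interval are just placeholders):

import random

from clearml import Task

task = Task.init(project_name="task_project", task_name="finetune_example")
logger = task.get_logger()

REPORT_EVERY = 10  # report every 10th epoch instead of every epoch

def run_one_epoch():
    # placeholder for the real training loop
    return random.random()

for epoch in range(100):
    train_loss = run_one_epoch()
    if epoch % REPORT_EVERY == 0:
        logger.report_scalar(title="loss", series="train", value=train_loss, iteration=epoch)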
Also, the error you are showing is inside calculate_metrics.py
Is that a ClearML lib or something custom?
It happens, happy training 🚀
I need to add a callback for it to filter out anything with .pt
Also interested in how this is being approached 🙂 What you mentioned is exactly what I am doing
If anyone knows a better way, would love to hear about it 🙂
If you added a print there like:
def filter_out_pt_files(operation_type, model_info):
    print(model_info.__dict__)
    return model_info
You can see what is being picked up. If there is a common path etc. you can filter that out
If you can identify a pattern in the YOLOv8 output files you can probably also filter them out 🙂
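For example, something along these lines (the runs/detect path is only an assumption about where your YOLOv8 run writes its weights, so adjust it to whatever path shows up in that print):

import os

from clearml.binding.frameworks import WeightsFileHandler

# assumed output location, YOLOv8 usually writes under runs/detect/<run_name>/weights/
YOLO_OUTPUT_FRAGMENT = os.path.join("runs", "detect")

def filter_yolo_outputs(operation_type, model_info):
    # returning None tells ClearML to skip registering this file as a model
    if YOLO_OUTPUT_FRAGMENT in (model_info.local_model_path or ""):
        return None
    return model_info

WeightsFileHandler.add_pre_callback(filter_yolo_outputs)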
As PyTorch Lightning is a framework on top of PyTorch, it will work the same, if not better, with ClearML
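As a rough sketch of what I mean (the project/task names and the tiny model are just placeholders): Task.init() goes in before the Trainer is built, and ClearML's framework bindings pick up what Lightning logs and saves.

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from clearml import Task

task = Task.init(project_name="task_project", task_name="lightning_example")

class TinyRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

# random data just to keep the example self-contained
loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)

trainer = pl.Trainer(max_epochs=1)
trainer.fit(TinyRegressor(), loader)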
One option might be to delete the local copy of the dataset and try to re-download it. Perhaps something has gone wrong with the local copy?
Might be worth running the command again with the --verbose flag. It will likely give more details on what is causing the failure
Currently running it on a t3.xlarge which has 4 vCPUs, 16 GB RAM and a 300 GB SSD
Hope you can get something to work 🤞
Looks like it's a /mnt path, which might mean it's a drive or something similar that was connected and may not be any more?
For something quick, if you create a new folder to put your dataset in:
mkdir ./test_dataset_location
Then you can run your command with:
CLEARML_CACHE_DIR='./test_dataset_location' clearml-data ... <your command here>
It will try to download into that folder
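If it is easier from Python, the SDK can also drop a standalone copy straight into a folder of your choosing (the dataset id below is a placeholder):

from clearml import Dataset

# "YOUR_DATASET_ID" is a placeholder - use your own dataset id
dataset = Dataset.get(dataset_id="YOUR_DATASET_ID")

# get_mutable_local_copy() writes a standalone copy into the target folder,
# instead of going through the shared cache location
local_path = dataset.get_mutable_local_copy(target_folder="./test_dataset_location")
print(local_path)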