Might be worth running the command again with the --verbose flag. It will likely give more details on what is causing the failure.
Also, the error you are showing is inside calculate_metrics.py.
Is that a ClearML library or something custom?
It happens, happy training 🚀
For an update 🙂
I think we identified that when moving from a training dataset to a fine-tuning dataset (which was 1/1000th the size), our training script was set to upload every epoch. It seems this resulted in a torrent of metrics being uploaded.
Since modifying this to upload less frequently, we have seen the index latency drop dramatically.
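For illustration, a minimal sketch of that kind of throttling with ClearML's explicit Logger API; the report_every value and the metric names are hypothetical, not taken from our actual script:

from clearml import Task

task = Task.init(project_name="examples", task_name="throttled_reporting")
logger = task.get_logger()
report_every = 10  # only push scalars every 10th epoch

for epoch in range(100):
    train_loss = 1.0 / (epoch + 1)  # stand-in for the real metric
    if epoch % report_every == 0:
        # report_scalar(title, series, value, iteration) is ClearML's explicit reporting call
        logger.report_scalar(title="loss", series="train", value=train_loss, iteration=epoch)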
If you added a print there like:
def filter_out_pt_files(operation_type, model_info):
    print(model_info.__dict__)
    return model_info
You can see what is being picked up. If there is a common path etc. you can filter that out.
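For the print to actually run, the callback also has to be registered; a quick sketch, assuming WeightsFileHandler.add_pre_callback is the registration hook (the import appears in the snippet further down):

from clearml.binding.frameworks import WeightsFileHandler

# register the debug callback so it fires whenever a framework file is about to be logged
WeightsFileHandler.add_pre_callback(filter_out_pt_files)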
Hey 🙂 I had a similar issue today and found this solution:
In my case this codebase was using a .pt filetype which was being picked up and logged as a model even though it was not.
import os
from clearml import Task
from clearml.binding.frameworks import WeightsFileHandler

task = Task.init(
    project_name="task_project",
    task_name="task_name",
    task_type=Task.TaskTypes.training,
)

def filter_out_pt_files(operation_type, model_info):
    is_pt_file = os.path.splitext...
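The snippet gets cut off above, so here is a hedged completion of the callback; it assumes model_info exposes the local file path as local_model_path and that returning None skips registering the file:

import os
from clearml.binding.frameworks import WeightsFileHandler

def filter_out_pt_files(operation_type, model_info):
    # assumption: model_info.local_model_path holds the path of the file being registered
    is_pt_file = os.path.splitext(model_info.local_model_path)[1] == ".pt"
    if is_pt_file:
        # returning None tells ClearML to skip logging this file as a model
        return None
    return model_info

# run the filter before every automatic model registration
WeightsFileHandler.add_pre_callback(filter_out_pt_files)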
Currently running it on a t3.xlarge, which has 4 CPUs, 16 GB RAM, and a 300 GB SSD.
If you can identify a pattern in the YOLOv8 output files you can probably also filter them out 🙂
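For illustration, the same idea keyed on a path pattern instead of the file extension; the "runs/detect" string is an assumption about where YOLOv8 writes its outputs, so replace it with whatever pattern shows up in your model_info print-out:

from clearml.binding.frameworks import WeightsFileHandler

def filter_out_yolo_outputs(operation_type, model_info):
    # assumption: YOLOv8 artifacts land under a "runs/detect" directory
    if "runs/detect" in model_info.local_model_path:
        return None  # skip registering this file as a model
    return model_info

WeightsFileHandler.add_pre_callback(filter_out_yolo_outputs)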
Hope you can get something to work 🤞
Also interested in how this is being approached 🙂 What you mentioned is exactly what I am doing
I need to add a callback for it to filter out anything with .pt
If anyone knows a better way, would love to hear about it 🙂
Looks like it's a /mnt path, which might mean it's a drive or something similar that was connected and may not be anymore?
For something quick, if you create a new folder to put your dataset in:
mkdir ./test_dataset_location
Then you can run your command with:
CLEARML_CACHE_DIR='./test_dataset_location' clearml-data ... <your command here>
It will try to download into that folder.
One option might be to delete the local copy of the dataset and try to re-download it. Perhaps something has gone wrong with the local copy?
As PyTorch Lightning is a framework on top of PyTorch, it will work the same, if not better, with ClearML.
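For illustration, a minimal sketch of that (the model, data, and project/task names are made up): calling Task.init() before building the Trainer lets ClearML's automatic framework bindings pick up the scalars Lightning logs through TensorBoard and the checkpoints it saves.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from clearml import Task

# init the task first so ClearML can hook into TensorBoard logging and checkpoint saving
task = Task.init(project_name="examples", task_name="lightning_demo")

class TinyRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)  # shows up in ClearML via the TensorBoard binding
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

data = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)
trainer = pl.Trainer(max_epochs=2)
trainer.fit(TinyRegressor(), data)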
What does it look like when you instantiate the output_model object?
That looks good to me, not sure