Answered

Dear ClearML community,
I am trying to optimize storage on my ClearML file server when doing a lot of experiments. To achieve this, I already upload only the newest and best checkpoints to the ClearML file server instead of all checkpoints. Another component that also takes up a lot of space is the debug samples (i.e., images).
This is not an issue during training (since I use them as a debugging aid), but I would be very interested in automatically erasing them all right after training.
Does anyone know if it is possible to programmatically remove all debug samples (i.e., the entire content of the "metrics" folder on the ClearML file server) at the end of a training? 🤔
I am almost sure such a feature doesn't exist yet, since this GitHub feature request about StorageManager for deleting files was only recently opened, but it's worth asking!
Thanks a lot for your support! 🙏

  
  
Posted 2 months ago

Answers 12


You're right, yes 👍, and this is precisely what I do 😁. But when trying to access the fourth "page" with the scroll_id returned on the third "page", I get the above error and cannot access the data on that fourth "page". This seems to be systematic: using the scroll_id of the penultimate "page" never gives access to the very last "page" 🤔.

I debugged using my browser and the following URLs (built from the scheme "api_server" + "/events.get_task_events" + "?task=" + "<my-task_id>" + "&scroll_id=" + "<scroll_id-of-the-previous-page>") to check whether I can access the events:

  • ✅ First page: duck.erx:8008/events.get_task_events?task=41d606f6bd274d7e8c1297b50507b8a9
  • ✅ Second page: None
  • ✅ Third page: None
  • ❌ Fourth (and final) page (with KeyError and no way to access the remaining events): None

In my Python code, I use the requests package, passing the following params (containing the scroll_id I iteratively retrieve) to the requests.get() function:
params = {"scroll_id": scroll_id}
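
For reference, my whole pagination loop looks roughly like the sketch below, with the HTTP call factored out into a fetch_page callable (so names like iter_task_events are mine, not ClearML's):

```python
from typing import Callable, Dict, Iterator, Optional


def iter_task_events(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield every event of a task, following scroll_id from page to page.

    fetch_page(scroll_id) must return a dict shaped like the
    events.get_task_events payload: {"events": [...], "scroll_id": "..."}.
    """
    scroll_id = None
    while True:
        page = fetch_page(scroll_id)
        events = page.get("events") or []
        if not events:
            break  # an empty page means everything has been scrolled over
        yield from events
        scroll_id = page.get("scroll_id")
```

where fetch_page simply wraps the requests.get() call above and returns the decoded JSON payload.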

Same observation here as well: it works fine until the very last "page", where the scroll_id of the penultimate "page" does not give access to the data on that last page 🙄.

I am not sure, but I suspect there is an issue in the ClearML API server file "clearml/apiserver/bll/event/events_iterator.py" 🤔, what do you think?

  
  
Posted one month ago

Hi @<1523701205467926528:profile|AgitatedDove14> ,
Thanks a lot for your recommendation, that's exactly it! 🤩
I was able to use the scroll_id of the current "page" to access the events of the next "page"!
This works fine and I can now delete almost all debug samples.
I say "almost" because, apparently, this scroll_id technique systematically fails to reach the events of the very last "page"...
In fact, as you can see in the picture below ⤵, I have a total of 2014 events. I can access the events of the first, second and third "pages" without trouble (with respectively 500, 506 and 507 events) but unfortunately, providing the scroll_id value of the third "page", I cannot access the remaining 2014 - (500 + 506 + 507) = 501 events of the very last "page".
As you can see, I get the following error:

    Traceback (most recent call last):
      File "/opt/clearml/apiserver/service_repo/service_repo.py", line 288, in handle_call
        ret = endpoint.func(call, company, call.data_model)
      File "/opt/clearml/apiserver/services/events.py", line 382, in get_task_events
        res = event_bll.events_iterator.get_task_events(
      File "/opt/clearml/apiserver/bll/event/events_iterator.py", line 51, in get_task_events
        res.events, res.total_events = self._get_events(
      File "/opt/clearml/apiserver/bll/event/events_iterator.py", line 132, in _get_events
        "must": must + [{"term": {key.field: events[-1][key.field]}}]
    KeyError: 'iter'

This error suggests that something goes wrong when accessing a key named 'iter' in the pagination-handling code, so it seems to be related to the ClearML API server code itself.
Have you ever encountered such a KeyError? Would you also expect using the scroll_id up to the very last "page" to fetch the last remaining data?
Again, thank you very much for your recommendation and help! 🙇
image

  
  
Posted one month ago

Thank you! 😁 🙇

Are you saying you see them in the UI, but cannot access them via the API?

Yes, that's it! As you can see in the video above ⤴, I can see the remaining images (i.e., the images that haven't been deleted) both in the UI and physically on my disk storage, but I cannot access them via the API (the URL that should lead to them does not exist).

(this would be strange as the UI is firing the same API requests to the back end)

And yes, this is strange, but it's what I observe! 😲 A few remaining images cannot be accessed via the API 🙁.
I can't prove it easily, but while debugging my code snippet I listed all images accessible via the API, and the remaining images are precisely those that do not appear in the API event list (I was not able to find them through the API).
In other words, as you can see in the picture below ⤵, some of the events contain one JPEG image URL (and that's fine 👍, I could retrieve each of those image URLs to delete the corresponding image from the server ✅), but unfortunately no event contains a URL that could have led to the few remaining images.
Consequently, since the API doesn't seem to be aware of the existence of those images, they cannot be accessed and hence cannot be deleted using the API. They simply remain on the server and are still visible in the UI after running my code.

This is why I wanted to ask whether you ever encountered such a limitation of the ClearML API "events.get_task_events" service, or what we could do to avoid leaving those few remaining images on the server 🤔.

Thank you again for your precious support! 🙏
image

  
  
Posted one month ago

None
Notice there is a scroll_id there, you might need to call the API multiple times until you scroll over all the events.
Could that be it?

  
  
Posted one month ago

Notice that you need to pass the returned scroll_id to the next call

scroll_id = response["scroll_id"]
  
  
Posted one month ago

Hello @<1523701205467926528:profile|AgitatedDove14> ,

Good news! It seems that using the list of URLs retrieved via "POST /events.get_task_events" and then deleting the corresponding images using the StorageHelper class effectively does the trick! 🏆 FYI, here is the little function I wrote for deleting the files:

    from clearml.storage.helper import StorageHelper
    
    def delete_image_from_clearml_server(image_url: str) -> None:
        # Resolve the storage backend (file server, S3, ...) matching the URL
        storage_helper = StorageHelper.get(url=image_url)
        try:
            storage_helper.delete(path=image_url)
        except Exception as e:
            raise ValueError(f"Could not remove image with URL '{image_url}': {e}")
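
In case it helps anyone, I extract the image URLs from the fetched events with the little helper below (note: the "url" field name is what I observe in the image events on my server, so adjust if yours differ):

```python
def collect_debug_image_urls(events: list) -> list:
    # Keep only events that carry a debug-sample URL; other event
    # types (scalars, console logs, ...) have no "url" field.
    return [event["url"] for event in events if event.get("url")]
```

and I then pass each returned URL to delete_image_from_clearml_server().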

Now, even though this function with StorageHelper works fine, I observed that some images are not referenced in the "/events.get_task_events" event list (even if I wait some time) 😥. In fact, they physically exist on the server, but no URL points to them. This implies that those few images are not detected and hence not deleted.
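
For completeness, this is roughly how I spotted them while debugging: I compared the URL list coming from the API against the files physically present in the "metrics" folder (a sketch; the folder path and the name-based matching are specific to my local setup):

```python
from pathlib import Path
from typing import Iterable, List


def find_orphan_images(metrics_dir: str, api_urls: Iterable[str]) -> List[Path]:
    # Files sitting under the metrics folder that no API event URL points to
    urls = list(api_urls)
    return sorted(
        path
        for path in Path(metrics_dir).rglob("*")
        if path.is_file() and not any(url.endswith(path.name) for url in urls)
    )
```

On my server, the files this returns are exactly the leftover images.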

Are you aware of a limitation of "/events.get_task_events" that prevents it from returning some of the images stored on the server? 🤔

Here is a picture and a video illustrating that 18 images were effectively deleted, but 2 were not listed in the events returned by "/events.get_task_events" and were hence not deleted...

Thank you very much for your insight!
Have a nice weekend 😉
image

  
  
Posted one month ago

However, regarding your recommendation of using StorageManager class to delete the URL, it seems that this class only contains methods for checking existence of files, downloading files and uploading files, but no method for actually deleting files based on their URL (see doc and ).

Yes, you are correct 😞 you should use a "deeper" class:

helper = StorageHelper.get(remote_url)
helper.delete(remote_url)
  
  
Posted one month ago

Alright, thank you for your insight @<1523701205467926528:profile|AgitatedDove14> ! I will check this link.
Regarding S3, that's a very good point, but the team I work with currently doesn't want to leverage an external cloud storage provider.

  
  
Posted 2 months ago

Okay, thank you for your snippet @<1523701205467926528:profile|AgitatedDove14> 🙏, I will investigate this class! 😉 👍

  
  
Posted one month ago

Nice!!!

Are you aware of a limitation of "/events.get_task_events" that prevents it from returning some of the images stored on the server

Are you saying you see them in the UI, but cannot access them via the API?
(this would be strange as the UI is firing the same API requests to the back end)

  
  
Posted one month ago

Hi @<1523701205467926528:profile|AgitatedDove14> ,
Thanks again for your insight.
I see how to retrieve the URLs via "POST /events.get_task_events".
However, regarding your recommendation of using the StorageManager class to delete the URL, it seems that this class only contains methods for checking existence of files, downloading files and uploading files, but no method for actually deleting files based on their URL (see doc here and here).
What do you have in mind when saying:

delete the URL you are getting via the StorageManager

Are you sure this feature exists?
Thank you very much again for your support! 🙏

  
  
Posted 2 months ago

Hi @<1663354518726774784:profile|CrookedSeal85>

I am trying to optimize storage on my ClearML file server when doing a lot of experiments.

This is not straightforward, you will need to get a list of all the events via
None
filter on image events
and then delete the URL you are getting via the StorageManager.
But to be honest, why not just direct it to S3 or something like that?

  
  
Posted 2 months ago