Answered

Dear ClearML community,
I am trying to optimize storage on my ClearML file server when doing a lot of experiments. To achieve this, I already upload only the newest and best checkpoints to the ClearML file server instead of all checkpoints. Another component that also takes up a lot of space is the debug samples (i.e., images).
This is not an issue during training (since I use them as a debugging aid), but I would be very interested in automatically erasing them all right after training.
Does anyone know if it is possible to programmatically remove all debug samples (i.e., the entire content of the "metrics" folder on the ClearML file server) at the end of a training? 🤔
I am almost sure such a feature doesn't exist yet, since this GitHub feature request about StorageManager for deleting files was only recently opened, but it's worth asking!
Thanks a lot for your support! 🙏

  
  
Posted 2 months ago

Answers 12


You're right, yes 👍, and this is precisely what I do 😁. But when trying to access the fourth "page" with the scroll_id returned on the third "page", I get the above error and cannot access the data on that fourth "page". This seems to be systematic: using the scroll_id of the penultimate "page" never gives access to the very last "page" 🤔.

I debugged using my browser and the following URLs (built from the scheme "api_server" + "/events.get_task_events" + "?task=" + "<my-task_id>" + "&scroll_id=" + "<scroll_id-of-the-previous-page>") to check whether I can access the events:

  • ✅ First page: duck.erx:8008/events.get_task_events?task=41d606f6bd274d7e8c1297b50507b8a9
  • ✅ Second page: None
  • ✅ Third page: None
  • ❌ Fourth (and final) page (with KeyError and no way to access the remaining events): None

In my Python code, I use the requests package, passing the following params (containing the scroll_id I iteratively retrieve) to the requests.get() function:
params = {"scroll_id": scroll_id}
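
For reference, my whole pagination loop looks roughly like the sketch below, with the HTTP call factored out into a fetch_page callable (so names like iter_task_events are mine, not ClearML's):

```python
from typing import Callable, Dict, Iterator, Optional


def iter_task_events(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield every event of a task, following scroll_id from page to page.

    fetch_page(scroll_id) must return a dict shaped like the
    events.get_task_events payload: {"events": [...], "scroll_id": "..."}.
    """
    scroll_id = None
    while True:
        page = fetch_page(scroll_id)
        events = page.get("events") or []
        if not events:
            break  # an empty page means everything has been scrolled over
        yield from events
        scroll_id = page.get("scroll_id")
```

where fetch_page simply wraps the requests.get() call above and returns the decoded JSON payload.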

Same observation here as well: it works fine until the very last "page", where the scroll_id of the penultimate "page" does not give access to the data on that last page 🙄.

I am not sure, but I suspect there is an issue in the ClearML API server file "clearml/apiserver/bll/event/events_iterator.py" 🤔, what do you think?

  
  
Posted one month ago

Hi @<1523701205467926528:profile|AgitatedDove14> ,
Thanks a lot for your recommendation, that's exactly it! 🤩
I was able to use the scroll_id of the current "page" to access the events of the next "page"!
This works fine and I can now delete almost all debug samples.
I say "almost" because, apparently, this scroll_id technique systematically fails to reach the events of the very last "page"...
In fact, as you can see in the picture below ⤵, I have a total of 2014 events. I can access the events of the first, second and third "pages" without trouble (with respectively 500, 506 and 507 events) but unfortunately, providing the scroll_id value of the third "page", I cannot access the remaining 2014 - (500 + 506 + 507) = 501 events of the very last "page".
As you can see, I get the following error:

    Traceback (most recent call last):
      File "/opt/clearml/apiserver/service_repo/service_repo.py", line 288, in handle_call
        ret = endpoint.func(call, company, call.data_model)
      File "/opt/clearml/apiserver/services/events.py", line 382, in get_task_events
        res = event_bll.events_iterator.get_task_events(
      File "/opt/clearml/apiserver/bll/event/events_iterator.py", line 51, in get_task_events
        res.events, res.total_events = self._get_events(
      File "/opt/clearml/apiserver/bll/event/events_iterator.py", line 132, in _get_events
        "must": must + [{"term": {key.field: events[-1][key.field]}}]
    KeyError: 'iter'

This error suggests that something goes wrong when accessing a key named 'iter' in the pagination-handling code, so it seems to be related to the ClearML API server code itself.
Have you ever encountered such a KeyError? Would you also expect using the scroll_id up to the very last "page" to fetch the last remaining data?
Again, thank you very much for your recommendation and help! 🙇
image

  
  
Posted one month ago

Thank you! 😁 🙇

Are you saying you see them in the UI, but cannot access them via the API?

Yes, that's it! As you can see in the video above ⤴, I can see the remaining images (i.e., the images that haven't been deleted) both in the UI and physically on my disk storage, but I cannot access them via the API (the URL that should lead to them does not exist).

(this would be strange as the UI is firing the same API requests to the back end)

And yes, this is strange, but it's what I observe! 😲 A few remaining images cannot be accessed via the API 🙁.
I can't prove it easily, but while debugging my code snippet I listed all images accessible via the API, and the remaining images are precisely those that do not appear in the API event list (I was not able to find them through the API).
In other words, as you can see in the picture below ⤵, some of the events contain one JPEG image URL (and that's fine 👍, I could retrieve each of those image URLs to delete the corresponding image from the server ✅), but unfortunately no event contains a URL that could have led to the few remaining images.
Consequently, since the API doesn't seem to be aware of the existence of those images, they cannot be accessed and hence cannot be deleted using the API. They simply remain on the server and are still visible in the UI after running my code.

This is why I wanted to ask whether you ever encountered such a limitation of the ClearML API "events.get_task_events" service, or what we could do to avoid leaving those few remaining images on the server 🤔.

Thank you again for your precious support! 🙏
image

  
  
Posted one month ago

None
Notice there is a scroll_id there, you might need to call the API multiple times until you scroll over all the events.
Could that be it?

  
  
Posted one month ago

Notice that you need to pass the returned scroll_id to the next call

scroll_id = response["scroll_id"]
  
  
Posted one month ago

Hello @<1523701205467926528:profile|AgitatedDove14> ,

Good news! It seems that using the list of URLs retrieved via "POST /events.get_task_events" and then deleting the corresponding images using the StorageHelper class effectively does the trick! 🏆 FYI, here is the little function I wrote for deleting the files:

    from clearml.storage.helper import StorageHelper
    
    def delete_image_from_clearml_server(image_url: str) -> None:
        # Resolve the storage backend (file server, S3, ...) matching the URL
        storage_helper = StorageHelper.get(url=image_url)
        try:
            storage_helper.delete(path=image_url)
        except Exception as e:
            raise ValueError(f"Could not remove image with URL '{image_url}': {e}")
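
In case it helps anyone, I extract the image URLs from the fetched events with the little helper below (note: the "url" field name is what I observe in the image events on my server, so adjust if yours differ):

```python
def collect_debug_image_urls(events: list) -> list:
    # Keep only events that carry a debug-sample URL; other event
    # types (scalars, console logs, ...) have no "url" field.
    return [event["url"] for event in events if event.get("url")]
```

and I then pass each returned URL to delete_image_from_clearml_server().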

Now, even though this function with StorageHelper works fine, I observed that some images are not referenced in the "/events.get_task_events" event list (even if I wait some time) 😥. In fact, they physically exist on the server, but no URL points to them. This implies that those few images are not detected and hence not deleted.
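
For completeness, this is roughly how I spotted them while debugging: I compared the URL list coming from the API against the files physically present in the "metrics" folder (a sketch; the folder path and the name-based matching are specific to my local setup):

```python
from pathlib import Path
from typing import Iterable, List


def find_orphan_images(metrics_dir: str, api_urls: Iterable[str]) -> List[Path]:
    # Files sitting under the metrics folder that no API event URL points to
    urls = list(api_urls)
    return sorted(
        path
        for path in Path(metrics_dir).rglob("*")
        if path.is_file() and not any(url.endswith(path.name) for url in urls)
    )
```

On my server, the files this returns are exactly the leftover images.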

Are you aware of a limitation of "/events.get_task_events" that prevents it from returning some of the images stored on the server? 🤔

Here is a picture and a video illustrating that 18 images were effectively deleted, but 2 were not listed in the events returned by "/events.get_task_events" and were hence not deleted...

Thank you very much for your insight!
Have a nice weekend 😉
image

  
  
Posted one month ago

However, regarding your recommendation of using StorageManager class to delete the URL, it seems that this class only contains methods for checking existence of files, downloading files and uploading files, but no method for actually deleting files based on their URL (see doc and ).

Yes, you are correct 😞 you should use a "deeper" class:

helper = StorageHelper.get(remote_url)
helper.delete(remote_url)
  
  
Posted one month ago

Alright, thank you for your insight @<1523701205467926528:profile|AgitatedDove14> ! I will check this link.
Regarding S3, that's a very good point, but the team I work with currently doesn't want to leverage an external cloud storage provider.

  
  
Posted 2 months ago

Okay, thank you for your snippet @<1523701205467926528:profile|AgitatedDove14> 🙏, I will investigate this class! 😉 👍

  
  
Posted one month ago

Nice!!!

Are you aware of a limitation of "/events.get_task_events" that prevents it from returning some of the images stored on the server

Are you saying you see them in the UI, but cannot access them via the API?
(this would be strange as the UI is firing the same API requests to the back end)

  
  
Posted one month ago

Hi @<1523701205467926528:profile|AgitatedDove14> ,
Thanks again for your insight.
I see how to retrieve the URLs via "POST /events.get_task_events".
However, regarding your recommendation of using the StorageManager class to delete the URL, it seems that this class only contains methods for checking existence of files, downloading files and uploading files, but no method for actually deleting files based on their URL (see doc here and here).
What do you have in mind when saying:

delete the URL you are getting via the StorageManager

Are you sure this feature exists?
Thank you very much again for your support! 🙏

  
  
Posted 2 months ago

Hi @<1663354518726774784:profile|CrookedSeal85>

I am trying to optimize storage on my ClearML file server when doing a lot of experiments.

This is not straightforward, you will need to get a list of all the events via
None
filter on image events
and then delete the URL you are getting via the StorageManager.
But to be honest, why not just direct it to S3 or something like that?

  
  
Posted 2 months ago