Would this be a good example?
https://github.com/allegroai/clearml/blob/master/examples/services/monitoring/slack_alerts.py
DrabCockroach54, you can set it all up. I suggest you open the developer tools (F12) and see how it is done in the UI; you can then implement the same calls in code.
For example, filtering tasks that started 10 minutes ago is something you can view via the UI.
Perhaps due to size? Are you running behind any firewall or any other network component?
Can you try something like: client.tasks.get_all(status=["in_progress"])
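Something like this should work (a minimal sketch, assuming your clearml.conf credentials are already set up):
```python
from clearml.backend_api.session.client import APIClient

client = APIClient()  # authenticates using your clearml.conf

# Fetch only tasks currently in the "in_progress" (i.e. Running) state
tasks = client.tasks.get_all(status=["in_progress"])
for task in tasks:
    print(task.id, task.name, task.status)
```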
How do I know what the possible options for status are? Same for the other parameters - I don't see those in the documentation.
https://clear.ml/docs/latest/docs/references/api/tasks#post-tasksget_all
My goal is to detect events where a task does not use its allocated resources (e.g. a GPU) for some period of time.
I am still trying to understand the ClearML API response.
Do you have any clue how I can get this from client.tasks.get_all(status=["in_progress"])?
If a task has a GPU allocated but is not using it, would it also be in the in_progress status? I want to collect those tasks.
I see the task runtime info. I guess it's the current utilization, not the allocation, but I'm not sure.
"runtime": {
"progress": "0",
"platform": "linux",
"python_version": "3.8.0",
"python_exec": "/root/.clearml/venvs-builds/3.8/bin/python",
"OS": "Linux-5.15.0-1013-gcp-x86_64-with-glibc2.27",
"processor": "x86_64",
"cpu_cores": 256,
"memory_gb": 1007.7,
"hostname": "",
"gpu_count": 1,
"gpu_type": "NVIDIA xxx -40GB",
That's exactly what I did... I was thinking more in terms of the size of the response body, not a different endpoint.
This section is internal implementation - we can't guarantee it will not change. As for an unused GPU - in general, if you run a task with an agent started with the --gpus switch, a GPU will be allocated for as long as the task is running. I think the main concern is trying to make sure your task makes the most out of the GPU...?
SuccessfulKoala55 Yeah, that's possible, but then I don't get why a firewall would block only one endpoint's response. I tried both workers.get_all() and get_stats(), and both worked.
Can you share the snippet you used for tasks.get_all()?
```python
from clearml.backend_api.session.client import APIClient
from time import time

# Create an instance of APIClient
client = APIClient()
tasks = client.tasks.get_all()
```
This is what I used.
The doc mentions a required request body parameter type. Do I need to add this as a parameter?
I think either something is wrong with my request or it could be my permissions.
I am testing the API against the ClearML server running in our production environment. The server itself is running fine.
I see it now.
"5451af93e0bf68a4ab09f654b222ccae": { "1b790a3da2e8d6cd939cf271694fe81b": { "metric": ":monitor:gpu", "variant": "gpu_0_utilization", "value": 0.0, "min_value": 0.0, "max_value": 3.542 }, "409d4e6ad9b69b3224fceeac6e265ddc": { "metric": ":monitor:gpu", "variant": "gpu_0_mem_used_gb", "value": 0.0, "min_value": 0.0, "max_value": 0.0 }, "74646afee0e0ab18d3cbd08ce1ff6aa3": { "metric": ":monitor:gpu", "variant": "gpu_0_mem_usage", "value": 0.002, "min_value": 0.002, "max_value": 54.739 }, "abdb01e1de566d2165e902fe0839465e": { "metric": ":monitor:gpu", "variant": "gpu_0_mem_free_gb", "value": 47.461, "min_value": 21.482, "max_value": 47.461 }, "db472ace8c40b8a9f3e11ec348920662": { "metric": ":monitor:gpu", "variant": "gpu_0_temperature", "value": 46.0, "min_value": 45.0, "max_value": 59.46 } } },
E.g., to query tasks that are both Running --> you mean status=["in_progress"]? How do I figure out other possible values I can use with the status parameter?
Another one: filter only tasks that started, say, 10 minutes ago. Is there a parameter for that as well?
> query tasks that are both Running --> You mean status=["in_progress"]
Yes!
> How do I figure out other possible values I can use with the status parameter?
https://clear.ml/docs/latest/docs/references/api/tasks#post-tasksget_all
https://clear.ml/docs/latest/docs/references/api/definitions#taskstask
> Filter only tasks that started, say, 10 minutes ago. Is there a parameter for that as well?
Use last_update or created, then apply a filter similar to this one:
https://github.com/allegroai/clearml/blob/ff7b174bf162347b82226f413040ff6473401e92/examples/services/cleanup/cleanup_service.py#L70
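For example (a sketch; the "<timestamp" comparison pattern follows the cleanup example linked above):
```python
from datetime import datetime, timedelta
from clearml.backend_api.session.client import APIClient

client = APIClient()
ten_minutes_ago = datetime.utcnow() - timedelta(minutes=10)

# Running tasks created more than 10 minutes ago
# ("<timestamp" means "before timestamp", same pattern as the cleanup service)
tasks = client.tasks.get_all(
    status=["in_progress"],
    created=["<{}".format(ten_minutes_ago.strftime("%Y-%m-%d %H:%M:%S"))],
)
```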
Is it your own server installation or are you using the SaaS?
DrabCockroach54 that is quite cool!
Basically, here is what I would do:
Query tasks that are both Running and do not have the system tag "development" (that means they are running on agents), and filter only tasks that started, say, 10 minutes ago.
Then go over the list and check (1) whether they have a GPU scalar reported (meaning the GPU is accessible), and (2) whether the min/max/value of the GPU utilization is under 5%.
wdyt? Something like the sketch below.
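A minimal sketch of that flow (assumes last_metrics comes back as the nested dict shown in the payload above; the 5% threshold and 10-minute window are placeholders):
```python
from datetime import datetime, timedelta
from clearml.backend_api.session.client import APIClient

UTIL_THRESHOLD = 5.0   # GPU utilization percent - placeholder
WINDOW_MINUTES = 10    # minimum task age - placeholder

client = APIClient()
cutoff = datetime.utcnow() - timedelta(minutes=WINDOW_MINUTES)

# Running tasks, excluding the "development" system tag ("-" prefix excludes),
# whose status changed (~ started running) before the cutoff
tasks = client.tasks.get_all(
    status=["in_progress"],
    system_tags=["-development"],
    status_changed=["<{}".format(cutoff.strftime("%Y-%m-%d %H:%M:%S"))],
)

for task in tasks:
    # Collect all :monitor:gpu utilization variants from last_metrics
    gpu_util = [
        variant
        for variants in (task.last_metrics or {}).values()
        for variant in variants.values()
        if variant.get("metric") == ":monitor:gpu"
        and "utilization" in variant.get("variant", "")
    ]
    if not gpu_util:
        print(task.id, "-> no GPU scalars reported (GPU not visible to the task?)")
    elif all(v.get("max_value", 0.0) < UTIL_THRESHOLD for v in gpu_util):
        print(task.id, "-> GPU allocated but utilization stayed under threshold")
```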
I found system_tags and all the metrics, including CPU, but I can't find any field that mentions a reported GPU scalar or GPU utilization.
When in table view (rows), there is a small icon next to the 'Started' column, where you can configure the time periods you'd like to view 🙂
I see. Dev tools are useful here for finding the API endpoints used for the data, and https://github.com/allegroai/clearml/blob/master/clearml/task.py#L987 is what I was looking for.
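For reference, the SDK-level query looks roughly like this (a sketch assuming that link points at Task.get_tasks; the task_filter keys mirror the tasks.get_all API parameters):
```python
from clearml import Task

# SDK-level equivalent of the tasks.get_all endpoint
tasks = Task.get_tasks(
    task_filter={
        "status": ["in_progress"],
        "order_by": ["-last_update"],
    }
)
for t in tasks:
    print(t.id, t.name)
```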
Thanks
```
# which python
/Users/anuj.tyagi/clearml_api/venv/bin/python
(venv) LMWPRW6F3:clearml_api root# pip freeze | grep clearml
clearml==1.7.2

Traceback (most recent call last):
  File "get_all_task.py", line 8, in <module>
    print(client.tasks.get_all())
  File "/Users/anuj.tyagi/clearml_api/venv/lib/python3.8/site-packages/clearml/backend_api/session/client/client.py", line 422, in get
    result=self.session.send(request_cls(*args, **kwargs)),
  File "/Users/anuj.tyagi/clearml_api/venv/lib/python3.8/site-packages/clearml/backend_api/session/client/client.py", line 124, in send
    raise APIError(result, extra_info="Invalid response")
clearml.backend_api.session.client.client.APIError: APIError: Invalid response: code 200: OK
```
Yeah, the docstring is always the most up to date 🙂
It would be great to have the possible values for the given parameters mentioned here: https://clear.ml/docs/latest/docs/references/api/tasks#post-tasksget_all
Any clue how I can figure those out?
DrabCockroach54 I just tested with both ClearML SDK 1.7.1 and 1.7.2 and both returned a valid response to client.tasks.get_all()
when running against the free-hosted app.clear.ml
How can this even be a Python issue when one endpoint returns a response and the other doesn't?
"tags": [], "system_tags": [ "interactive" ], "status_changed": "2022-10-13 17:05:22.844000+00:00", "status_message": "", "status_reason": "", "last_worker": "xxx01:!2c1:cpu:10:service:0a750bd8a09b4063a59c96b4370d0815", "last_worker_report": "2022-10-30 15:23:18.695000+00:00", "last_update": "2022-10-30 15:23:18.695000+00:00", "last_change": "2022-10-30 15:23:18.695000+00:00", "last_iteration": 0, "last_metrics": { "29c6dd717a649f7c1835bfa9249b3142": { "028d9091618657f296222d768c3dd9b8": { "metric": ":monitor:machine", "variant": "network_rx_mbs", "value": 1.691, "min_value": -23.836, "max_value": 301.954 }, "1a760266c35f86529f9c669d539a2297": { "metric": ":monitor:machine", "variant": "io_read_mbs", "value": 0.201, "min_value": 0.0, "max_value": 919.899 }, "22db6a87b76b02b50d0a8c54879484ce": { "metric": ":monitor:machine", "variant": "io_write_mbs", "value": 1.312, "min_value": 0.279, "max_value": 2098.717 }, "3964adf302d5c935e9a2451b45bd53a5": { "metric": ":monitor:machine", "variant": "memory_free_gb", "value": 911.466, "min_value": 656.194, "max_value": 943.75 }, "5385df90d0d0ad8955159a5307d34b38": { "metric": ":monitor:machine", "variant": "cpu_usage", "value": 41.059, "min_value": 3.538, "max_value": 93.25 }, "5d2e34a3c7e733e0549fa6d9c9666ce3": { "metric": ":monitor:machine", "variant": "network_tx_mbs", "value": 1.741, "min_value": -334.512, "max_value": 291.802 }, "7e44abd211aa00a7c3bf5090fb33df90": { "metric": ":monitor:machine", "variant": "memory_used_gb", "value": 1.204, "min_value": 0.143, "max_value": 1.205 }, "f4f4fd050d744fb78fc0bb7b5a2a9f99": { "metric": ":monitor:machine", "variant": "disk_free_percent", "value": 44.5, "min_value": 42.6, "max_value": 51.5 } } }, "hyperparams": { "interactive_session": { "user_base_directory": { "section": "interactive_session", "name": "user_base_directory", "value": "~/", "type": "str" }, "ssh_server": { "section": "interactive_session", "name": "ssh_server", "value": "True", "type": "bool" }, "default_docker": { "section": "interactive_session", "name": "default_docker", "value": " ", "type": "str" }, "jupyterlab": { "section": "interactive_session", "name": "jupyterlab", "value": "True", "type": "bool" }, "vscode_server": { "section": "interactive_session", "name": "vscode_server", "value": "True", "type": "bool" }, "public_ip": { "section": "interactive_session", "name": "public_ip", "value": "False", "type": "bool" }, "ssh_ports": { "section": "interactive_session", "name": "ssh_ports", "value": "", "type": "str" }, "vscode_version": { "section": "interactive_session", "name": "vscode_version", "value": "", "type": "str" } }, "properties": { "external_address": { "section": "properties", "name": "external_address", "value": "" }, "internal_ssh_port": { "section": "properties", "name": "internal_ssh_port", "value": "" }, "jupyter_port": { "section": "properties", "name": "jupyter_port", "value": "" }, "internal_stable_ssh_port": { "section": "properties", "name": "internal_stable_ssh_port", "value": "" }, "vscode_port": { "section": "properties", "name": "vscode_port", "value": "" } } },
You should have the metric :monitor:gpu with the variant gpu_0_utilization.
Since I see you have none of those, that points to no GPU driver...
Could that be?
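A quick way to check that on a task object fetched as above (a sketch; assumes last_metrics is the nested dict shown earlier):
```python
# Does this task report any :monitor:gpu scalars at all?
# If not, the machine most likely has no (working) GPU driver.
has_gpu_metrics = any(
    variant.get("metric") == ":monitor:gpu"
    for variants in (task.last_metrics or {}).values()
    for variant in variants.values()
)
print("GPU metrics reported:", has_gpu_metrics)
```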
Exactly. I am trying to create an alert for tasks that have a GPU/CPU allocated but have not been utilizing it for some period of time.
So if the task is there, a GPU is allocated to it; I need to check whether the task is actually using it or just sitting idle.
I think the only reason you'll get that is if the returned payload was stripped somehow from the call result