SuccessfulKoala55 CostlyOstrich36 actually it is the import statement; I just finally got around to grabbing the traceback:
```
  File "/home/.../ccmlp/configs/mlops.py", line 4, in <module>
    from clearml import Task
  File "/home/.../.venv/lib/python3.8/site-packages/clearml/__init__.py", line 4, in <module>
    from .task import Task
  File "/home/.../.venv/lib/python3.8/site-packages/clearml/task.py", line 31, in <module>
    from .backend_interface.metrics import Metrics
  File "/home/......
```
Hey SuccessfulKoala55 ! Is the configuration file needed for `Task.running_locally()`? This is tightly related to issue #395, where we need additional files for remote execution but have no way to attach them to the task other than using the `StorageManager` as a temporary cache.
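Roughly, the workaround looks like this (a minimal sketch; the remote URL is just a placeholder):

```
from clearml import StorageManager

# Fetch an auxiliary file into the local cache at runtime, since there's
# no way to attach extra files to the task directly (see issue #395).
local_path = StorageManager.get_local_copy("s3://my-bucket/configs/extra.yaml")
```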
That's what I thought too, it should only look for the `CLEARML_TASK_ID` environment variable?
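i.e. something like this is what I'd expect it to boil down to (just a sketch of my assumption, not the actual implementation):

```
import os

# Assumption: the agent sets CLEARML_TASK_ID for remote runs, so a local
# run is simply one where the variable is absent.
running_locally = os.environ.get("CLEARML_TASK_ID") is None
```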
I see, okay that already clarifies some stuff, I'll dig a bit more into this then! Thanks!
The screenshot is small since the data is private anyway, but it's enough to see:
"Metric: untitled 00" "plot image" as the image title The attached histogram has a title ("histogram of ...")
We load the endpoint (and S3 credentials) from a `.env` file, so they're not immediately available at the time of `from clearml import Task`. It's a convenience thing, rather than exporting many environment variables that are tied together.
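For context, the ordering issue looks roughly like this (a sketch using python-dotenv; the variable names in the .env are placeholders):

```
from dotenv import load_dotenv

# The .env file holds the S3 endpoint and credentials, so it must be
# loaded before clearml reads its configuration on import.
load_dotenv()

from clearml import Task  # noqa: E402  (import deliberately delayed)
```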
I can also do this via Mongo directly, but I was hoping to skip the K8S interaction there.
I should maybe mention that the security requirements here are low, since this is all behind a private VPN anyway; I'm mostly just interested in having the credentials available for backtracking purposes.
Let me know if there's any additional information that can help SuccessfulKoala55 !
Full log:
```
command: /usr/sbin/helm --version=4.1.2 upgrade -i --reset-values --wait -f=/tmp/tmp77d9ecye.yml clearml clearml/clearml
msg: |-
  Failure when executing Helm command. Exited 1.
  stdout:
  stderr: W0728 09:23:47.076465 2345 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
  W0728 09:23:47.126364 2345 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unava...
```
On an unrelated note, when cloning an experiment via the WebUI, shouldn't the cloned experiment have the original experiment as its parent? The parent field seems to be empty.
I would expect the service to actually implicitly inject it into new instances prior to applying the user's extra configuration 🤔
-ish, still debugging some weird stuff. Sometimes ClearML picks `ip` and sometimes `ip2`, and I can't tell why 🤔
I am indeed
AgitatedDove14 for future reference, this is indeed a PEP-610 related bug, fixed in https://python-poetry.org/blog/announcing-poetry-1.2.0a1/ . I see we can choose the `pip` version in the config; can we also set the `poetry` version used? Or is it updated from the lock file itself, or...?
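The pip pinning I mean is this bit of clearml.conf on the agent side (the version value is just an example); whether an analogous key exists for poetry is exactly my question:

```
agent {
    package_manager {
        # poetry mode: the agent installs from the lock file
        type: poetry
        # pin the pip version the agent uses
        pip_version: "<20.2"
    }
}
```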
Sure, for example when reporting HTML files:
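Something along these lines (a minimal sketch; the file name and iteration are placeholders):

```
from clearml import Task

# Report a locally saved HTML file as a media artifact.
task = Task.current_task()
task.get_logger().report_media(
    title="reports",
    series="summary",
    iteration=0,
    local_path="report.html",
)
```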
Does that make sense CostlyOstrich36 ? Any thoughts on how to treat this? For the time being I'm also perfectly happy to include something specific to `extra_clearml_conf`, but I'm not sure how to set `sdk.aws.s3.credentials` to be a list of dictionaries as needed.
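This is the shape I'm after (a sketch of the clearml.conf structure; bucket names and keys are placeholders):

```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    bucket: "bucket-a"
                    key: "AAA..."
                    secret: "xxx"
                },
                {
                    bucket: "bucket-b"
                    key: "BBB..."
                    secret: "yyy"
                },
            ]
        }
    }
}
```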
The error seems to come from this line: `self._driver = _FileStorageDriver(str(path_driver_uri.root))` (line #353 in `clearml/storage/helper.py`). If the `path_driver` is a local path, then the `_FileStorageDriver` starts with `base_path = '/'`, and then takes an extremely long time iterating over the entire file system (e.g. in `_get_objects`, line #1931 in `helper.py`).
It's also sufficient to see that `StorageManager.list("/data/clear")` takes a really long time to return no results.
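A minimal repro (assuming /data/clear is empty or doesn't exist):

```
import time

from clearml import StorageManager

start = time.time()
result = StorageManager.list("/data/clear")  # ends up walking from '/'
print(f"got {result!r} after {time.time() - start:.1f}s")
```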
proj_suffix = "" i = 2 while Task.get_project_id(f"{proj_name}{proj_suffix}") is not None: tasks = Task.get_tasks(project_name=f"{proj_name}{proj_suffix}") if not [task for task in tasks if not task.get_archived()]: # Empty project, we can use this one... break proj_suffix = f"_{i}" i += 1
I think now there's the following:
- Resource type
- Queue (name) defines resource + max instances

And I'm looking for:
- Resource type
- "pool" of resources (type + max instances)
- A pool can be shared among queues
This could be relevant SuccessfulKoala55 ; it might point to a serious bug in ClearML's multiprocessing handling too - https://stackoverflow.com/questions/45665991/multiprocessing-returns-too-many-open-files-but-using-with-as-fixes-it-wh
Yup! Seems to have been some brief unavailability for some reason
The title is specified in the plot (see the example, even if small).
I'm just creating a figure normally with matplotlib and saving it to disk.
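Roughly like this (a minimal sketch; the data is made up):

```
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist([1, 2, 2, 3, 3, 3])
ax.set_title("histogram of ...")  # the title I'd expect to show up
fig.savefig("histogram.png")
```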
Hm. Is there a simple way to test tasks, one at a time?
No, I have no running agents listening to that queue. It's as if it's retained in some memory somewhere and the server keeps creating it.