Yup, latest version of the ClearML SDK, and we're deployed on AWS using the K8s Helm chart
i.e.
ERROR Fetching experiments failed. Reason: Backend timeout (600s)
ERROR Fetching experiments failed. Reason: Invalid project ID
Ultimately we're trying to avoid Docker in the AWS autoscaler (virtualization on top of virtualization seems redundant); instead we maintain an AMI for a faster boot sequence.
We had no issues when we used pip, but all these issues came up once we tried to work with poetry.
The way I understand poetry to work is that one system-wide installation is expected, which is then used for virtual environment creation and manipulation. So at least it may be desired that the ...
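For illustration, the model I have in mind (a standard poetry setup; the commands assume pipx is available on the machine):
pipx install poetry                         # one system-wide poetry install
poetry config virtualenvs.in-project true   # optional: keep the venv inside the repo
poetry install                              # resolve pyproject.toml, create/update the venv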
Okay, I'll test it out by trying to downgrade to 4.0.0 and then upgrade to 4.1.2
Just to make sure, the chart_ref is allegroai/clearml, right? (For some reason we had clearml/clearml, and it seems like it previously worked?)
So some UI that shows the contents of users.get_all?
Nothing I can spot --
ClearML results page:
ClearML pipeline page:
Launching the next 2 steps
Launching step [...]
Launching step [...]
Launching step: ...
Parameters:
{...}
Configurations:
{}
Overrides:
{}
Launching step: ...
Parameters:
{...}
Configurations:
{}
Overrides:
{}
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2023-02-21 13:53:48
ClearML Monitor: Could not detect iteration reporting, falling back to itera...
I just used this to create the dual_gpu queue:
clearml-agent daemon --queue dual_gpu --create-queue --gpus 0,1 --detached
Last but not least - can I cancel the offline zip creation if I'm not interested in it 🤔
EDIT: I see it's not possible; I guess one has to patch ZipFile ...
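For reference, the kind of patch I mean (an untested sketch, assuming the offline package is written via zipfile.ZipFile when the task closes):
import zipfile
from unittest import mock
from clearml import Task

task = Task.current_task()
# Untested assumption: with ZipFile stubbed out, the offline zip
# creation triggered at close time becomes a no-op
with mock.patch.object(zipfile, "ZipFile", autospec=True):
    task.close()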
Consider, e.g.:
# steps.py
class DataFetchingStep:
    def __init__(self, source, query, locations, timestamps):
        ...

    def run(self, queue=None, **kwargs):
        ...


class DataTransformationStep:
    def __init__(self, inputs, transformations):
        # inputs can include instances of DataFetchingStep, or local files, for example
        ...

    def run(self, queue=None, **kwargs):
        ...
And then the following SDK usage in a notebook:
from steps imp...
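Roughly along these lines (illustrative sketch; the argument values and queue name are hypothetical):
from steps import DataFetchingStep, DataTransformationStep

fetch = DataFetchingStep(
    source="s3://some-bucket/raw",            # hypothetical source
    query="SELECT *",
    locations=["eu-west-1"],
    timestamps=["2023-01-01", "2023-02-01"],
)
transform = DataTransformationStep(
    inputs=[fetch, "local_file.csv"],          # steps or local files
    transformations=[lambda df: df.dropna()],  # hypothetical transformation
)
transform.run(queue="default")                 # hypothetical queue name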
Since the additional credentials are available to the autoscaler when it boots up (via the config file), I thought it could use those natively?
For the former (static-ish environment variables), just add:
environment {
    VAR1: value1
    VAR2: value2
}
to the agent’s clearml.conf
- The api.files_server is set to the MinIO endpoint s3://ip:9000/clearml (both locally and remotely)
- The sdk.development.default_output_uri is set to the MinIO endpoint (both locally and remotely)
- When we call Task.init I do not set the output_uri at all
- I get the logger directly with task.get_logger()
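In clearml.conf terms, that's effectively:
api {
    files_server: "s3://ip:9000/clearml"
}
sdk {
    development {
        default_output_uri: "s3://ip:9000/clearml"
    }
}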
It is installed on the pipeline creating the machine.
I have no idea why it did not automatically detect it 😞
Is there a way to specify that flag within the config file, SuccessfulKoala55?
I’ve tracked it down further; it seems the pigar utility does not apply any smart logic there.
The case we have is the following:
- We have a monorepo, but all modules/libs share a common namespace foo; so e.g. working on module mod, we use from foo.mod import …
- This then looks for a module called foo, even though it’s just a namespace
- In the dist-info requirement, it seems any hyphen, dot, etc. are swapped for an underscore, so our site-packages represents this as `foo_m...
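For illustration, the layout is roughly this (assuming implicit namespace packages, PEP 420):
foo/                 # shared namespace: note there is no foo/__init__.py
    mod/
        __init__.py  # the actual module code (installed as e.g. a foo-mod distribution)
so from foo.mod import … resolves through the namespace path, not through a single installed foo distribution.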
The Task.init is called at a later stage of the process, so I think this relates again to the whole setup process we've been discussing both here and in #340... I promise to try ;)
Latest (1.5.1 I believe?), full log incoming, but it's like I've posted elsewhere already 🤔
It just sets up the environment and immediately crashes when trying to run the code.
The setup itself is done correctly.
We have a more complicated case but I'll work around it 😄
Follow up though - can configuration objects refer to one-another internally in ClearML?
But... which queue does it listen to, and which type of instances will it use, etc.?
Oh! Nice! I'll have a go at it and report back at the PR if it's in a functional state 🙂 Thanks AgitatedDove14 !
Yeah I will probably end up archiving them for the time being (or deleting if possible?).
Otherwise (regarding the code question), I think it’s better if we continue the original thread, as it has a sample code snippet to illustrate what I’m trying to do.
Note that it would succeed if e.g. run with pytest -s
If everything is managed with a git repo, does this also mean PRs will have a messy metadata file attached to them?
nevermind! Found and answered (solution in the issue linked above)
Thanks AgitatedDove14 , I'll give it a try. Perhaps additional documentation is needed for that extra_layout