Could you please run the misbehaving example, add a breakpoint in clearml/backend_interface/task/task.py in Task.update_output_model on the line with url = output_model.update_weights(, and tell me what the value of model_path is? In case you're using virtual environments, the clearml library should be installed somewhere in <virtual env directory>/lib/python3.10/site-packages/clearml/
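If it helps, a rough sketch of what I mean (the exact location inside the file depends on the clearml version you have installed):
```python
# Inside <virtual env directory>/lib/python3.10/site-packages/clearml/backend_interface/task/task.py,
# in Task.update_output_model, right above the existing line
# `url = output_model.update_weights(` you could add:

print("model_path =", model_path)   # quick inspection of the value we're after
# or, to step through interactively:
# import pdb; pdb.set_trace()
```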
The issue may be related to the fact that right now we have some edge cases when working with lightning >= 2.0; we should have better support in the upcoming release.
Hey @<1564422650187485184:profile|ScaryDeer25> , we just released clearml==1.11.1rc2, which should solve the compatibility issues for lightning >= 2.0. Can you install it and check whether it solves your problem?
Ah, I see now. There are a couple of ways to achieve this.
- You can enforce that the pipeline steps execute within a predefined docker image that has all these submodules - this is not very flexible, but doesn't require your clearml-agents to have access to your Git repository (see the sketch after this list)
- You can enforce that the pipeline steps execute within a predefined git repository, where you have all the code for these submodules - this is more flexible than option 1, but will require clearml-agents to have acce...
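For option 1, a rough sketch could look like this (project, image and queue names are just placeholders, and parameter names may differ slightly between clearml versions):
```python
from clearml import PipelineController

# hypothetical pipeline - "my_project" and the docker image are placeholders
pipe = PipelineController(name="pipeline-with-submodules", project="my_project")

def preprocess():
    # this step can import the submodules baked into the docker image below
    ...

pipe.add_function_step(
    name="preprocess",
    function=preprocess,
    docker="my-registry/ml-base:latest",  # predefined image containing the submodules
    execution_queue="default",
)

pipe.start(queue="services")
```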
Hey @<1569858449813016576:profile|JumpyRaven4> , about your first point, what exactly is the question?
About your second point - you can try to manually save the final model and give it a proper file name; that way we will show it in the UI with the name you provided. Make sure to use xgboost.save_model and not raw pickle.
For your final question, given that your models have customised code, I can suggest trying to use clearml.OutputModel, which will register the file you provide ...
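Something along these lines, as a minimal sketch (task/model/file names are placeholders, and I'm assuming X_train / y_train already exist):
```python
import xgboost as xgb
from clearml import Task, OutputModel

task = Task.init(project_name="my_project", task_name="xgb-training")  # placeholder names

booster = xgb.XGBClassifier().fit(X_train, y_train)  # assumes your training data is loaded

# save with xgboost's own serialization (not raw pickle), using the name you want in the UI
booster.save_model("final_model.json")

# register the saved file as an output model of the task
output_model = OutputModel(task=task, name="final_model", framework="xgboost")
output_model.update_weights(weights_filename="final_model.json")
```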
Hey @<1603198163143888896:profile|LonelyKangaroo55> If you only use the summary writer, does it report properly to both TB and ClearML?
Hey @<1654294828365647872:profile|GorgeousShrimp11> , can you abort all pending experiments that are waiting to be fetched from this queue and try again? Off the top of my head it could be that the clearml-agent can’t pull the custom docker image. In general you should treat the docker images not as step definitions but only as the environment, hence setting the entrypoint is not necessary.
Which gives me an idea. Could you please remove the entrypoint from the docker image altogether and try again?
Overriding the entrypoint in the image can lead to docker run/docker exec failing to work properly, because instead of a shell it will use your entrypoint to run everything.
Then change from git+ssh to git+https
Hey @<1545216070686609408:profile|EnthusiasticCow4> , for requirements pointing to packages in git repositories you need to make sure that the environment the agent is running in has valid credentials to access the repo. In your case (git+ssh) it means you need to have a pair of SSH keys, and the public key should be registered with the repo.
If your git credentials are stored in the agent's clearml.conf, it means these are an HTTPS username/password pair. But you specified that the package should be downloaded via git SSH, for which I assume you don't have credentials in the agent's environment. So it can't authenticate with SSH, and pip doesn't know how to switch from git+ssh to git+https, because the downloading of the package is done by pip, not by clearml.
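In the requirements the change would look roughly like this (the repository URL below is just a placeholder):
```
# before - needs SSH keys available in the agent's environment
git+ssh://git@github.com/your-org/your-package.git

# after - pip can use the HTTPS credentials from the agent's clearml.conf
git+https://github.com/your-org/your-package.git
```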
And there probably are auth errors if you scroll through the entire log ...
Can you also tell us what OS you are using? And when you mentioned clearml version: 1.5.1, did you mean the ClearML package or the clearml-agent package? Because they are different.
Hey @<1526734437587357696:profile|ShaggySquirrel23> , what version of the clearml-agent are you using? Also, if I were you I’d check how much free disk space there is on the machine running the agents.
I can't quite reproduce your issue. From the traceback it seems it has something to do with torch.load. I tried both your code snippet and creating a PyTorch model and then loading it; neither led to this error.
Could you provide a code snippet that is closer to the code that is causing the issue? Also, can you please tell us what clearml version you are using, and what the Model URL is in the UI? You can use the same filters in the UI as the ones you used for Model.query_models to find th...
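A minimal sketch of what I mean (the filter values are placeholders, use the ones you already have):
```python
from clearml import Model

# same filters you used for Model.query_models before
models = Model.query_models(project_name="my_project", model_name="my_model")
for model in models:
    print(model.id, model.url)  # the model URL as shown in the UI
```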
Can you please attach the full traceback here?
To link a dataset to a task you need to pass the alias= parameter to Dataset.get. See here: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#accessing-datasets
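For example (dataset project/name and alias are placeholders):
```python
from clearml import Dataset

# passing alias= while a task is running links the dataset to that task
dataset = Dataset.get(
    dataset_project="my_project",
    dataset_name="my_dataset",
    alias="my_dataset_alias",
)
local_path = dataset.get_local_copy()
```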
Hey Pawel, thanks for opening the PR on Ultralytics’ side. The full support should come from them, so if it’s missing for YOLOv8 it means they didn’t enable it. Still, you can try clearml-task for auto-logging support in case of remote execution.
Also, I’d say you could easily use a ClearML dataset ID as input to YOLOv8 with a few lines of code, by basically downloading/getting the dataset by ID yourself and passing the path to it as input to the ultralytics...
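Roughly like this (the dataset ID is a placeholder, and I'm assuming the dataset contains a YOLO-style data.yaml - adjust to your layout):
```python
from clearml import Dataset
from ultralytics import YOLO

# download a local copy of the ClearML dataset by ID
data_path = Dataset.get(dataset_id="your_dataset_id").get_local_copy()

# point ultralytics at the downloaded folder
model = YOLO("yolov8n.pt")
model.train(data=f"{data_path}/data.yaml", epochs=10)
```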
Hey @<1639074542859063296:profile|StunningSwallow12> what exactly do you mean by "training in production"? Maybe you can also elaborate on what kind of models you mean.
ClearML in general assigns a unique Model ID to each model, but if you need some other way of versioning, we have support for custom tags, and you can apply those programmatically on the model.
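For example, something like this (project/task/tag names are placeholders):
```python
from clearml import Task, OutputModel

task = Task.init(project_name="my_project", task_name="train")

# attach your own versioning scheme as tags when registering the model
output_model = OutputModel(task=task, name="my_model", tags=["v1.2.0", "production-candidate"])
output_model.update_weights(weights_filename="model.pt")  # assumes this file was saved earlier
```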
Hey @<1577468626967990272:profile|PerplexedDolphin99> , yes, this method call will help you limit the number of files you have in your cache, but not the total size of your cache. To be able to control the size, I’d recommend checking the ~/clearml.conf file in the sdk.storage.cache section.
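The relevant part looks roughly like this (key names are taken from a default config I have at hand, so double-check them against your own clearml.conf):
```
sdk {
    storage {
        cache {
            # where cached datasets/models are stored
            default_base_dir: "~/.clearml/cache"
            # maximum number of cached entries kept per cache context
            # default_cache_manager_size: 100
        }
    }
}
```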
Yes, that is correct. Btw, now it looks more like my clearml.conf
Hey @<1535069219354316800:profile|PerplexedRaccoon19> , yes it does. Take a look at this example, and let me know if there are any more questions: None
Yes, you can do that. But it may make it harder to identify the task later on
To copy the artifacts please refer to docs here: None
Hey, yes, the reason for this issue seems to be our currently limited support for lightning 2.0. We will improve the support in the following releases. Right now, one way I can recommend to circumvent this issue is to use torch.save if possible, because we fully support automatic model capture on torch.save calls.
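For example (project/task names are placeholders; the save just needs to happen while a ClearML task is active):
```python
import torch
from clearml import Task

task = Task.init(project_name="my_project", task_name="lightning-training")

# ... your lightning / torch training runs here ...

# saving via torch.save is captured automatically and registered as an output model
torch.save(model.state_dict(), "final_model.pt")  # assumes `model` is your torch.nn.Module
```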
This sounds like you don't have clearml installed in the Ubuntu container. Either that, or your clearml.conf in the container is not pointing to the server; as a result, all information is missing.
I'd rather suggest you change the approach and run a clearml-agent set up with docker, and when you want to run YOLOv5 training you actually execute it remotely on the queue that the agent is listening to.
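As a rough sketch of the flow (queue and project names are illustrative):
```python
# on the GPU machine, start an agent in docker mode (run in a shell):
#   clearml-agent daemon --queue default --docker

# then, at the top of your YOLOv5 training script:
from clearml import Task

task = Task.init(project_name="yolov5", task_name="train")
# stop local execution here and enqueue the task for the agent to run inside docker
task.execute_remotely(queue_name="default")

# ...the rest of the training code executes on the agent...
```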
To my knowledge, no. You'd have to create your own front-end and use the model served with clearml-serving via an API
Are you referring to the clearml-serving project?
This is doing fine-tuning. Training a multi-billion parameter model from scratch would be economically unfeasible for most existing enterprises.