Hey @<1523704157695905792:profile|VivaciousBadger56> , I was playing around with the Pipelines a while ago, and managed to create one where a few steps in the beginning create ClearML datasets like users_dataset , sessions_dataset , preferences_dataset , then I have a step which combines all 3, then an independent data-quality step which runs in parallel with the model training. Also, if you want to have some fun, you can try to parametrize your pipelines and run HPO on...
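If it helps, here's a rough sketch of what that DAG can look like with PipelineController (step names and bodies are made up for illustration):
```python
from clearml import PipelineController

# Toy step bodies; in the real pipeline each one builds/uploads a ClearML Dataset
def make_users_dataset():
    return "users-ds-id"

def make_sessions_dataset():
    return "sessions-ds-id"

def make_preferences_dataset():
    return "preferences-ds-id"

def combine_datasets(users_id, sessions_id, preferences_id):
    return "combined-ds-id"

def data_quality(dataset_id):
    print("running quality checks on", dataset_id)

def train_model(dataset_id):
    print("training on", dataset_id)

pipe = PipelineController(name="dataset-pipeline", project="examples", version="1.0.0")
for step, fn in [("users_dataset", make_users_dataset),
                 ("sessions_dataset", make_sessions_dataset),
                 ("preferences_dataset", make_preferences_dataset)]:
    pipe.add_function_step(name=step, function=fn, function_return=["dataset_id"])
pipe.add_function_step(
    name="combine",
    function=combine_datasets,
    function_kwargs={
        "users_id": "${users_dataset.dataset_id}",
        "sessions_id": "${sessions_dataset.dataset_id}",
        "preferences_id": "${preferences_dataset.dataset_id}",
    },
    function_return=["dataset_id"],
)
# data_quality and train share the same parent, so they run in parallel
pipe.add_function_step(name="data_quality", function=data_quality,
                       function_kwargs={"dataset_id": "${combine.dataset_id}"})
pipe.add_function_step(name="train", function=train_model,
                       function_kwargs={"dataset_id": "${combine.dataset_id}"})
pipe.start_locally(run_pipeline_steps_locally=True)
```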
Wait, my config looks a bit different, what clearml package version are you using?
This sounds like you don't have clearml installed in the ubuntu container. Either that, or the clearml.conf in the container is not pointing to the server, which is why all the information is missing.
I'd suggest changing the approach instead: set up a clearml-agent in docker mode, and when you want to run YOLOv5 training, execute it remotely on the queue that the agent is listening to.
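Roughly, the agent side is clearml-agent daemon --queue default --docker (queue name here is just an example), and the training script only needs something like:
```python
from clearml import Task

task = Task.init(project_name="yolo", task_name="yolov5-train")
# Stop the local run here and re-enqueue this task on the agent's queue;
# the agent re-executes it inside its configured docker image
task.execute_remotely(queue_name="default", exit_process=True)

# ... actual YOLOv5 training code below ...
```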
Ah, I see now. There are a couple of ways to achieve this (rough sketch after the list):
- You can enforce that the pipeline steps execute within a predefined docker image that has all these submodules - this is not very flexible, but doesn't require your clearml-agents to have access to your Git repository
- You can enforce that the pipeline steps execute within a predefined git repository, where you have all the code for these submodules - this is more flexible than option 1, but will require clearml-agents to have acce...
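Here's a minimal sketch of both options with PipelineDecorator.component (image/repo values are placeholders):
```python
from clearml.automation.controller import PipelineDecorator

# Option 1: pin the step to a docker image that already contains the submodules
@PipelineDecorator.component(docker="my-registry/my-image:latest")
def step_with_docker(x):
    return x

# Option 2: pin the step to a git repository that holds the submodule code
@PipelineDecorator.component(
    repo="https://github.com/my-org/my-repo.git",
    repo_branch="main",
)
def step_with_repo(x):
    return x
```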
Hey @<1639799308809146368:profile|TritePigeon86> , given that you want to retry on connection error, wouldn't it be easier to use retry_on_failure from PipelineController / PipelineDecorator.pipeline None ?
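Something along these lines (the retry policy itself is up to you):
```python
from clearml import PipelineController

# retry_on_failure accepts an int (retry count) or a callback invoked per failure
def retry_on_connection_error(pipeline, node, retries):
    # hypothetical policy: retry the failed node up to 3 times
    return retries < 3

pipe = PipelineController(
    name="my-pipeline", project="examples", version="1.0.0",
    retry_on_failure=retry_on_connection_error,  # or simply retry_on_failure=3
)
```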
Hey @<1639074542859063296:profile|StunningSwallow12> what exactly do you mean by "training in production"? Maybe you can elaborate what kind of models too.
ClearML in general assigns a unique Model ID to each model, but if you need some other way of versioning, we have support for custom tags, and you can apply those programmatically on the model
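For example, something like this (the model ID is a placeholder, and I'm assuming the tags property is settable in your clearml version):
```python
from clearml import Model

model = Model(model_id="<your-model-id>")  # placeholder ID from the UI
# append your own versioning tags alongside ClearML's unique Model ID
model.tags = (model.tags or []) + ["v1.2", "approved"]
```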
This is doing fine-tuning. Training a multi-billion-parameter model from scratch would be economically infeasible for most existing enterprises.
The second-to-last line in your code snippet above: pipe.start_locally .
Can you try checking if you have access to the model in the shared experiment?
This sounds like a use case for the enterprise version of ClearML. In it you can set read/write permissions. Publishing is considered a "write", so you can limit who can do it. Another thing that might be useful in your scenario is to try using "Reports": you can connect the "approved" experiments' info to a report and then publish it. Here's a short video introducing reports .
By the way, please note that if the experiment/report/whatever is publis...
That's not that much. You can use the AWS autoscaler and provision a spot g4dn GPU instance with a bit more disk. This should cost you less than 50 cents an hour
Hey @<1681836303299121152:profile|RoundElk14> , it seems you are using a self-hosted ClearML server. This error you're getting happens because your email is not configured in the server. Ask your admin to perform the following steps:
- [The admin] Go to Settings > Users & Groups > Users and click on "+ Add User" where they will be prompted to specify the user's email
- [The user] Once the admin confirms that they did step 1, the user should first Sign In with their email to the server
- [The...
Can you please attach the code for the pipeline?
To my knowledge, no. You'd have to create your own front-end and use the model served with clearml-serving via an API
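Your front-end would then just POST to the serving endpoint; assuming the default clearml-serving REST layout, roughly:
```python
import requests

# "my_model" is a placeholder for an endpoint registered with clearml-serving
resp = requests.post(
    "http://serving-host:8080/serve/my_model",
    json={"x0": 1.0, "x1": 2.0},  # payload shape depends on your preprocess code
)
print(resp.json())
```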
Hey Yasir, to use TensorFlow prefetch, your data needs to be (1) chunked and (2) stored on some server/bucket/network-attached FS. If either condition isn't satisfied, TF prefetch won't help you.
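If they are, the usual tf.data pattern looks like this (paths are placeholders):
```python
import tensorflow as tf

# chunked records (e.g. TFRecord shards) stored on a bucket stream well
files = tf.data.Dataset.list_files("gs://my-bucket/shards/*.tfrecord")
ds = (
    tf.data.TFRecordDataset(files)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # overlap loading the next batch with training
)
```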
How large is the dataset we're talking about?
Hello @<1604647689662763008:profile|PerfectSwan93> , I tend to agree with you, option one is the best given your use-case. If you keep the same name and project, it will result in a version bump on the combined dataset, but it will not point to the previous combined dataset as a parent.
I can't quite reproduce your issue. From the traceback it seems it has something to do with torch.load . I tried both your code snippet and creating a PyTorch model and then loading it, neither led to this error.
Could you provide a code snippet closer to the actual code that's causing the issue? Also, can you tell us which clearml version you're using, and what the Model URL is in the UI? You can use the same filters in the UI as the ones you used for Model.query_models to find th...
That seems strange. Could you provide a short code snippet that reproduces your issue?
Hey, yes, the reason for this issue seems to be our currently limited support for lightning 2.0. We will improve the support in the following releases. Right now, one way I can recommend to circumvent this issue is to use torch.save if possible, because we fully support automatic model capture on torch.save calls.
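I.e. something like this inside a running Task (toy model just for illustration):
```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for your lightning module's underlying model
# ClearML hooks torch.save, so this checkpoint is captured automatically
# as an output model of the currently running Task
torch.save(model.state_dict(), "checkpoint.pt")
```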
Glad I could be of help
Hey @<1577468626967990272:profile|PerplexedDolphin99> , yes, this method call will help you limit the number of files you have in your cache, but not the total size of your cache. To be able to control the size, I’d recommend checking the ~/clearml.conf file in the sdk.storage.cache section
Also, make sure you use Task.init instead of task.init
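I.e.:
```python
from clearml import Task

# Task.init is a classmethod on the Task class (capital T), not on an instance
task = Task.init(project_name="my_project", task_name="my_experiment")
```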
Ok, then launch an agent using clearml-agent daemon --queue default . That way, your steps will be sent to the agent for execution. Note that in this case, you shouldn't change your code snippet in any way.
Hey @<1671689458606411776:profile|StormySeaturtle98> we do support something called "Model Design" previews, basically an architecture description of the model, a la Caffe protobufs. None . For example, we store this info automatically with Keras.
You can create a new dataset and specify the parent datasets as all the previous ones. Is that something that would work for you?
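A minimal sketch (names and IDs are placeholders):
```python
from clearml import Dataset

combined = Dataset.create(
    dataset_name="combined_dataset",
    dataset_project="my_project",
    # IDs of all the previously created datasets
    parent_datasets=["<users_ds_id>", "<sessions_ds_id>", "<preferences_ds_id>"],
)
combined.upload()    # upload any newly added files
combined.finalize()  # close the version
```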
Hey @<1546303293918023680:profile|MiniatureRobin9> , to help narrow down the problem, could you try to manually download None and open it with pickle ?
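I.e. after downloading it manually, something like this (local filename is hypothetical):
```python
import pickle

with open("downloaded_artifact.pkl", "rb") as f:
    obj = pickle.load(f)  # if this fails, the file itself is likely corrupted
print(type(obj))
```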
Also, is your agent running on the same machine as your server and the example pipeline code? And what Python version are you using for all three components? Because I see there's a warning `could not locate requested Python version 3.11, reverting t...
Hey @<1529271085315395584:profile|AmusedCat74> , I may be wrong, but I think you can't attach a GPU to an e2 instance; it should be at least an n1, no?
Hey @<1535069219354316800:profile|PerplexedRaccoon19> , yes it does. Take a look at this example, and let me know if there are any more questions: None
To copy the artifacts please refer to docs here: None