Thanks for pointing this out, we will need to update our documentation. Still, if you manually inspect the ~/clearml.conf file, you will see the available configuration options
Can you update clearml to the latest version (1.11.1) and see whether the issue is fixed?
Yes, that is correct. Btw, now it looks more like my clearml.conf
Sounds interesting. But my main concern with this kind of approach is that if the (hparam1, hparam2, objective_fn_score) surface is non-convex, your method may not reach the best set of hyperparameters. Maybe try smarter search algorithms, like BOHB or TPE, if you have a large search space; otherwise, you can do a few rounds of manual random search, narrowing the search space around the most promising region after every round.
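For example, a minimal TPE sketch with Optuna could look like this (`train_and_eval` is a hypothetical stand-in for your training code; ClearML's HyperParameterOptimizer can also use Optuna as a backend):

```python
import optuna

def train_and_eval(hparam1, hparam2):
    # Hypothetical training function -- replace with your real training
    # code; returns a dummy score here just so the sketch runs.
    return -(hparam1 - 0.01) ** 2 - (hparam2 - 64) ** 2

def objective(trial):
    hparam1 = trial.suggest_float("hparam1", 1e-5, 1e-1, log=True)
    hparam2 = trial.suggest_int("hparam2", 16, 256)
    return train_and_eval(hparam1, hparam2)

# TPE is Optuna's default sampler; made explicit here for clarity.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)
print(study.best_params)
```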
As for why struct...
Can you please attach the full traceback here?
Hey @<1523704757024198656:profile|MysteriousWalrus11>, given your use case, did you consider passing the path to the dataset? Like an address to an S3 bucket
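Something along these lines, just as a sketch (bucket path and names are placeholders):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="train")
# Pass the dataset location as a task parameter instead of uploading the data:
params = task.connect({"dataset_path": "s3://my-bucket/datasets/v1"})
print(params["dataset_path"])
```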
And how many agents do you have listening on the “services” queue?
Yes, works with GCP too
I’m afraid serializing an entire class won’t be possible, but create_function_task will send the entire environment for remote execution, so you can still access your code
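For reference, a rough sketch of what I mean (function and names are illustrative; check the docs for the exact signature):

```python
from clearml import Task

def process(multiplier=2):
    # Your code is still importable here, since the whole
    # environment is packaged for remote execution.
    return 21 * multiplier

task = Task.init(project_name="examples", task_name="main")
# Creates a new task that runs `process` remotely, with kwargs
# forwarded as the function's arguments:
func_task = task.create_function_task(process, func_name="process", multiplier=2)
```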
It happens due to an internal use of Dataset.get; the larger the dataset, the more verbose it will be. We’ll fix this in the upcoming releases
Hey @<1603198163143888896:profile|LonelyKangaroo55> If you only use the summary writer, does it report properly to both TB and ClearML?
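i.e. something like this, where ClearML should pick up the TB scalars automatically once Task.init runs (names are placeholders):

```python
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

task = Task.init(project_name="examples", task_name="tb-check")
writer = SummaryWriter()
for step in range(10):
    # Should show up both in TensorBoard and in the ClearML UI scalars:
    writer.add_scalar("loss", 1.0 / (step + 1), global_step=step)
writer.close()
```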
Hey @<1523705721235968000:profile|GrittyStarfish67> , we have just released 1.12.1 with a fix for this issue
I think you can set the CUDA version in clearml.conf. Alternatively, you can have the agent use a docker image with your required version of CUDA instead of setting the environment directly on the machine
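Roughly like this in clearml.conf (values are illustrative; double-check the key names against your conf file):

```
agent {
    # Force the CUDA/cuDNN versions the agent assumes:
    cuda_version: "11.8"
    cudnn_version: "8.6"

    # Or run tasks inside a docker image that ships the right CUDA:
    default_docker {
        image: "nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04"
    }
}
```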
Hey @<1547390438648844288:profile|ScaryJellyfish75>, can you provide the whole code for the pipeline, and also mention which clearml version you are using?
Hey Sana, yes you can. When you open the link, check the Task's menu bar on the upper-right side, and you will notice that you can clone the shared task.
Do you know whether the agent VM/image has Python 3.9 installed? Also, you emphasised that this happens when setting the package manager to poetry. Does that mean the issue doesn’t happen when leaving the package manager settings at their default values?
Can you paste here the code of the pipeline that you're trying to run?
And the quota is not cumulative, otherwise we’d run out of storage with the oldest accounts 😃
Is this a jupyter notebook or something? Can you download it properly as either a .ipynb or .py file?
Can you please check with the latest 1.10.2 SDK version whether the checkpointing issue still happens? As for the example code that couldn't be reproduced, we're already working on it and should have a fix in the next minor SDK version
This sounds like you don't have clearml installed in the ubuntu container. Either that, or your clearml.conf in the container is not pointing to the server, and as a result all information is missing.
I'd rather suggest you change the approach: set up a clearml-agent with docker, and when you want to run YOLOv5 training, execute it remotely on the queue the agent is listening to
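A minimal sketch of that flow (queue name and docker image are placeholders):

```python
# Start the agent in docker mode first (shell command, shown here as a comment):
#   clearml-agent daemon --queue gpu --docker nvidia/cuda:11.8.0-runtime-ubuntu22.04
from clearml import Task

task = Task.init(project_name="YOLOv5", task_name="train")
# Stops the local run and enqueues the task for the agent to execute:
task.execute_remotely(queue_name="gpu")
```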
To link a dataset to a task you need to pass the alias= parameter to Dataset.get. See here: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#accessing-datasets
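E.g. (dataset name and alias are placeholders):

```python
from clearml import Dataset

# The alias registers the dataset under the currently running
# task's configuration, linking the two together:
ds = Dataset.get(dataset_name="my_dataset", alias="my_dataset")
local_path = ds.get_local_copy()
```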
Hey @<1654294828365647872:profile|GorgeousShrimp11> can you abort all pending experiments waiting to be fetched from this queue and try again? Off the top of my head, it could be that the clearml-agent can’t pull the custom docker image. In general you should treat docker images not as step definitions but only as the environment, hence setting the entrypoint is not necessary
Glad I could be of help
This sounds like a use case for the enterprise version of ClearML. In it you can set read/write permissions. Publishing is considered a "write", so you can limit who can do it. Another thing that might be useful in your scenario is to try using "Reports", and connect the "approved" experiments info to a report and then publish it. Here's a short video introducing reports.
By the way, please note that if the experiment/report/whatever is publis...
You can try to add the force_download=True flag to .get() to ignore the locally cached content. Let me know if it helps.
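Assuming this is a task artifact, something like this (task id and artifact name are placeholders):

```python
from clearml import Task

task = Task.get_task(task_id="<your_task_id>")
# force_download=True bypasses the local cache and re-downloads:
obj = task.artifacts["preprocessed_data"].get(force_download=True)
```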
Which gives me an idea. Could you please remove the entrypoint from the docker image altogether and try again?
Overriding the entrypoint in the image can lead to docker run/docker exec failing to work properly, because instead of a shell it will use your entrypoint to run everything
Also, make sure you use Task.init instead of task.init
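i.e. (project/task names are just examples):

```python
from clearml import Task

# Task (capital T) is the class; Task.init is the correct entry point:
task = Task.init(project_name="examples", task_name="my-experiment")
```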