Hey @<1523701066867150848:profile|JitteryCoyote63> , could you please open a GH issue on our repo too, so that we can more effectively track this issue. We are working on it now btw
What happens if you comment out or remove the `pipe.set_default_execution_queue('default')` call and use `run_locally` instead of `start_locally`?
Because in the current setup, you are basically asking to run the pipeline controller task locally, while the rest of the steps need to run on an agent machine. If you make the changes I suggested above, you will be able to run everything on your local machine.
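For reference, a minimal sketch of a pipeline that runs entirely in the local process (the step logic, names and project are placeholders; with a `PipelineController` object the equivalent would be `pipe.start_locally(run_pipeline_steps_locally=True)`):

```python
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def step_one(x):
    # placeholder step logic
    return x + 1

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.0.1")
def my_pipeline():
    return step_one(1)

if __name__ == "__main__":
    # run the controller and all the steps in this process, no agent or queue needed
    PipelineDecorator.run_locally()
    my_pipeline()
```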
Ok, then launch an agent using `clearml-agent daemon --queue default`; that way your steps will be sent to the agent for execution. Note that in this case, you shouldn't change your code snippet in any way.
The line before the last in your code snippet above: `pipe.start_locally`.
Do you know whether the agent VM/image has Python 3.9 installed? Also, you emphasised that this happens when setting the package manager to poetry; does that mean the issue doesn't happen when the package manager settings are left at their default values?
Hello @<1523710243865890816:profile|QuaintPelican38>, could you try `Dataset.get` on an existing dataset and tell us whether there are any errors?
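Something along these lines (project and dataset names are placeholders):

```python
from clearml import Dataset

# fetch an existing dataset and pull a cached local copy of it
ds = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
print(ds.id)
local_path = ds.get_local_copy()
print(local_path)
```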
Could you please run the misbehaving example, add a breakpoint in `clearml/backend_interface/task/task.py` inside `Task.update_output_model` on the line with `url = output_model.update_weights(`, and tell me what the value of `model_path` is? In case you're using virtual environments, the clearml library should be installed somewhere in `<virtual env directory>/lib/python3.10/site-packages/clearml/`.
Hello @<1604647689662763008:profile|PerfectSwan93>, I tend to agree with you, option one is the best given your use-case. If you keep the same name and project it will result in a version bump on the combined dataset, but it will not point to the previous combined dataset as a parent.
Hey @<1523701949617147904:profile|PricklyRaven28> , about the S3 loading issue. The path to the model in the artifact tab, is it an S3 bucket or a local path?
You can create a new dataset and specify the parent datasets as all the previous ones. Is that something that would work for you?
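Roughly like this (names and ids are placeholders):

```python
from clearml import Dataset

# create a new dataset version whose parents are all the previous datasets
combined = Dataset.create(
    dataset_name="combined_dataset",
    dataset_project="my_project",
    parent_datasets=["<dataset_id_1>", "<dataset_id_2>"],
)
combined.upload()
combined.finalize()
```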
Hey Sana, yes you can. When you open the link, check the Task's menu bar on the upper-right side, and you will notice that you can clone the shared task.
Can you try checking if you have access to the model in the shared experiment?
Do you mean that you want your published experiments to be either “approved” or “not approved” based on the presence of the attachments you mentioned?
You can try adding the `force_download=True` flag to `.get()` to ignore the locally cached content. Let me know if it helps.
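Something like this, assuming a recent SDK where `Artifact.get()` accepts `force_download` (the task id and artifact name are placeholders):

```python
from clearml import Task

task = Task.get_task(task_id="<task_id>")
# re-download the artifact instead of reusing the locally cached copy
obj = task.artifacts["my_artifact"].get(force_download=True)
```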
Also, make sure you use `Task.init` instead of `task.init`.
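i.e. the class method on `Task`, for example:

```python
from clearml import Task

# Task.init (capital T) creates/initializes the task for this run
task = Task.init(project_name="examples", task_name="my experiment")
```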
Glad I could be of help
Hey @<1523704757024198656:profile|MysteriousWalrus11>, given your use case, did you consider passing the path to the dataset instead, e.g. an S3 bucket URI?
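For example, something along these lines, passing the URI as a plain string parameter and fetching a local copy only where the data is needed (the bucket and key are placeholders):

```python
from clearml import StorageManager

dataset_uri = "s3://my-bucket/datasets/train.csv"  # passed around as a pipeline parameter
# download (and cache) a local copy where the step actually consumes the data
local_copy = StorageManager.get_local_copy(remote_url=dataset_uri)
```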
Hey @<1546303293918023680:profile|MiniatureRobin9>, to help narrow down the problem, could you try to manually download None and open it with `pickle`?
Also, is your agent running on the same machine as your server and the example pipeline code? And what Python version are you using for all three components? Because I see there's a warning `could not locate requested Python version 3.11, reverting t...`
Hey @<1547390444877385728:profile|ThickSnake12> , how exactly do you access the artifact next time? Can you provide a code sample?
Hey @<1554275802437128192:profile|CumbersomeBee33>, aborted usually means that someone manually stopped the pipeline or one of its experiments. Can you provide us with the code you used to run it?
That is not specific enough. Can you show the code? And ideally also the console log of the pipeline.
Is this a Jupyter notebook or something? Can you download it properly as either a .ipynb or .py file?
Hey, yes, the reason for this issue seems to be our currently limited support for Lightning 2.0. We will improve the support in the following releases. Right now, one way to circumvent this issue is to use `torch.save` if possible, because we fully support automatic model capture on `torch.save` calls.
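For example, with `Task.init` in place a plain `torch.save` call should be picked up automatically as an output model (the model and file name here are placeholders):

```python
import torch
from clearml import Task

task = Task.init(project_name="examples", task_name="lightning-workaround")

model = torch.nn.Linear(10, 2)  # stand-in for your trained model
# the saved checkpoint is captured by clearml as an output model
torch.save(model.state_dict(), "checkpoint.pt")
```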
I think you can set the CUDA version in `clearml.conf`; alternatively, you can have the agent use a docker image with your required version of CUDA instead of setting the environment directly on the machine.
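Something along these lines in the `agent` section of `clearml.conf` (version numbers and the docker image are placeholders; please double-check the exact keys against the agent docs for your version):

```
agent {
    # CUDA / cuDNN version the agent should assume is available
    cuda_version: "11.8"
    cudnn_version: "8.6"

    # alternatively, run in docker mode with an image that already ships the CUDA you need
    default_docker: {
        image: "nvidia/cuda:11.8.0-runtime-ubuntu22.04"
    }
}
```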
That seems strange. Could you provide a short code snippet that reproduces your issue?
Can you please check with the latest 1.10.2 SDK version whether the checkpointing issue still happens? As for the example code that couldn't be reproduced, we're already working on it and should have a fix in the next minor SDK version.
Hey @<1535069219354316800:profile|PerplexedRaccoon19> , yes it does. Take a look at this example, and let me know if there are any more questions: None
To copy the artifacts, please refer to the docs here: None