Alright, I fixed the issue with the scheduler eating itself. But now I'm still getting the same bug as two days ago: the Scheduler process starts fine and doesn't "crash," but I don't get the config object in the web-app again. It seems to work if I run it locally.
To answer your earlier question, I'm using the app.clear.ml portal, so:
- WebApp: 3.20.1-1525
- Server: 3.20.1-1299
- API: 2.28
- And my Python ClearML version: 1.14
Hyperdatasets are the only ones that require a premium plan. If you're using normal datasets, you should be fine.
Oh, I get what's happening. That segment of the code is rerun when the task is enqueued remotely, so the task is deleting itself. This also explains why it works fine locally. It's an ouroboros.
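One way I could break the loop (a sketch, assuming the deletion logic can simply be skipped when an agent re-executes the script; Task.running_locally() is the check I have in mind, and cleanup_old_tasks is a placeholder name):

from clearml import Task

def cleanup_old_tasks():
    # placeholder for the deletion logic that was eating the scheduler
    ...

# running_locally() is False when an agent re-runs the script remotely,
# so this branch is skipped there and the task can no longer delete itself
if Task.running_locally():
    cleanup_old_tasks()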
I will open a GitHub issue. Is this part open source? Could I make a PR?
In the meantime, I still need to implement this with the current version of ClearML. So the only way would be to have one variable per parent? Is there a smarter way to work around it?
I think the PR is a good idea. I read the contribution guidelines, and they mention referencing an issue. Did you want me to duplicate this issue on the repo, or is it enough to link to this thread?
Hi again @<1523701435869433856:profile|SmugDolphin23> ,
The approach you suggested seems to be working, albeit with one issue. It does correctly identify the different versions of the dataset when new data is added, but I get an error when I try to finalize the dataset:
Code:
if self.task:
# get the parent dataset from the project
parent = self.clearml_dataset = Dataset.get(
dataset_name="[LTV] Dataset",
dataset_project=...
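For context, the full flow I'm attempting looks roughly like this (a sketch: the project name, path, and variable names are placeholders, and my understanding is that finalize() fails if upload() was never called):

from clearml import Dataset

# get the latest version of the parent dataset
parent = Dataset.get(
    dataset_name="[LTV] Dataset",
    dataset_project="LTV",  # placeholder project name
)

# create a child version on top of the parent
child = Dataset.create(
    dataset_name="[LTV] Dataset",
    dataset_project="LTV",
    parent_datasets=[parent],
)
child.add_files("path/to/new_data")  # placeholder path
child.upload()    # without this, finalize() raises on pending files
child.finalize()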
Actually, that's not how it works: pip will install in whatever order it sees fit, and the order is not consistent between versions (it has to do with dependency resolution).
Oh I see. What a pain. 🤣
You can configure the agent to first install specific packages, and only then the others; just add the package names here:
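For example, in the agent's clearml.conf (a sketch; the key is agent.package_manager.priority_packages as I understand it, and the package names below are just examples):

agent {
    package_manager {
        # these are resolved and installed before the rest of the requirements
        priority_packages: ["cython", "numpy", "setuptools"]
    }
}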
That's an interesting solution. I'll keep that in mind as I work more with ClearML.
Thanks for your help Martin!
Interesting approach. I'll give that a try. Thanks for the reply!
It did update, but I'm beginning to see the issue now. It seems like the metrics on thousands of experiments amounted to a few MB, whereas deleting one of the hyperparameter experiments freed up over a gigabyte. I'm having difficulty seeing why one of these experiments occupies so much metric space. From the hyperparameter optimization dashboard, the graphs and the tables might have a few thousand points.
Can you help me:
- Better understand why it's occupying so much metric storage space
- Is the...
The plot thickens. It seems like there's something odd going on with the interaction between [LTV] and additional text. If I just search [LTV] it works, and if I just search Dataset Test it works, but if I put them together it breaks the search. Now that I think about it, there are other oddities in the web interface that might be explained by bugs around using brackets in names.
Stability is for wimps. I live on the edge of bringing down production at any moment, like a real developer. But thanks for the update! 🙃
@<1523701435869433856:profile|SmugDolphin23> I spoke too soon. It does resolve the error I posted, but it introduces a new error. While this error does seem to be related to VS Code, the strange thing is that it doesn't occur if I run it with earlier versions of clearml.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/natephysics/.vscode-server/extensions/ms-python.python-2023.22.1/pythonFiles/lib/python/debugpy/_vendo...
I made a video of the Scheduler config error. You can see that the same code run locally works and doesn't on remote. (I just uploaded the video so the quality might suffer until YT finishes processing the higher resolution versions).
Why? That's not how I authenticate. Also, if it were simply an authentication issue, wouldn't there be some error message in the log?
Yeah, it's because it's just hooking into the save operation and capturing the output, regardless of the parent call.
Yes, it indeed appears to be a regex issue. If I run:
import re
from clearml import Dataset

datasets = Dataset.list_datasets(
    dataset_project=self.task.get_project_name(),
    # escaping the brackets is what makes the lookup succeed
    partial_name=re.escape('[LTV] Dataset Test'),
    only_completed=True,
)
It works as expected. I'm not sure how raw you want to leave the partial_name feature. I could create a PR to fix this, but would you want me to re.escape at the list_datasets() level? Or go deeper and do it at `Task._query_task...
@<1523701070390366208:profile|CostlyOstrich36> Just pinging you 😄
I think this error occurred for me because when I first authenticated with the project I was using username/password and later I transitioned to using ssh keys. That's why clearing the cache worked.
Did you validate that the branch exists on the remote?
Yes, I'm experimenting with this. I actually wrote my own process to do this, so I just had to adapt it as a callable to pass to the scheduler. However, I'm running into an issue, and I don't think it's user error this time. When I start the scheduler, it starts running and shows up in the web-app, but then an error message pops up in the web-app: Fetch parents failed, and the Scheduler task disappears from the web-app. I can't even see an error log because the task is gone.
I'm running th...
Oh, duh. I'll test that out. But I did have agent.force_git_ssh_protocol: true set.
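For reference, that's set in the agent's clearml.conf (a sketch of how I have it):

agent {
    # force cloning over SSH even when the repo URL is https
    force_git_ssh_protocol: true
}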
Thanks, Eugen, for the quick reply. If I can add a suggestion/comment from my perspective: why is schedule_function included in the .add_task() method? As far as I can tell, if you use schedule_function it changes the very nature of the method: it's no longer adding a task but adding a function. It seems like it would make more sense if this were broken out into something like an .add_function() method. Also, if you call schedule_function, many of the other parameters in `.add...
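A sketch of what I mean (parameter names as I understand them from the docs; the task id, times, and function body are placeholders):

from clearml.automation import TaskScheduler

scheduler = TaskScheduler()

# scheduling an existing task: task-oriented arguments like queue make sense
scheduler.add_task(
    schedule_task_id="abc123",  # placeholder task id
    queue="default",
    hour=6,
    minute=0,
)

# scheduling a callable: add_task now means something entirely different,
# and most of the task-oriented arguments no longer apply
def my_job():
    print("running scheduled job")  # placeholder body

scheduler.add_task(
    schedule_function=my_job,
    hour=6,
    minute=0,
)

scheduler.start_remotely(queue="services")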
Alright, I deleted everything in the ClearML web-app, waited a day, and tried again, and it seems to be showing a configuration object in the configuration section of the scheduler task again. I honestly don't know what changed. Maybe some strange caching on the server side got cleaned up.
@<1523701205467926528:profile|AgitatedDove14> Question: Does the schedule_function
option in the TaskScheduler.add_task()
method run at the time the task is scheduled to execute? So if I pass a functi...
Actually, clearing the cache on the other project might have fixed it. I just tested it out and it seems to be working.
So far, when I use the web interface to delete a task or dataset that has artifacts on S3, it doesn't prompt me for credentials.
It hooks into the calls made by the code. If you never save the model to disk, log it to a tool like MLflow/TensorBoard, or manually add the artifact to ClearML, then AFAIK it won't save the artifact.
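For example (a sketch; joblib stands in for any framework save call that ClearML hooks, and the project/task names are placeholders):

import joblib
from clearml import Task

task = Task.init(project_name="Demo", task_name="artifact example")

model = {"weights": [1, 2, 3]}  # stand-in for a real model object

# saving to disk is what triggers the automatic capture...
joblib.dump(model, "model.pkl")

# ...or you can register an artifact explicitly:
task.upload_artifact(name="model", artifact_object=model)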
It seems that the error is related to this part of the code block. However, when I comment this out, I get the error I had two days ago with the missing configuration object.