Reputation
Badges 1
108 × Eureka!The answer is simple but also not completely obvious to someone new to the platform. So you can inject new command line args that hydra will recognize. This is what the Hydra section of args is for. However, if you enable _allow_omegaconf_edit_: True , I think ClearML will βinjectβ the OmegaConf saved under the configuration object of the prior run, overwriting the overrides. Iβll experiment with this behavior a bit more to be sure.
Hi @<1523701435869433856:profile|SmugDolphin23>
I'm a bit confused by your suggestion. To be clear, this is the logs from the HPO application instance that's spun up when you start the HPO process. I don't think we have any control over what python version or Pyro version is started in the application instance. I think this error occurs before any code on our end is run.
Yeah, it's because it's just hooking into the save operation and capturing the output, regardless of the parent call.
I've recently run into this error myself. Did you find any resolution?
Thanks Martin. I read this method as "getting the data associated with the model training" not "getting metadata for the model". This is what I'm looking for.
Ah, that makes sense. What is supposed to be hidden changes depending on the section your in, which makes sense. Now there needs to a packman sprite easter egg hidden somewhere else.
This does appear to resolve the issue. I'll keep you updated if I find any other issues. Thanks @<1523701435869433856:profile|SmugDolphin23>
We have a server that has many agents running on it because there are many instances where training can be run over several agents as a single agent doesn't take up all the resources available to the server.
From the logs it looks like the HPO application finds a worker from the queue, attempts to serialize the config sent to the worker, and crashes because of the version conflict with Pyro4. But I don't think we control any of that. I might be misunderstanding something. π
@<1539780284646428672:profile|PoisedElephant79> Are you sure you're not simply referring to the get operation? That seems to exclude archived datasets. But I don't see anything like that for the list_datasets operation.
Maybe the sleep between scheduler.mark_completed() and scheduler.delete() is too short? But I don't get why deleting the old scheduler task would break the new scheduler. I'm going to try testing by running the scheduler locally.
As far as I can tell there's nothing else running that isn't running on our hardware. Is there some way to see what application instances are active?
There is no issues when I run the "raw" script. Also, since it's based on tasks, the code must have run without fault for it to be pulled as a task in the pipeline.
As for when it fails, looking at the log here it looks like it's on the first task or maybe as the first task is launching. But I'd have to go back to be sure. I rolled back to 1.13.1 and that's working fine. But, if you want I can help explore this bug in detail because it would be nice to find the root of the issue. LmK what y...
Strange, the code seems to work perfectly when I run it locally. To make it more confusing, the queue that I enqueue it to when I run it remotely is using agents from the same server that I'm running it locally from.
You might want to start with the first steps guide then:
None
Is there currently a way to bind the same GPU to multiple queues? I believe the agent complains last time I tried (which was a bit ago).
This doesn't really make a lot of sense. ClearML would be better served for tracking which version of the code you used for a corresponding task and you'd use something like github or gitlab to track code and host your code. You could use ClearML to help you reconstruct the environment and code from a task given it's being tracked by git and hosted somewhere you can access.
After some digging we found it was actually caused by the routers IPS protection. I thought it would be strange for github to be throttling things at this scale.
@<1523701435869433856:profile|SmugDolphin23> Yeah, I just wanted to validate it was worth spending the time. Since there is already a parameter that takes callable (i.e. schedule_function ) it might make sense that we reuse the parameter. If it returns a str we validate that it's a task and if it does we can run the task as if we originally passed it as the task_id in .add_task() . This would only be a breaking change if the callable that was passed happened to return a task_id ...
Is it possible the cached repository was cloned before you changed your agent settings?
Which settings are you referring to? I can't remember if I was using https auth when the project would have been first cached. Would that make a difference?
Also, did you set
agent.enable_git_ask_pass: true
?
The only instance of it in the config is commented out.
# if set, use GIT_ASKPASS to pass user/pass when cloning / fetch repositories
# it solves pas...
1707128614082 bigbrother:gpu0 INFO task 59d23c5919b04fd6947c1e463fa8c78c pulled from 9890a035b8f84872ab18d7ff207c26c6 by worker bigbrother:gpu0
Current configuration (clearml_agent v1.7.0, location: /tmp/.clearml_agent.vo_oc47r.cfg):
----------------------
agent.worker_id = bigbrother:gpu0
agent.worker_name = bigbrother
agent.force_git_ssh_protocol = true
agent.python_binary = /home/natephysics/anaconda3/bin/python
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = ...
β¨ It works β¨
Thanks @<1523701205467926528:profile|AgitatedDove14> π
Thanks Eugen for the quick reply. If I can add a suggestion/comment from my perspective: Why is schedule_function included in the .add_task() method? As far as I can tell if you use schedule_function it changes the very nature of the method, it's no longer adding a task but adding a function . It seems like it would make more sense if this was broken into something like an .add_function() method. Also, if you call schedule_function many of the other parameters in `.add...
Yes, it indeed appears to be a regex issue. If I run:
Dataset.list_datasets(
dataset_project=self.task.get_project_name(),
partial_name=re.escape('[LTV] Dataset Test'),
only_completed=True,
)
It works as expected. I'm not sure how raw you want to leave the partial_name features. I could create a PR to fix this but would you want me to re.escape at the list_datasets() level? Or go deeper and do it at `Task._query_task...
I made a video of the Scheduler config error. You can see that the same code run locally works and doesn't on remote. (I just uploaded the video so the quality might suffer until YT finishes processing the higher resolution versions).
It seems that the error is related to this part of the code block. However, when I comment this out I get the error I had 2 days ago with the missing configuration object.
Actually, clearing the cache on the other project might have fixed it. I just tested it out and it seems to be working.
The plot thickens. It seems like there's something odd going on with the interaction between [LTV] and additional text. If I just search [LTV] it works, if I just search Dataset Test it works, but if I put them together it breaks the search. Now that I think about it, there's other oddities that seem to happen in the web interface that might be explained by some bugs around using brackets in names.