Is there currently a way to bind the same GPU to multiple queues? I believe the agent complained the last time I tried (which was a while ago).
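For reference, what I mean is something like this (a sketch; queue names are hypothetical):
```
# two daemons, each bound to the same GPU but serving a different queue
clearml-agent daemon --detached --gpus 0 --queue queue_a
clearml-agent daemon --detached --gpus 0 --queue queue_b
```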
Do you start the clearml agents on the server with the same user that has the credentials saved?
@<1523701205467926528:profile|AgitatedDove14>
And the Task is still running? What's the clearml python version and webui version?
No, the task stops (it's running remotely; I haven't tested it running locally).
It did update, but I'm beginning to see the issue now. It seems like the metrics on thousands of experiments amounted to a few MB, whereas deleting one of the hyperparameter experiments freed up over a gig. I'm having difficulty seeing why one of these experiments occupies so much metric space. From the hyperparameter optimization dashboard, the graphs and tables might have a few thousand points.
Can you help me:
- Better understand why it's occupying so much metric storage space
- Is the...
That's what I was getting at. It wasn't clear to me from the documentation that it saves the state.
No error. Just a new task each time.
@<1539780284646428672:profile|PoisedElephant79> Sorry for not getting back to you sooner. Dataset.get() doesn't work like you suggested. The documentation is clear:
Get a specific Dataset. If multiple datasets are found, the dataset with the highest semantic version is returned. If no semantic version is found, the most recently updated dataset is returned. This function raises an Exception in case no dataset can be found and the auto_create=True
...
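For reference, the call I'm making looks roughly like this (a sketch; project and dataset names are placeholders):
```python
from clearml import Dataset

# Per the docstring above: if several datasets match, the one with the
# highest semantic version (or, failing that, the most recent) is returned.
ds = Dataset.get(
    dataset_project="my_project",
    dataset_name="my_dataset",
)
print(ds.id)
```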
Thanks for the reply @<1523701070390366208:profile|CostlyOstrich36> !
It says in the documentation that:
Add a folder into the current dataset. Calculate file hash, and compare against parent, mark files to be uploaded
It seems to recognize the dataset as another version of the data but doesn't seem to be validating the hashes on a per-file basis. Also, if you look at the photo, it seems like some of the data does get recognized as the same as the prior data. It seems like it's the correct...
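For context, the flow I'm testing is roughly this sketch (names and paths are placeholders):
```python
from clearml import Dataset

# New version on top of the existing dataset; add_files() should hash each
# file, compare against the parent, and upload only what changed.
parent = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
child = Dataset.create(
    dataset_project="my_project",
    dataset_name="my_dataset",
    parent_datasets=[parent.id],
)
child.add_files(path="data/")
child.upload()
child.finalize()
```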
This is odd: the ordering of the files is different, and some appear to be missing from the preview. But as far as I can tell, the files aren't different. What am I missing here?
Actually, clearing the cache on the other project might have fixed it. I just tested it out and it seems to be working.
They will be related through the task. Get the task information from the dataset, then get the model information from the task.
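A minimal sketch of that lookup, assuming the dataset's id matches its backing task's id (which holds for recent clearml versions; names are placeholders):
```python
from clearml import Dataset, Task

ds = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
task = Task.get_task(task_id=ds.id)  # the task backing the dataset
print(task.models["output"])         # models registered on that task
```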
I will open a GitHub issue. Is this part open source? Could I make a PR?
In the meantime I still need to implement this with the current version of ClearML. So the only way would be to have one variable per parent? Is there a smarter way to work around it?
Since this could happen with a lot of services, maybe it would be worth a retry option, especially if it's part of a pipeline?
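In the meantime, a plain retry wrapper is the kind of thing I mean (a sketch, not ClearML API; the attempt count and delay are arbitrary):
```python
import time

def with_retries(fn, attempts=3, delay=30):
    """Call fn(), retrying on any exception; re-raise the last failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)  # back off before the next try
```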
After some digging, we found it was actually caused by the router's IPS protection. I thought it would be strange for GitHub to be throttling things at this scale.
Hi @<1523701435869433856:profile|SmugDolphin23>
I'm a bit confused by your suggestion. To be clear, these are the logs from the HPO application instance that's spun up when you start the HPO process. I don't think we have any control over which python or Pyro version runs in the application instance. I think this error occurs before any of our code runs.
I just checked the clearml.conf and I'm not specifying any version of python for the agents.
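For reference, the relevant clearml.conf key is left unset on my side (a sketch; the path shown is hypothetical):
```
agent {
    # python_binary: "/usr/bin/python3.10"   # commented out, so the agent uses its default python
}
```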
Can you provide a bit more detail? What framework are you using?
This does appear to resolve the issue. I'll keep you updated if I find any other issues. Thanks @<1523701435869433856:profile|SmugDolphin23>
That makes sense. I was confused about what the source was.
I'm not self-hosting the server.
Thanks again for the info. I might experiment with it to see first hand what the advantages are.
I actually ran into the exact same problem. The agents aren't hosted on AWS though, just an in-house server.
I made a video of the Scheduler config error. You can see that the same code runs locally but doesn't work on remote. (I just uploaded the video, so the quality might suffer until YT finishes processing the higher-resolution versions.)
@<1523701205467926528:profile|AgitatedDove14> Then it isn't working as intended. To test it, I started the scheduler and set a simple dead-man-snitch process to run once a day. In the web app (on your site app.clear.ml), when looking at the scheduler process in the DevOps section, I was able to see a configuration file under artifacts, but it was not at all obvious how you'd change it, because it wasn't part of the configuration section; it was just an artifact. So I thought maybe it was b...
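For context, the scheduler setup was roughly this sketch (the task id and schedule are placeholders):
```python
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
scheduler.add_task(
    schedule_task_id="TASK_ID",  # the dead-man-snitch task (placeholder id)
    queue="default",
    minute=0,
    hour=9,  # once a day at 09:00
)
scheduler.start_remotely(queue="services")
```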