Hi Jake 👍,
Maybe the content is cached? The repo isn't big. I didn't realize the log was missing content. I believe I copied everything but I'll double check in a moment.
✨ It works ✨
Thanks @<1523701205467926528:profile|AgitatedDove14> 😁
I'm not sure why the logs were incomplete. I think part of the reason it wasn't pulling from the repo was that it was pulling from the cache. I cleared the ClearML cache for that project and reran it. This should be the full log.
Is there currently a way to bind the same GPU to multiple queues? I believe the agent complained the last time I tried (which was a while ago).
I just checked the clearml.conf and I'm not specifying any version of Python for the agents.
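i.e. if I'm reading the file right, the relevant part of the agent section is just left at its default, something like this (assuming I have the field name right):

agent {
    # empty, so the agent falls back to the default system interpreter
    python_binary: ""
}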
This is odd: the ordering of the files is different, and some appear to be missing from the preview. But as far as I can tell the files themselves aren't different. What am I missing here?
As far as I can tell, everything that's running is on our own hardware. Is there some way to see which application instances are active?
From the logs it looks like the HPO application finds a worker from the queue, attempts to serialize the config sent to the worker, and crashes because of the version conflict with Pyro4. But I don't think we control any of that. I might be misunderstanding something. 🙃
Thanks Martin. I read this method as "getting the data associated with the model training" not "getting metadata for the model". This is what I'm looking for.
Alright, I deleted everything in the ClearML web app, waited a day, and tried again; it seems to be showing a configuration object in the configuration section of the scheduler task again. I honestly don't know what changed. Maybe some strange caching on the server side got cleaned up.
@<1523701205467926528:profile|AgitatedDove14> Question: Does the schedule_function option in the TaskScheduler.add_task() method run at the time the task is scheduled to execute? So if I pass a functi...
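Roughly what I'm asking about, as a sketch (assuming the clearml.automation.TaskScheduler API; the function body, queue name, and times are placeholders):

from clearml.automation import TaskScheduler

def create_task_to_run():
    # Is this called when the schedule fires, or once up front when add_task() is called?
    ...

scheduler = TaskScheduler()
scheduler.add_task(
    schedule_function=create_task_to_run,
    queue="default",  # placeholder queue name
    hour=6,
    minute=0,
)
scheduler.start_remotely(queue="services")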
I found I was having this issue as well. I don't have an alias defined in a pipeline, but I do in a task, and I get the same error. I'm not hosting my own server but using the free web service at the moment.
Depending on the framework you're using, it'll just hook into the model save operation, so it captures every time you save a model, which will probably happen every epoch for at least part of the training. If you want to do this with the existing framework, you could change the checkpointing so that it only keeps a copy of the best model in memory and leaves the write operation for last. The risk is that if training crashes, you'll lose your best model.
Optionally, you could also disable the ClearML integration with...
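Something like this is what I mean by keeping the best model in memory and writing it once at the end (a rough sketch; train_one_epoch / validate / the model and loaders are placeholders for your own loop):

import copy
import torch

best_score = float("-inf")
best_state = None

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)  # placeholder
    score = validate(model, val_loader)              # placeholder

    if score > best_score:
        best_score = score
        # keep a copy of the best weights in memory instead of writing a file every epoch
        best_state = copy.deepcopy(model.state_dict())

# single save at the end, so only one checkpoint gets written (and captured)
if best_state is not None:
    torch.save(best_state, "best_model.pt")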
Hi again @<1523701435869433856:profile|SmugDolphin23> ,
The approach you suggested seems to be working, albeit with one issue. It correctly identifies the different versions of the dataset when new data is added, but I get an error when I try to finalize the dataset:
Code:
if self.task:
    # get the parent dataset from the project
    parent = self.clearml_dataset = Dataset.get(
        dataset_name="[LTV] Dataset",
        dataset_project=...
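The rest of the flow looks roughly like this (a sketch, not my exact code; the path is a placeholder), and it's the finalize() call at the end that throws the error:

from clearml import Dataset

# create a child version on top of the parent fetched above
child = Dataset.create(
    dataset_name="[LTV] Dataset",
    dataset_project=parent.project,
    parent_datasets=[parent.id],
)
child.add_files("path/to/new/data")  # placeholder path
child.upload()
child.finalize()  # <- this is where I get the error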
I will add a GitHub issue. Is this part open source? Could I make a PR?
In the meantime I still need to implement this with the current version of ClearML. So the only way would be to have one variable per parent? Is there a smarter way to work around it?
If I wanted to do this with the ID, how would I approach it?
Yeah, it's because it's just hooking into the save operation and capturing the output, regardless of the parent call.
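If you want it to stop capturing those saves altogether, I believe you can turn off the framework hook when initializing the task, something like this (project/task names are placeholders):

from clearml import Task

task = Task.init(
    project_name="my_project",  # placeholder
    task_name="my_task",        # placeholder
    # disable just the PyTorch save hook; other frameworks keep reporting
    auto_connect_frameworks={"pytorch": False},
)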
I think the PR is a good idea. I read the contribution guidelines; they talk about referencing an issue. Did you want me to duplicate this issue on the repo, or is it enough to link to this thread?
Hi @<1523701087100473344:profile|SuccessfulKoala55> - We tried to delete some additional hyperparameter tunings, but it doesn't seem to have affected the stored metrics. It's not clear to me what is occupying all the metric storage space.
I'm using Pro. Sorry for the delay, I didn't notice I never sent the response.
We have a server with many agents running on it, because in many cases training can be spread across several agents; a single agent doesn't use up all the resources available on the server.
Thanks, that's exactly what I was looking for.
@<1539780284646428672:profile|PoisedElephant79> Are you sure you're not simply referring to the get operation? That seems to exclude archived datasets. But I don't see anything like that for the list_datasets operation.
I had 2 archived datasets and 0 unarchived ones. When I ran the following command:
Dataset.list_datasets(dataset_project=self.task.get_project_name(), only_completed=True)
It returned two entries for the two archived datasets.