this is the correct file
Because of a server error I can't download the log so I attached a screenshot. In the log I see only the following reports (without a summary table/plot).
CostlyOstrich36 The application problem was indeed solved 🙂 but the plots issue wasn't
Unfortunately, I am not running on a community server
I really like the first idea, but I already see a problem: if I make changes to the file, will I need to re-upload it every time?
I am currently using the repo cache, but unfortunately every time I run a new task with an existing cached repo, a new cache file is created.. very weird behaviour which I have already described in previous threads ( https://clearml.slack.com/archives/CTK20V944/p1651393535530439?thread_ts=1651063782.712379&cid=CTK20V944 )
But I can add screenshots of the log file if necessary
CostlyOstrich36 yes - sorry for the wrong terminology
CumbersomeCormorant74 As you can see in the attached screenshot, there were 2 experiments at the same time, but only one agent pulled a task, even though the second agent was free and listening to the queue.
I would like to delete a tag name from the list of tags, not from a task, i.e. from here:
Yeah, this is a lock which is always in our cache; I can't figure out why it's there, but when I delete the lock and the other files, they always reappear when I run a new ClearML task.
Another thing I should note: I recently had an error whose fix was to run git config --global --add safe.directory /root/.clearml/vcs-cache/r__ (git repo name).d7f
Ever since, once I run a new task, a new file appears in the cache with the format <git repo name.lock file name_a bunch of numbers>
EDIT CostlyOstrich36
Third image - the cache after running another task, with a new cache file created even though the cache is disabled
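(For reference, by "disabled" I mean turning off the agent's vcs cache in clearml.conf. A minimal sketch of the section I edited, assuming the standard agent config layout:)
    agent {
        vcs_cache {
            enabled: false
            path: ~/.clearml/vcs-cache
        }
    }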
In another task I tried to evaluate this metric but received a similar error: clearml.automation.optimization - WARNING - Could not find requested metric ('evaluate', 'val_loss') report on base task
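(For context, my understanding of how the metric would need to be reported on the base task for the optimizer to find it. A minimal sketch, with placeholder project/task names and a placeholder validation step:)
    from clearml import Task

    task = Task.init(project_name='my_project', task_name='base task')

    for epoch in range(10):
        val_loss = evaluate_model()  # placeholder for my actual validation step
        # report under the same title/series the optimizer is told to look for
        task.get_logger().report_scalar(
            title='evaluate', series='val_loss', value=val_loss, iteration=epoch)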
I don't know if it has anything to do with it, but I now see that the repo which is cloned and saved in the cache is actually a different branch than the one pulled by the agent.
From the log file:
repository = https://gitlab.com/data_science_team/PROJECT_NAME
branch = MYBRANCH
SOMENUMBER IP### DEBUG
warning: redirecting to https://gitlab.com/data_science_team/PROJECT_NAME.git/
This is not something that we defined or created, if I understand your question. It is created once a ClearML task is run, and stays there until the lock is deleted (which is something we do to handle another error I posted about here).
AgitatedDove14 and what about 'epoch_loss' without validation? That's the scalar whose origin I'm trying to understand. I thought it was just the loss reported at the end of the training epoch via TensorFlow.
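(My assumption, sketched below: the Keras TensorBoard callback writes an 'epoch_loss' scalar at the end of every training epoch, and ClearML's TensorBoard binding simply picks it up. The tiny model and data here are placeholders just to make the sketch self-contained:)
    import numpy as np
    import tensorflow as tf
    from clearml import Task

    task = Task.init(project_name='my_project', task_name='keras training')

    # placeholder model and data
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')
    x_train = np.random.rand(64, 4).astype('float32')
    y_train = np.random.rand(64, 1).astype('float32')

    # the TensorBoard callback emits 'epoch_loss' (and other 'epoch_*' scalars)
    # at the end of each epoch; ClearML auto-logs whatever goes to the log dir
    model.fit(x_train, y_train, epochs=3,
              callbacks=[tf.keras.callbacks.TensorBoard(log_dir='./logs')])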
Hi, adding the requirements to the code doesn't help because the task fails beforehand. CostlyOstrich36
As it seems, torch.save() just saves to the disk, so there is no need for a (server) parent path, just the name of the file, in my case: 'my_model.pt'.
Thanks for your help CostlyOstrich36
yes of course, I specified it with: torch.save(state, f_path)
where f_path = os.path.join(tempfile.gettempdir(), 'trained_models') + '\my_model.pt'
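(To make the path handling explicit, a rough sketch of what I mean, using os.path.join for the file name instead of the trailing backslash, and with a placeholder state dict:)
    import os
    import tempfile
    import torch
    from clearml import Task

    task = Task.init(project_name='my_project', task_name='save model')

    out_dir = os.path.join(tempfile.gettempdir(), 'trained_models')
    os.makedirs(out_dir, exist_ok=True)
    f_path = os.path.join(out_dir, 'my_model.pt')

    state = {'epoch': 1}  # placeholder for the real model/optimizer state dict
    # torch.save() only writes to the local disk; ClearML's framework auto-logging
    # (or an explicit upload) is what registers the file with the server
    torch.save(state, f_path)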
Can you check if there is a situation where there are tasks pending in that queue while one of the workers is idle?
In what queue? In services there are no pending tasks because they were all pulled by 'training'.
just to make sure, how do you start the agents? Are you using the --services-mode option?
I used clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
We have been trying to resolve the issue. I will comment here again if any more problems arise. Thanks!
Thanks! The second link is exactly what I was looking for 🙂
I had a task which I had cloned and reset a bunch of times; when I created it as a new task, the error didn't appear again.