
We have been trying to resolve the issue. I will comment here again if any more problems arise. Thanks!
I would like to delete a tag name from the list of tags, not from a task, i.e. from here:
Can you check if there is a situation where there are tasks pending in that queue while one of the workers is idle?
In what queue? In 'services' there are no pending tasks, because they were all pulled by 'training'.
Just to make sure, how do you start the agents? Are you using the --services-mode option?
I used clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
AgitatedDove14 and what about 'epoch_loss' without validation? That's the scalar I am interested in; I want to understand where it comes from. I thought that was just the loss reported at the end of the training epoch via TF.
Where should I look to see this metric? In the Scalars tab?
I indeed have a different scalar there: val_loss
But I have reported this metric in the checkpoint, not in the logger.
I attached the TensorBoard log.
The scalar reported in the TF log is 0.2631.
The scalar reported in the Scalars board in the UI is 0.121.
There is a major difference between the two.
This is not something that we defined or created, if I understand your question. It is created once a ClearML task is run, and it remains there until the lock is deleted (which is something we do to handle another error I posted about here).
CostlyOstrich36 Another clarification:
The master branch cache is stored at '.clearml/vcs-cache'. The code file doesn't exist there, and the problem described above (multiple cache files of the same repo) is occurring in this folder.
While my branch is stored at '.clearml/venvs-builds/3.7/task_repository/'.
And afterwards, I have the following output that continues for 300 iterations without further reports of metrics
In the child dataset task I see the following:
ARTIFACTS -> STATE: Dataset state
Files added/modified: 1 - total size 518.78 MB
Current dependency graph: {
"0385db....": [],
()"94f4....": ["0385db..."]
}
The child task is "94f4.." and the parent task is "0385db...",
but what does the "()" line mean?
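For context, here is a minimal sketch of how a dependency graph like the one above can be read: each dataset version ID maps to the list of its parent version IDs. The IDs and the helper function below are shortened placeholders for illustration, not real ClearML identifiers or API.

```python
# Illustrative sketch only: a parent/child dataset dependency graph,
# mapping each version ID to the IDs it was built on top of.
# IDs are shortened placeholders.
graph = {
    "0385db": [],          # parent dataset: no ancestors
    "94f4": ["0385db"],    # child dataset: depends on the parent
}

def ancestors(version, g):
    """Collect all ancestor versions by walking parent links."""
    found = []
    for parent in g.get(version, []):
        found.append(parent)
        found.extend(ancestors(parent, g))
    return found

print(ancestors("94f4", graph))    # -> ['0385db']
print(ancestors("0385db", graph))  # -> []
```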
Just to clarify again - when I start the agents I run: clearml-agent daemon --detached --queue training
and then: clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
This is why there are 'training' and 'training_2' queues.
I also tried to connect to the dataset from the CLI and received a connection error:
In another task I have tried to evaluate this metric but received a similar error: clearml.automation.optimization - WARNING - Could not find requested metric ('evaluate', 'val_loss') report on base task
Hi, adding the requirements to the code doesn't help because the task fails beforehand. CostlyOstrich36
EDIT CostlyOstrich36
Third image: the cache after running another task; a new cache file was created even though the cache is disabled.
Yeah, this is a lock which is always in our cache; I can't figure out why it's there, but when I delete the lock and the other files, they always reappear when I run a new ClearML task.
Another thing I should note: I recently had an error whose fix was to run git config --global --add safe.directory /root/.clearml/vcs-cache/r__ (git repo name).d7f
Ever since, whenever I run a new task, a new file appears in the cache with the format <git repo name.lock file name_a bunch of numbers>.
My questions are:
- how can I avoid creating tens of new cache files?
- do you happen to know why this lock is created, and how it is connected to the above error (in the link, regarding "failing to clone..")?
SuccessfulKoala55 I can't share the logs.
Because of a server error I can't download the log so I attached a screenshot. In the log I see only the following reports (without a summary table/plot).
I have some info that I wouldn't like to post here (due to security reasons), is there a way to share the link only with your user? 🙂
CostlyOstrich36 The application problem was indeed solved 🙂 but the plots one wasn't.
Sending you a private message, CostlyOstrich36
Unfortunately, I am not running on a community server
Interesting, I am only now seeing **optimizer_kwargs
It seems that it will fix my problem. Would it be too much to ask for an example of how to initialize the Optuna object with the kwargs (mainly how to initialize the 'trial', 'study', and 'objective' arguments)? 🙂
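In case it helps while waiting for an answer, here is a minimal sketch of how **optimizer_kwargs-style forwarding generally works in Python. The class and parameter names below (DummyOptimizer, Wrapper, sampler, n_trials) are made up for illustration and are not ClearML's or Optuna's actual API.

```python
# Illustrative only: how a wrapper can forward extra keyword arguments
# to an underlying optimizer class, in the spirit of **optimizer_kwargs.
# DummyOptimizer and Wrapper are hypothetical names, not real ClearML classes.
class DummyOptimizer:
    def __init__(self, sampler=None, n_trials=10):
        self.sampler = sampler
        self.n_trials = n_trials

class Wrapper:
    def __init__(self, optimizer_class, **optimizer_kwargs):
        # Any extra keyword arguments are passed straight through
        # to the optimizer class's constructor.
        self.optimizer = optimizer_class(**optimizer_kwargs)

w = Wrapper(DummyOptimizer, sampler="tpe", n_trials=50)
print(w.optimizer.sampler, w.optimizer.n_trials)  # tpe 50
```

The idea is that anything you pass as an extra keyword to the wrapper ends up in the wrapped optimizer's constructor unchanged.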
Thank you for the clarification, everything is clear now 🙂
I had a task which I had cloned and reset a bunch of times; when I created the task as a new one, the error didn't appear again.