
Can you check if there is a situation where there are tasks pending in that queue while one of the workers is idle?
In what queue? In the services queue there are no pending tasks because they were all pulled by 'training'.
just to make sure, how do you start the agents? Are you using the --services-mode option?
I used clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
Sending it to you in private, CostlyOstrich36
CumbersomeCormorant74 As you can see in the attached - there were 2 experiments at the same time, but only one agent pulled the task, even though the second agent was free and listening to the queue.
In another task I have tried to evaluate this metric but received a similar error: clearml.automation.optimization - WARNING - Could not find requested metric ('evaluate', 'val_loss') report on base task
But I can add screenshots of the log file if necessary
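For context, this is roughly how that metric gets reported in the base task (the project/task names and loss values below are placeholders; only the 'evaluate'/'val_loss' title/series pair matches what the optimizer is looking for):
```python
# Rough sketch of the metric reporting in the base task.
# Project/task names and loss values are placeholders.
from clearml import Task, Logger

task = Task.init(project_name="my_project", task_name="base_task")

for epoch in range(10):
    val_loss = 1.0 / (epoch + 1)  # placeholder value
    # title/series must match what the optimizer requests: ('evaluate', 'val_loss')
    Logger.current_logger().report_scalar(
        title="evaluate", series="val_loss", value=val_loss, iteration=epoch
    )
```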
I have some info that I wouldn't like to post here (due to security reasons), is there a way to share the link only with your user ? 🙂
CostlyOstrich36 yes - sorry for the wrong terminology
Another question: Is there a way to group together Dataset tasks (i.e. redefine their parent) after the tasks have been finalized? In the same context: is there a way to change the dependency graph in the ClearML dashboard after the task has been created and finalized?
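For reference, this is roughly how I set the parent today, at creation time (project/dataset names and the local path are placeholders); what I'm asking is whether the parent can still be changed after finalize():
```python
# Minimal sketch of setting a parent when the child dataset is created.
# Project/dataset names and the local path are placeholders.
from clearml import Dataset

parent = Dataset.get(dataset_project="my_project", dataset_name="parent_dataset")

child = Dataset.create(
    dataset_name="child_dataset",
    dataset_project="my_project",
    parent_datasets=[parent.id],  # parent is defined here, at creation time
)
child.add_files("data/new_files")
child.upload()
child.finalize()
```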
also tried to connect to the dataset from the CLI and received a connection error:
EDIT
I have disabled the VCS cache and it seems that multiple cache files are still created when running a new task. The lock is also still created once a new experiment is run: first image - the cache after removing the lock; second image - a few seconds later, after running a new task. Also attached is the log output of the uploaded task (with ### replacing non-relevant details).
this is the correct file
Unfortunately, I am not running on a community server
EDIT CostlyOstrich36
third image - the cache after running another task, with a new cache file created even though the cache is disabled
What will happen if I disable the cache? Is there a way to find out which experiment is hung and why, in order to avoid this?
I had a task which I had cloned and reset a bunch of times; when I created the task as a new one, the error didn't appear again.
Interesting, I am only now seeing **optimizer_kwargs
It seems that it will fix my problem. Would it be too much to ask for an example of how to initiate the optuna object with the kwargs (mainly how to initiate the 'trial', 'study' and 'objective' arguments)? 🙂
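Just so it's clear what I mean by 'study', 'objective' and 'trial' - this is how they fit together in plain Optuna, outside ClearML (placeholder objective); whether objects like these can be passed through **optimizer_kwargs is exactly what I'd like the example to show:
```python
# Plain Optuna sketch showing how study/objective/trial relate.
import optuna

def objective(trial):
    # the trial suggests hyperparameter values for a single run
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    layers = trial.suggest_int("layers", 1, 4)
    return lr * layers  # placeholder score, would normally be a validation metric

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```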
Thanks! The second link is exactly what I was looking for 🙂
Yeah, this is a lock which is always in our cache; can't figure out why it's there, but when I delete the lock and the other files, they always reappear when I run a new ClearML task.
Another thing I should note: I recently had an error whose fix was to run git config --global --add safe.directory /root/.clearml/vcs-cache/r__ (git repo name).d7f
Ever since, once I run a new task, a new file appears in the cache with the format <git repo name.lock file name_a bunch of numbers>
Hi, adding the requirements to the code doesn't help because the task fails beforehand. CostlyOstrich36
In the child dataset task I see the following:
ARTIFACTS -> STATE: Dataset state
Files added/modified: 1 - total size 518.78 MB
Current dependency graph: {
"0385db....": [],
()"94f4....": ["0385db..."]
}
child task is 94f4..
and parent task is "0385db..."
but what does the () line mean?
As it seems, torch.save() just saves to the disk, so there is no need for a (server) parent path, just the name of the file; in my case: 'my_model.pt'.
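Something like this minimal sketch is what I mean (project/task names, the model and the URI are placeholders) - torch.save() only writes to local disk, and as far as I understand ClearML picks the file up from the running task, with output_uri controlling where it gets uploaded:
```python
# Minimal sketch: local torch.save() inside a ClearML task.
# Project/task names, the model and the URI are placeholders.
import torch
import torch.nn as nn
from clearml import Task

task = Task.init(
    project_name="my_project",
    task_name="train",
    output_uri=True,  # or an explicit destination, e.g. "s3://my-bucket/models"
)

model = nn.Linear(10, 1)  # placeholder model
torch.save(model.state_dict(), "my_model.pt")  # just the file name, no parent path
```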
Thanks for your help CostlyOstrich36
SuccessfulKoala55 I can't share the logs.
Does this relate to the error below? From reading the issue I didn't see anyone mentioning this error: clearml_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository='https://gitlab.com/data_science_team/____', branch='ilanit', commit_id='b5c___', tag='', docker_cmd='ubuntu:18.04', entry_point='training/____.py', working_dir='src')
2) Check if remote-worker has valid credentials [see worker configuration file]
Just to clarify again - when I start the agents I run: clearml-agent daemon --detached --queue training
and then: clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
This is why there are 'training' and 'training_2' queues.
We have been trying to resolve the issue. I will comment here again if any more problems arise. Thanks!