I will elaborate on the situation:
I have 2 agents - training and training_2. They are both listening to the services queue, but only 'training' pulls the tasks. At the beginning I had 2 tasks in the services queue. Then agent 'training' pulled one and is currently executing it, but for some reason it also pulled the 2nd task into its own queue, even though this agent is not free and I have another agent which is: 'training_2'.
SuccessfulKoala55 I can't share the logs.
And afterwards I have the following output, which continues for 300 iterations without any further metric reports
CostlyOstrich36 yes - sorry for the wrong terminology
this is the correct file
In another task I tried to evaluate this metric but received a similar error: clearml.automation.optimization - WARNING - Could not find requested metric ('evaluate', 'val_loss') report on base task
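For reference, this is roughly how the objective metric is configured on my side; a minimal sketch only, assuming the clearml.automation.HyperParameterOptimizer API, with the base task ID, parameter range, and queue name as placeholders:
```
from clearml import Task
from clearml.automation import HyperParameterOptimizer, UniformParameterRange

# Placeholder: the base task whose ('evaluate', 'val_loss') scalar the optimizer should find
BASE_TASK_ID = "<base_task_id>"

task = Task.init(project_name="examples", task_name="HPO", task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id=BASE_TASK_ID,
    hyper_parameters=[
        # Placeholder hyperparameter range
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-1),
    ],
    # The metric the optimizer looks for on the base task:
    # title 'evaluate', series 'val_loss', minimized
    objective_metric_title="evaluate",
    objective_metric_series="val_loss",
    objective_metric_sign="min",
    max_number_of_concurrent_tasks=2,
    execution_queue="services",  # placeholder queue name
)

optimizer.start()
optimizer.wait()
optimizer.stop()
```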
But I can add screenshots of the log file if necessary
AgitatedDove14 and what about 'epoch_loss' without validation? That's the scalar whose origin I'm trying to understand. I thought it was just the loss reported at the end of each training epoch via TF.
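Just so we're talking about the same thing, this is roughly how that scalar gets written on my side; a minimal sketch, assuming a Keras model with the TensorBoard callback (the model and data here are placeholders):
```
import numpy as np
import tensorflow as tf

# Placeholder data and model, just to show where 'epoch_loss' comes from
x_train = np.random.rand(128, 4).astype("float32")
y_train = np.random.rand(128, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# The TensorBoard callback logs the average training loss at the end of each
# epoch, which is what I understand shows up as the 'epoch_loss' scalar
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="./tb_logs")
model.fit(x_train, y_train, epochs=5, callbacks=[tensorboard_cb], verbose=0)
```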
I also tried to connect to the dataset from the CLI and received a connection error:
In the child dataset task I see the following:
ARTIFACTS -> STATE: Dataset state
Files added/modified: 1 - total size 518.78 MB
Current dependency graph: {
"0385db....": [],
()"94f4....": ["0385db..."]
}
The child task is 94f4..
and the parent task is "0385db...",
but what does the () line mean?
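For context, the child dataset was created along these lines; a minimal sketch, assuming the clearml.Dataset API, with placeholder names and the IDs abbreviated as above:
```
from clearml import Dataset

# Create a child dataset on top of the existing finalized parent;
# this is what produces the '"94f4...": ["0385db..."]' entry in the dependency graph
child = Dataset.create(
    dataset_name="my_dataset",      # placeholder name
    dataset_project="my_project",   # placeholder project
    parent_datasets=["0385db..."],  # the parent dataset ID
)
child.add_files("path/to/new_or_modified_files")  # the single added/modified file (~518 MB)
child.upload()
child.finalize()
```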
Sending it to you in a private message, CostlyOstrich36.
I attached the TensorBoard log.
The scalar reported to the TF log is: 0.2631
The scalar reported to the scalars board in the UI is: 0.121
There is a major difference between the two.
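To cross-check the two numbers, I can also report the value explicitly from the training code; a minimal sketch, assuming the Logger API, with the project/task names, title/series, and iteration as placeholders:
```
from clearml import Task

task = Task.init(project_name="examples", task_name="scalar check")  # placeholder names
logger = task.get_logger()

# Report the value I see in the TensorBoard log, so it can be compared
# side by side with the auto-captured scalar in the UI
logger.report_scalar(title="evaluate", series="epoch_loss", value=0.2631, iteration=300)
```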
I would like to delete a tag name from the list of tags, not from a task, i.e. from here:
Hi, adding the requirements to the code doesn't help because the task fails beforehand. CostlyOstrich36
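To clarify what I mean by adding the requirements to the code, it's along these lines; a minimal sketch, with placeholder package and project names:
```
from clearml import Task

# Declared before Task.init so the agent installs these when building the venv;
# in my case the task fails before it ever gets to this point
Task.add_requirements("pandas", "1.3.5")
Task.add_requirements("scikit-learn")

task = Task.init(project_name="examples", task_name="training")  # placeholder names
```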
Hi, the URL contains some details which I wouldn't like to share on this thread. Can I send it to one of you in private message?
This is not something that we defined or created, if I understand your question. It is created once a ClearML task is run, and it stays there until the lock is deleted (which is something we do to handle another error I posted about here).
EDIT
I have disabled the VCS cache, and it seems that the multiple cache files are still created when running a new task. The lock is also still created once a new experiment is run: first image - the cache after removing the lock; second image - a few seconds later, after running a new task. I also attached the log output of the uploaded task (with ### replacing non-relevant details).
Yeah, this is a lock which is always in our cache; I can't figure out why it's there, but when I delete the lock and the other files, they always reappear when I run a new ClearML task.
Another thing I should note: I recently had an error whose fix was to run git config --global --add safe.directory /root/.clearml/vcs-cache/r__ (git repo name).d7f
Ever since, whenever I run a new task, a new file appears in the cache with the format <git repo name.lock file name_a bunch of numbers>.
Does this relate to the error below? From reading the issue, I didn't see anyone mentioning this error: clearml_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository=' https://gitlab.com/data_science_team/____ ', branch='ilanit', commit_id='b5c___', tag='', docker_cmd='ubuntu:18.04', entry_point='training/____.py', working_dir='src')
2) Check if remote-worker has valid credentials [see worker configuration file]
Another question: is there a way to group together Dataset tasks (i.e. redefine their parent) after the tasks have been finalized? In the same context: is there a way to change the dependency graph in the ClearML dashboard after task creation and finalization?
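To illustrate the kind of regrouping I have in mind, something like merging already-finalized datasets into a new one; a sketch only, assuming Dataset.squash would apply here, with a placeholder name and the abbreviated IDs from above:
```
from clearml import Dataset

# Hypothetical: squash two finalized datasets into a single new dataset,
# effectively flattening their part of the dependency graph
merged = Dataset.squash(
    dataset_name="merged_dataset",         # placeholder name
    dataset_ids=["0385db...", "94f4..."],  # the parent and child dataset IDs
)
print(merged.id)
```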
CostlyOstrich36 Another clarification:
The master branch cache is stored at ".clearml/vcs-cache" - the code file doesn't exist there, and the problem described above (multiple cache files of the same repo) occurs in this folder.
My branch, meanwhile, is stored at ".clearml/venvs-builds/3.7/task_repository/".
CumbersomeCormorant74 As you can see in the attachment, there were 2 experiments at the same time, but only one agent pulled the tasks, even though the second agent was free and listening to the queue.
Hi, joining in on this thread. I just have the same error with my task, which is not an optimization one:
created virtual environment CPython3.7.5.final.0-64 in 232ms
creator CPython3Posix(dest=/home/ubuntu/.clearml/venvs-builds/3.7, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/ubuntu/.local/share/virtualenv)
added seed packages: pip==22.0.4, setuptool...
Can you check if there is a situation where there are tasks pending in that queue while one of the workers is idle?
In what queue? In the services queue there are no pending tasks, because they were all pulled by 'training'.
Just to make sure, how do you start the agents? Are you using the --services-mode option?
I used clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
We have been trying to resolve the issue. I will comment here again if any more problems arise. Thanks!