
I don't know if it has anything to do with it, but I now see that the repo which is cloned and saved in the cache is actually on a different branch than the one which is pulled by the agent.
From the log file:
repository = https://gitlab.com/data_science_team/PROJECT_NAME
branch = MYBRANCH
SOMENUMBER IP### DEBUG
warning: redirecting to https://gitlab.com/data_science_team/PROJECT_NAME.git/
Sending it to you in private, CostlyOstrich36
Another question: is there a way to group together Dataset tasks (i.e. redefine their parent) after the tasks have been finalized? In the same context: is there a way to change the dependency graph in the ClearML dashboard after task creation and finalization?
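For context, as far as I understand the parent relationship is normally declared when a new dataset version is created, not afterwards. A minimal sketch of that (project and dataset names are placeholders, not the real ones):

```python
from clearml import Dataset

# Placeholder project/dataset names - parents are declared at creation time
parent = Dataset.get(dataset_project="my_project", dataset_name="base_dataset")

child = Dataset.create(
    dataset_project="my_project",
    dataset_name="derived_dataset",
    parent_datasets=[parent.id],  # this is what the dependency graph is built from
)
child.add_files(path="new_data/")
child.upload()
child.finalize()
```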
CostlyOstrich36 Another clarification:
The master branch cache is stored at ".clearml/vcs-cache" - the code file doesn't exist there, and the problem described above (multiple cache files of the same repo) is occurring in this folder.
My branch, on the other hand, is stored at ".clearml/venvs-builds/3.7/task_repository/"
I indeed have a different scalar there: val_loss
but I have reported this metric in the checkpoint, not in the logger.
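For reference, this is roughly how I would report it explicitly through the logger instead; a minimal sketch, where the project/task names, the value, and the title/series are placeholders chosen to match the ('evaluate', 'val_loss') lookup mentioned below:

```python
from clearml import Task

# Placeholder names - not the real project/task
task = Task.init(project_name="my_project", task_name="evaluate")
logger = task.get_logger()

for epoch in range(3):
    val_loss = 0.5 / (epoch + 1)  # dummy value standing in for the real validation loss
    # Explicitly reported scalars show up in the SCALARS tab under title/series
    logger.report_scalar(title="evaluate", series="val_loss", value=val_loss, iteration=epoch)
```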
I have some info that I wouldn't like to post here (for security reasons); is there a way to share the link only with your user? 🙂
But I can add screenshots of the log file if necessary
In the child dataset task I see the following:
ARTIFACTS -> STATE: Dataset state
Files added/modified: 1 - total size 518.78 MB
Current dependency graph: {
"0385db....": [],
()"94f4....": ["0385db..."]
}
The child task is 94f4..
and the parent task is "0385db...",
but what does the () line mean?
And afterwards, I have the following output that continues for 300 iterations without further reports of metrics
In another task I tried to evaluate this metric but received a similar error: clearml.automation.optimization - WARNING - Could not find requested metric ('evaluate', 'val_loss') report on base task
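For completeness, this is roughly how the optimizer's objective is wired up on my side; a minimal sketch, assuming the metric is reported with title 'evaluate' and series 'val_loss' (the base task ID, parameter range, and optimizer strategy below are placeholders):

```python
from clearml.automation import HyperParameterOptimizer, UniformParameterRange, RandomSearch

# Placeholder base task ID and hyperparameter range
optimizer = HyperParameterOptimizer(
    base_task_id="BASE_TASK_ID",
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-1),
    ],
    # These must match exactly how the scalar is reported on the base task,
    # otherwise the "Could not find requested metric" warning appears
    objective_metric_title="evaluate",
    objective_metric_series="val_loss",
    objective_metric_sign="min",
    optimizer_class=RandomSearch,
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```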
Unfortunately, I am not running on a community server
CumbersomeCormorant74 As you can see in the attached - there were 2 experiments at the same time, but only one agent pulled the task, even though the second agent was free and listening to the queue.
Where should I look to see this metric? In the Scalars tab?
I will elaborate on the situation:
I have 2 agents - training and training_2. They are both listening to the services queue, but only 'training' pulls the tasks. At the beginning I had 2 tasks in the services queue. Then agent 'training' pulled one and is currently executing it, but for some reason it also pulled the 2nd task into its own queue, even though this agent is not free and the other agent, 'training_2', is.
EDIT CostlyOstrich36
Third image - the cache after running another task; a new cache file was created even though the cache is disabled.
SuccessfulKoala55 I can't share the logs.
CostlyOstrich36 The application problem was indeed solved 🙂 but the plots one wasn't
Hi, adding the requirements to the code doesn't help because the task fails beforehand. CostlyOstrich36
We have been trying to resolve the issue. I will comment here again if any more problems arise. Thanks!
I attached the TensorBoard log.
The scalar reported to the tf log is: 0.2631
The scalar reported to the scalars board in the UI is: 0.121
There is a major difference between the two.
I had a task which I had cloned and reset a bunch of times; when I created it as a new task, the error didn't appear again.
AgitatedDove14 And what about 'epoch_loss' without validation? That's the scalar whose origin I'm trying to understand. I thought it was just the loss reported at the end of the training epoch via tf.
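Just to illustrate what I mean by 'via tf' - a minimal sketch of the kind of logging I have in mind (names and values are placeholders); as far as I know ClearML picks these summaries up automatically once Task.init has run:

```python
import tensorflow as tf
from clearml import Task

# Placeholder project/task names - ClearML hooks into TensorBoard logging after Task.init
task = Task.init(project_name="my_project", task_name="train")

writer = tf.summary.create_file_writer("./tb_logs")
with writer.as_default():
    for epoch in range(3):
        epoch_loss = 1.0 / (epoch + 1)  # dummy value standing in for the real training loss
        # This is the kind of scalar that would show up in the UI as 'epoch_loss'
        tf.summary.scalar("epoch_loss", epoch_loss, step=epoch)
        writer.flush()
```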
this is the correct file
I really like the first idea, but I already see a problem: if I make changes to the file, will I need to re-upload it every time?
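Assuming the first idea was to attach the file to a task as an artifact (that's just my reading of it), re-uploading after each change would look roughly like this; names and paths are placeholders:

```python
from clearml import Task

# Placeholder project/task names and file path
task = Task.init(project_name="my_project", task_name="config_holder")

# Initial upload of the file
task.upload_artifact(name="config", artifact_object="configs/config.yaml")

# ...after editing the file locally, the same call has to be repeated
# so the stored copy reflects the change
task.upload_artifact(name="config", artifact_object="configs/config.yaml")
```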
I am currently using the repo cache, but unfortunately every time I run a new task with an existing cached repo, a new cache file is created.. very weird behaviour which I have already described in previous threads ( https://clearml.slack.com/archives/CTK20V944/p1651393535530439?thread_ts=1651063782.712379&cid=CTK20V944 )
Because of a server error I can't download the log, so I attached a screenshot. In the log I see only the following reports (without a summary table/plot).