My questions are:
- how can I avoid creating tens of new cache files?
- do you happen to know why this lock is created and how it is connected to the error above (in the link, regarding "failing to clone...")?
We have been trying to resolve the issue. I will comment here again if any more problems arise. Thanks!
CostlyOstrich36 The application problem was indeed solved 🙂 but the plots issue wasn't
Sending it to you in private, CostlyOstrich36
EDIT
I have disabled the VCS cache, but it seems that the multiple cache files are still created when running a new task. The lock is also still created once a new experiment is run: first image - the cache after removing the lock; second image - a few seconds later, after running a new task. I have also attached the log output of the uploaded task (with ### replacing non-relevant details).
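For reference, I disabled it in the agent's clearml.conf, roughly like this (a minimal sketch - the exact section layout may differ between agent versions):
```
agent {
    vcs_cache {
        # disable the git clone cache the agent keeps between runs
        enabled: false
    }
}
```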
I had a task which I had cloned and reset a bunch of times; when I created it as a new task, the error didn't appear again.
Hi, adding the requirements to the code doesn't help because the task fails beforehand. CostlyOstrich36
SuccessfulKoala55 I can't share the logs.
AgitatedDove14 and what about 'epoch_loss' without validation? That's the scalar whose origin I'm trying to understand. I thought it was just the loss reported at the end of the training epoch via TensorFlow.
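To illustrate what I mean, here is a minimal sketch (the model, data, and project/task names are just placeholders): as far as I understand, the Keras TensorBoard callback writes an 'epoch_loss' scalar at the end of every training epoch, and ClearML picks up TensorBoard scalars automatically.
```
import tensorflow as tf
from clearml import Task

# placeholder project/task names, just for the sketch
task = Task.init(project_name="examples", task_name="epoch_loss_demo")

# tiny placeholder model and data
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
x = tf.random.normal((64, 4))
y = tf.random.normal((64, 1))

# the TensorBoard callback is what emits the 'epoch_loss' scalar per epoch
model.fit(x, y, epochs=3,
          callbacks=[tf.keras.callbacks.TensorBoard(log_dir="./tb_logs")])
```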
I will elaborate on the situation:
I have 2 agents: 'training' and 'training_2'. They are both listening to the services queue, but only 'training' pulls the tasks. At the beginning I had 2 tasks in the services queue. Then agent 'training' pulled one and is currently executing it, but for some reason it also pulled the 2nd task into its queue, even though this agent is not free and I have another agent, 'training_2', which is.
Just to clarify again - when I start the agents I run: clearml-agent daemon --detached --queue training
and then: clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
This is why there are 'training' and 'training_2' queues.
Does this relate to the error below? From reading the issue I didn't see anyone mentioning this error:
clearml_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository='https://gitlab.com/data_science_team/____', branch='ilanit', commit_id='b5c___', tag='', docker_cmd='ubuntu:18.04', entry_point='training/____.py', working_dir='src')
2) Check if remote-worker has valid credentials [see worker configuration file]
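For context, the "worker configuration file" it points to is, as far as I understand, the agent's clearml.conf; the git credentials section there looks roughly like this (placeholder values, keys may vary by version):
```
agent {
    # credentials the agent uses when cloning private repositories
    git_user: "my-gitlab-username"   # placeholder
    git_pass: "my-gitlab-token"      # placeholder
}
```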
CumbersomeCormorant74 As you can see in the attachment - there were 2 experiments at the same time, but only one agent pulled the tasks, even though the second agent was free and listening to the queue.
Thanks! The second link is exactly what I was looking for 🙂
This is not something that we defined or created, if I understand your question correctly. It is created once a ClearML task is run, and it stays there until the lock is deleted (which is something we do to handle another error I posted about here).
I don't know if it has anything to do with it, but I now see that the repo which is cloned and saved in the cache is actually on a different branch than the one pulled by the agent.
From the log file:
repository = https://gitlab.com/data_science_team/PROJECT_NAME
branch = MYBRANCH
SOMENUMBER IP### DEBUG
warning: redirecting to https://gitlab.com/data_science_team/PROJECT_NAME.git/
Hi, the URL contains some details which I wouldn't like to share on this thread. Can I send it to one of you in a private message?
In the child dataset task I see the following:
ARTIFACTS -> STATE: Dataset state
Files added/modified: 1 - total size 518.78 MB
Current dependency graph: {"0385db....": [], ()"94f4....": ["0385db..."]}
The child task is 94f4.. and the parent task is "0385db...", but what does the () mean?
Also tried to connect to the dataset from the CLI and received a connection error:
What will happen if I disable the cache? Is there a way to find out which experiment is hung and why, in order to avoid this?
As it seems, torch.save() just saves to disk, so there is no need for a (server) parent path, just the name of the file - in my case: 'my_model.pt'.
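A minimal sketch of what I mean (placeholder model; the bare file name is resolved relative to the working directory):
```
import torch

model = torch.nn.Linear(4, 1)                  # placeholder model
torch.save(model.state_dict(), "my_model.pt")  # writes to local disk, no parent path needed

state = torch.load("my_model.pt")              # reload from the same relative path
model.load_state_dict(state)
```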
Thanks for your help CostlyOstrich36
And afterwards, I get the following output, which continues for 300 iterations without any further metric reports
Clearing my cookies solved the issue, Thanks 🙂
Another question: is there a way to group Dataset tasks together (i.e. redefine their parent) after the tasks have been finalized? In the same context: is there a way to change the dependency graph in the ClearML dashboard after task creation and finalization?
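For reference, the only way I know of to define the parent is at creation time; a minimal sketch of that path (the names and parent ID are placeholders), which is what I'd like to be able to change after finalization:
```
from clearml import Dataset

parent = Dataset.get(dataset_id="<parent_dataset_id>")   # existing, finalized dataset

child = Dataset.create(
    dataset_name="my_child_dataset",      # placeholder
    dataset_project="my_project",         # placeholder
    parent_datasets=[parent.id],          # this is what builds the dependency graph
)
child.add_files("path/to/new/files")
child.upload()
child.finalize()
```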
Where should I look to see this metric? In the Scalars tab?