
AgitatedDove14 Quite hard for me to try this right now, but I've validated that the relevant code segments are untouched between the versions (at least on the current master branch of the ClearML repo).
Well, it's making more sense but still quite ugly, hehe
It uses the API credentials generated by the trains dashboard.
This is the path:
/Remote/moshe/Experiments/trains_bs_pipe_new/ypi/OKAY/Try_That/baseline/evaluation_validation/results/images/bottom_scores/0.0_slot02_extracted_23_01__1035__1.png
I've run this 8 times: trains-agent --config-file /opt/trains/trains.conf daemon --detached --cpu-only --queue important_cpu_queue cpu_queue
The version is 0.16.2rc0 (a version Mushik gave me that supports a local conda env)
Hmm, I've changed my trains-server config to use a config file in a different location, and successfully set up the trains-agent on the second server. But I don't see any new worker created, why is that?
I think it should be treated as failed. I'm truly not convinced why aborting a task should be anything besides a user terminating unwanted behavior of the task (be it a bug, running with the wrong config, the task getting stuck, etc.).
SuccessfulKoala55 I found the temp files; they contain what is supposedly the worker id, which seems just fine
For example, HPO with early stopping: it would mark the Task as aborted
Why? The task should have completed successfully, how is this aborting?
If I change the file at the entry point (let's say I delete all of its content), how will trains behave when I try to clone and execute such a task?
I will try that.
In addition, I've seen that the file location of a task is saved. Does that mean that when rerunning said task (for example, cloning it and enqueuing it), trains will search for the file in the stored location? Or will it clone the repo at the given commit id and use the relative path to find this file?
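For reference, this is roughly how I clone and enqueue (a rough sketch with the trains SDK; the task id is a placeholder and the queue name is the one from my command above):

from trains import Task

# Fetch the original task by id (placeholder id)
original = Task.get_task(task_id="<original_task_id>")

# Clone it; the clone is created as a draft carrying the original's repo,
# commit id, entry point and uncommitted diff
cloned = Task.clone(source_task=original, name="clone of " + original.name)

# Enqueue the clone for a trains-agent listening on this queue
Task.enqueue(cloned, queue_name="important_cpu_queue")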
On another topic, I've just now copied a Task that ran successfully yesterday and tried to run it. It failed to run and I got an ERROR! Failed applying git diff, see diff above.
Why is that?
AgitatedDove14
The easiest example of the use case I'm describing: trying to run the full pipeline, but in this experiment I wish to try Batch Norm, which I haven't used in the previously executed Task. How can I do that without running this Task on its own? (Which is quite problematic for me since it runs as part of a pipeline, therefore using a DAG.)
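Something like this is what I'm after (a rough sketch with the trains SDK; the task id and the "use_batch_norm" parameter name are placeholders I made up for illustration):

from trains import Task

# Clone the Task that already ran as part of the pipeline (placeholder id)
original = Task.get_task(task_id="<pre_executed_task_id>")
cloned = Task.clone(source_task=original, name="baseline with batch norm")

# Override the hyperparameter on the draft clone (parameter name is made up)
cloned.set_parameter("use_batch_norm", True)

# Enqueue it on the same queue the pipeline uses
Task.enqueue(cloned, queue_name="important_cpu_queue")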
Hey, I've gotten this message:
TRAINS Task: overwriting (reusing) task id=24ac52461b2d4cfa9e672d9cd817962c
And I'm not sure why it's reusing the task instead of creating a new task id; the configuration was different, although the same Python file was run. Have you got any idea?
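If it helps, this is roughly the pattern I'd expect to need (a sketch; project and task names are placeholders). I assume the reuse_last_task_id argument of Task.init is what controls this:

from trains import Task

task = Task.init(
    project_name="my_project",    # placeholder
    task_name="my_experiment",    # placeholder
    reuse_last_task_id=False,     # always create a fresh task id
)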
Since my servers have a shared file system, the init process tells me that the configuration file already exists. Can I tell it to place it in another location? GrumpyPenguin23
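Something like this is what I had in mind (a sketch assuming the TRAINS_CONFIG_FILE environment variable is honored and can point at an alternate trains.conf; the path is a placeholder):

import os

# Point trains at a per-server config file before the SDK is imported
os.environ["TRAINS_CONFIG_FILE"] = "/opt/trains/server2/trains.conf"  # placeholder path

from trains import Task

task = Task.init(project_name="example_project", task_name="example_task")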
I'm confused. Why would it matter what my local code is when trying to replicate an experiment that has already run?
Also, between which files is the git diff performed? (I've seen the line diff --git a/.../run.py b/.../run.py but I'm not sure what a is and what b is in this context.)
Found it in the init docs 🙂