Yes. More exactly, I'm using gzip.open on them, but I don't believe it should matter.
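Roughly like this, if it helps (a minimal sketch; the file name is just a placeholder):

import gzip

# "results.pkl.gz" is a placeholder name; the point is just that the file is
# opened through gzip.open instead of plain open()
with gzip.open("results.pkl.gz", "rb") as f:
    data = f.read()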
I am not sure what you mean by verifying the API.
But maybe only one step in the DAG is flawed and I want to continue the rest of the pipeline as usual (apart from the branch of the flawed task).
I am not sure what you mean by automatic stopping flows, could you give an example?
Brutal sudo reboot, the agent is not up anymore
This is the path:
/Remote/moshe/Experiments/trains_bs_pipe_new/ypi/OKAY/Try_That/baseline/evaluation_validation/results/images/bottom_scores/0.0_slot02_extracted_23_01__1035__1.png
Oh I see, I think this will work. Thanks 🙂
I see, will keep that in mind. Thanks Martin!
Actually two machines with shared filesystem
I've investigated it some more. It isn't path-related as far as I can tell, as these same paths worked 2 weeks ago and a normal path doesn't work now either.
Otherwise, if you empty the installed packages and the requirements.txt is in one of the parent folders of the files that ran, trains should detect it automatically.
You can try copying all the contents of requirements.txt to the installed packages tab in the trains dashboard of your experiment (in the UI)
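Or, if you'd rather do it from code, something along these lines might work (a sketch, assuming your trains version already has Task.add_requirements; the package name and version are just examples):

from trains import Task

# force a specific package into the experiment's installed-packages list;
# must be called before Task.init ("scikit-learn"/"0.23.1" are just examples)
Task.add_requirements("scikit-learn", "0.23.1")
task = Task.init(project_name="examples", task_name="requirements demo")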
I'm confused. Why would it matter what my local code is when trying to replicate an already-run experiment?
Also, between which files is the git diff performed? (I've seen the line
diff --git a/.../run.py b/.../run.py
but I'm not sure what's a and what's b in this context)
Sure, but before that: it seems that the script path parameter (which I think you refer to as entry_point) is not relative to the base of the repo, as I expected it to be. Could that interfere?
Yeah, I understand that. But since overriding parameters of pre-executed Tasks is possible, I was wondering if I could change the commit ID to the current one as well.
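For reference, this is the kind of override I mean (a rough sketch, assuming the clone/enqueue helpers exist in this trains version; the task ID, parameter name, value and queue name are all placeholders):

from trains import Task

# take a task that already ran, clone it, and override one of its parameters
# (the task ID, parameter name/value and queue name are placeholders)
original = Task.get_task(task_id="aabbccddeeff00112233445566778899")
cloned = Task.clone(source_task=original, name="clone with new params")
cloned.set_parameter("learning_rate", 0.001)
Task.enqueue(cloned, queue_name="default")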
What do you mean by execute remotely? (I didn't really understand this one from the docs)
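Is it something like this? (just guessing from the name; the queue name below is a placeholder)

from trains import Task

task = Task.init(project_name="my project", task_name="my experiment")
# if I understand it, this stops the local run and queues the task for an agent
# ("default" is a placeholder queue name)
task.execute_remotely(queue_name="default")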
On another topic, I've just now copied a Task that ran successfully yesterday and tried to run it. It failed to run and I got:
ERROR! Failed applying git diff, see diff above.
Why is that?
I understand how this is problematic. This might require more thinking if you guys wish to support this.
But it still doesn't answer one thing, why when I cloned a previously successful experiment, it failed on git diff?
Furthermore, let's say I have 6 GPUs on a machine and I'd like trains to treat this machine as 2 workers (GPUs 0-2 and 3-5). Is there a way to do that?
Something else: if I want to designate only some of the GPUs of a worker, how can I do that?
Hey, I've gotten this message:
TRAINS Task: overwriting (reusing) task id=24ac52461b2d4cfa9e672d9cd817962c
And I'm not sure why it's reusing the task instead of creating a new task ID; the configuration was different, although the same Python file was run. Have you got any idea?
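In case it's relevant, the task is created more or less like this. Would passing reuse_last_task_id=False be the right way to force a new task ID on every run? (the project and task names here are placeholders)

from trains import Task

# ask trains not to reuse the previous task ID for the same project/task name
# (project/task names are placeholders)
task = Task.init(
    project_name="my project",
    task_name="my experiment",
    reuse_last_task_id=False,
)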