AgitatedDove14 I want to start with a CLI interface that allows me to add users to the trains server
What do you mean by submodules?
She did not push; I told her she does not have to push before executing, as trains figures out the diffs.
When she pushes - it works
does the services mode have a separate configuration for base image?
AgitatedDove14 just so you'd know, this is a severe problem that occurs from time to time and we can't explain why it happens... Just to remind, we are using a pipeline controller task, which at the end of the last execution gathers artifacts from all the child tasks and uploads a new artifact to the pipeline's task object. Then what happens is that Task.current_task() returns None for the pipeline's task...
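(For context, a minimal sketch of the gathering step described above, assuming the child task IDs are available to the controller; the ID list and artifact name are placeholders, while Task.current_task(), Task.get_task(), artifacts and upload_artifact() are regular trains SDK calls:)

    from trains import Task  # same API as the newer clearml package

    def gather_children_artifacts(child_task_ids):
        # child_task_ids is a placeholder list; in the real pipeline it would
        # come from the controller's step definitions.
        pipeline_task = Task.current_task()
        if pipeline_task is None:
            # The failure mode described above: inside the controller's last
            # step, current_task() unexpectedly comes back as None.
            raise RuntimeError("Task.current_task() returned None for the pipeline task")

        gathered = {}
        for child_id in child_task_ids:
            child = Task.get_task(task_id=child_id)
            # download each child's artifacts locally and keep the paths
            gathered[child.name] = {
                name: artifact.get_local_copy()
                for name, artifact in child.artifacts.items()
            }

        # upload the combined result as a new artifact on the pipeline's own task
        pipeline_task.upload_artifact(name="children_summary", artifact_object=gathered)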
TimelyPenguin76 this fixed it, setting detect_with_pip_freeze to true solves the issue
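(For reference, a sketch of where that setting lives, assuming the usual trains.conf layout with it under sdk.development:)

    # ~/trains.conf
    sdk {
        development {
            # take requirements from a full "pip freeze" instead of analyzing imports
            detect_with_pip_freeze: true
        }
    }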
Okay Jake, so that basically means I don't have to touch any server configuration regarding the file-server on the trains server. It will simply get ignored, and all I/O initiated by clients with the right configuration will cover for that?
why not use my user and group?
I mean usually it would read: if cached_file: return cached_file
I showed you this phenomenon in the UI photos in the other thread
Sorry... I still don't get it - when I'm launching an agent with the --docker flag or with the --services-mode flag, what is the difference? Can I use both flags? What does it mean?
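(A hedged sketch of combining the two flags as I understand them - the queue name and base image below are placeholders, --services-mode still runs each task in its own docker container, and --docker is what lets you pick the base image:)

    trains-agent daemon --queue services --docker ubuntu:18.04 --services-mode --foreground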
later today or tomorrow, I'll update
You should try trains-agent daemon --gpus device=0,1 --queue dual_gpu --docker --foreground
and if it doesn't work try quoting trains-agent daemon --gpus '"device=0,1"' --queue dual_gpu --docker --foreground
I guess what I want is a way to define environment variables in agents
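(One possible way in docker mode today, assuming agent.extra_docker_arguments is available in the agent's trains.conf; the variable name and value are placeholders:)

    agent {
        # pass extra arguments straight to "docker run"; here used to inject an env var
        extra_docker_arguments: ["-e", "MY_ENV_VAR=some_value"]
    }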
Okay so that is a bit complicated
In our setup, the DSes don't really care about agents; the agents are managed by our MLOps team.
So essentially, if you imagine it, the use case looks like this:
A data scientist wants to execute some CPU-heavy task. The MLOps team supplied him with a queue name, and the data scientist knows that when he needs something heavy he pushes it there - the DS doesn't know anything about where it is executed, the execution environment is fully managed by the ML...
But does it disable the agent? Or will the tasks still wait for the agent to dequeue?
Do you have any idea as to why that happens, SuccessfulKoala55?
In the larger context I'd look at how other object stores treat similar problems; I'm not that advanced in these topics.
But adding a simple force_download flag to the get_local_copy method could solve many cases I can think of. For example, I'd set it to true in my case, as I don't mind the times it will re-download when not necessary since it is quite small (currently I always delete the local file, but it looks pretty ugly)
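(A minimal sketch of that idea as a wrapper - force_download is only proposed here, not an existing parameter, and get_artifact_copy is a hypothetical helper; the deletion mirrors the workaround mentioned above:)

    import os
    from trains import Task

    def get_artifact_copy(task_id, artifact_name, force_download=False):
        # hypothetical helper around Artifact.get_local_copy()
        artifact = Task.get_task(task_id=task_id).artifacts[artifact_name]
        local_path = artifact.get_local_copy()
        if force_download and local_path and os.path.exists(local_path):
            os.remove(local_path)                   # drop the cached copy
            local_path = artifact.get_local_copy()  # fetch the artifact again
        return local_path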
I'll just exclude .cfg files from the deletion. My question is how to recover - must I recreate the agents, or is there another way?
Actually I was thinking about models that weren't trained using ClearML, like pretrained models etc.
can't remember, I just restarted everything so I don't have this info now