What do you mean by submodules?
She did not push. I told her she does not have to push before executing, as trains figures out the diffs.
When she pushes - it works
actually i was thinking about models that weren't trained using clearml, like pretrained models etc
and the machine I have is 10.2.
I also tried nvidia/cuda:10.2-base-ubuntu18.04 which is the latest
I'll check the version tomorrow. About the current_task call: I tried before and after, same result
Okay SuccessfulKoala55, problem solved! Indeed the problem was that there is no .git folder. I updated the necessary things to make the checkout action get the actual repo, and now it works
It wasn't really clear to me what "standalone" means, maybe it would be better to add a hint to the error, e.g.:
Error: Standalone (no .git folder found) script detected 'tasks/hp_optimization.py', but no requirements provided
That's awesome, but my problem right now is that I have my own cronjob deleting the contents of /tmp every interval, and it deletes the cfg files... So I understand I must skip deleting them from now on
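For the cleanup job, something like this could work. It's only a rough sketch: the `*.cfg` pattern is my guess at what to keep (check the real file names under /tmp first), `clean_tmp` is a made-up helper name, and it only touches regular files directly under the directory, not subfolders.

```python
import fnmatch
import os

# Hypothetical keep-list: adjust to the actual names of the cfg files in /tmp.
KEEP_PATTERNS = ("*.cfg",)


def clean_tmp(root, keep_patterns=KEEP_PATTERNS):
    """Delete regular files directly under root, skipping protected names.

    Returns the sorted list of file names that were removed.
    Subdirectories and symlinks are left untouched.
    """
    removed = []
    # Snapshot the directory listing first so we do not delete while iterating.
    for entry in list(os.scandir(root)):
        if not entry.is_file(follow_symlinks=False):
            continue
        if any(fnmatch.fnmatch(entry.name, pattern) for pattern in keep_patterns):
            continue  # protected file, e.g. a cfg file
        os.remove(entry.path)
        removed.append(entry.name)
    return sorted(removed)
```

The cronjob would then call `clean_tmp("/tmp")` instead of wiping everything.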
So how do I solve the problem? Should I just relaunch the agents? Because they can't execute jobs now
the ability to execute without an agent. i was just talking about this functionality the other day in the community channel
Okay so regarding the version - we are using 1.1.1
The thing with this error is that it happens sometimes, and when it happens it never goes away...
I don't know what causes it, but we have one host where it works okay, then someone else checks out the repo and tries and it fails with this error, while another guy can do the same and it will work for him
Thx DangerousDragonfly8 💪
I mean if I continue and build on the example in the docs, what will happen if the training task is completed, and then I get it and log to it? Will it be defined as running again?
The Task object has a method called Task.execute_remotely. Look it up here:
https://allegro.ai/docs/task.html#trains.task.Task.execute_remotely
If you want we can do live zoom or something so you can see what happens
-_- why isn't there a link to the source in the docs?
Manual model registration?
I manually deleted the allegroai/trains:latest image, that didn't help either
what if i want it to use ssh creds?
cool, didn't know about the PAT
Is there a more elegant way to find the process to kill? Right now I'm doing pgrep -af trains, but if I have multiple agents, I will never be able to tell them apart
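One possible way to tell them apart, assuming each agent was started with its own `--queue` value: parse the `pgrep -af` output and group the PIDs by queue. This is just a sketch, `agents_by_queue` is a made-up helper name and the sample command lines in the usage note are invented for illustration.

```python
def agents_by_queue(pgrep_output):
    """Map each --queue value to the PIDs whose command line mentions it.

    Expects text in the `pgrep -af` format: one "PID full-command-line" per line.
    Lines without a --queue argument are ignored.
    """
    result = {}
    for line in pgrep_output.splitlines():
        pid_text, _, cmdline = line.strip().partition(" ")
        if not pid_text.isdigit():
            continue
        args = cmdline.split()
        if "--queue" not in args:
            continue
        # Collect every queue name that follows --queue, up to the next option flag.
        for arg in args[args.index("--queue") + 1:]:
            if arg.startswith("-"):
                break
            result.setdefault(arg, []).append(int(pid_text))
    return result
```

Feeding it the text of `pgrep -af trains` (for example via `subprocess.run(["pgrep", "-af", "trains"], capture_output=True, text=True).stdout`) gives a dict such as `{"gpu": [101], "cpu": [202]}` for two agents on queues `gpu` and `cpu`, so you can pick the PID by queue name.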
So if I'm collecting from the middle ones, shouldn't the callback be attached to them?
I'll check if this works tomorrow