I manually deleted the allegroai/trains:latest image, that didn't help either
Very nice, thanks! I'm going to try the SA server + agents setup this week, let's see how it goes ✌
Committing that notebook with changes solved it, but I wonder why it failed
Mmm maybe, let's see if I get this straight
A static artifact is uploaded once; a dynamic artifact is an object I can keep changing during the experiment -> either way, at the end of the experiment this results in an object saved under a given name, regardless of whether it was dynamic or not?
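If I got it right, in code the distinction would look roughly like this - a minimal sketch, assuming the upload_artifact / register_artifact split works the way I described (the project and artifact names are made up):
```python
from clearml import Task  # `from trains import Task` on older versions
import pandas as pd

task = Task.init(project_name="examples", task_name="artifacts demo")

# Static artifact: a one-time snapshot, uploaded at the moment of the call
task.upload_artifact(name="config_snapshot", artifact_object={"lr": 0.01})

# Dynamic artifact: the DataFrame is registered and monitored, so changes
# made during the experiment keep the stored artifact up to date
df = pd.DataFrame({"epoch": [0], "loss": [1.0]})
task.register_artifact(name="live_metrics", artifact=df)

# Later in training: mutating df updates the registered artifact
df.loc[len(df)] = [1, 0.8]
```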
You should try:
trains-agent daemon --gpus device=0,1 --queue dual_gpu --docker --foreground
and if it doesn't work, try quoting the device list:
trains-agent daemon --gpus '"device=0,1"' --queue dual_gpu --docker --foreground
Okay SuccessfulKoala55 , problem solved! Indeed the problem was that there was no .git
folder. I updated the necessary things to make the checkout action fetch the actual repo, and now it works
It wasn't really clear to me what "standalone" means; maybe it would be better to add it to the error, e.g.:
Error: Standalone (no .git folder found) script detected 'tasks/hp_optimization.py', but no requirements provided
Okay so regarding the version - we are using 1.1.1
The thing with this error is that it happens sometimes, and when it happens it never goes away...
I don't know what causes it, but we have one host where it works okay; then someone else checks out the repo, tries it, and it fails with this error, while another person can do the same and it will work for him
Worth mentioning: nothing changed on our side before we executed this - it worked before the update, and now it breaks
I assume trains passes it as is, so I think the quoting I mentioned might work
Okay, so that is a bit complicated
In our setup the DSes don't really care about agents; the agents are managed by our MLOps team.
So essentially, if you imagine it, the use case looks like this:
A data scientist wants to execute some CPU-heavy task. The MLOps team supplied him with a queue name, and the data scientist knows that when he needs something heavy he pushes it there - the DS knows nothing about where it is executed; the execution environment is fully managed by the ML...
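From the DS side, the whole interaction is basically this - a minimal sketch, assuming the MLOps team handed over a queue called cpu_queue (the queue and project names are made up):
```python
from clearml import Task

task = Task.init(project_name="research", task_name="heavy preprocessing")

# Stop executing locally and enqueue this task to the managed queue;
# the DS never needs to know which machine ends up running it.
task.execute_remotely(queue_name="cpu_queue", exit_process=True)

# Everything below runs only on the agent that pulled the task
# ... the actual CPU-heavy work goes here ...
```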
thx TimelyPenguin76
Skimming over this, I can't find how to filter by project name or something similar
I only found project ID, and I'm not sure what that refers to - I have the project name
Hahahah thanks for the help SuccessfulKoala55 & CostlyOstrich36
I really do feel it would be nice to have the ability to easily configure the Cleanup Service to clean up only specific projects/tasks, as it's a common use case to have a project dedicated to debugging and the like
Bottom line: I want to edit the cleanup service code to only delete tasks under a specific project - how do I do that?
Example code? I didn't see an example anywhere of filtering by project name
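Something along these lines is what I have in mind - a rough sketch, assuming Task.get_tasks accepts a project_name filter the way the docs suggest (the project name "debug" and the status filter are made up for illustration):
```python
from clearml import Task

# Fetch tasks under a specific project by name instead of by project ID
tasks = Task.get_tasks(
    project_name="debug",
    task_filter={"status": ["completed"]},  # assumed filter syntax
)

for t in tasks:
    print(t.id, t.name)
    # in a cleanup-service variant, this is where the task would be deleted
```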
I'll try to work with that
Does that mean that the AWS autoscaler in trains manages EC2 auto scaling directly, without using the AWS built-in EC2 auto scaler?
When I ran the clearml-task --name ... --project ... --script ....
it failed saying no requirements were found
I'm quite confused... The package is not missing, it is in my environment, and executing tasks normally ( python my_script.py.... ) works
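If it helps, my understanding is that because the script isn't inside a git repo with a requirements.txt, there is nothing for the requirements analysis to pick up. Declaring the packages explicitly seems to be one way around it - a minimal sketch, assuming Task.add_requirements works as documented (the package names are just examples); clearml-task also appears to have --requirements / --packages flags for the same purpose, but check clearml-task --help to be sure:
```python
from clearml import Task

# Declare requirements explicitly, before Task.init, when there is no
# requirements.txt or git repo for the agent to derive them from
Task.add_requirements("pandas")          # latest version
Task.add_requirements("torch", "1.9.0")  # pinned version

task = Task.init(project_name="examples", task_name="standalone script")
```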
I mean, if I continue and build on the example in the docs, what will happen if the training task is completed, and then I get it and log to it? Will it be marked as running again?
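To make the question concrete, this is the flow I mean - a sketch with made-up project/task names, and whether the reporting reopens the task is exactly what I'm unsure about:
```python
from clearml import Task

# Grab the already-completed training task by name
trained = Task.get_task(project_name="examples", task_name="training")
print(trained.get_status())  # e.g. "completed"

# Report an extra scalar to it - does this flip the status back to "running"?
trained.get_logger().report_scalar(
    title="post-hoc eval", series="accuracy", value=0.91, iteration=0
)
```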