Hi GrievingKoala83
Two tasks are created, but the training does not begin; both tasks stay in the running state indefinitely.
Can you print something after the `task.launch_multi_node(args.nodes)` call?
- I'm assuming the two Tasks are running and are blocked on the "Trainer" class
If `args.gpus=2` and `args.nodes=2` are specified, three tasks are created.
This is really odd, can you add some prints with the task ID and rank after the ...
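For example, a minimal sketch of the debug prints I have in mind (assuming the usual flow where `launch_multi_node()` returns the node configuration dict; project/task names are placeholders):
```
import os
from clearml import Task

task = Task.init(project_name="examples", task_name="multi-node-debug")
config = task.launch_multi_node(2)  # i.e. args.nodes in your script

# print the task id and rank so each node's log can be told apart
print("task id:", task.id)
print("launch_multi_node returned:", config)
print("RANK:", os.environ.get("RANK"), "NODE_RANK:", os.environ.get("NODE_RANK"))
```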
Any chance you can share the Log?
(feel free to DM it so it will not end up public)
Your account has 2FA enabled, so you must use a personal access token instead of a password.
I'm assuming you have created the personal access token and used it, not the password.
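For reference, the token goes wherever the password would, e.g. in the agent section of `clearml.conf` (values are placeholders):
```
agent {
    git_user: "my-username"
    git_pass: "my-personal-access-token"  # the PAT, not the account password
}
```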
- This then looks for a module called `foo`, even though it's just a namespace

I think this is the issue, are you using Python package namespaces?
(this is a PEP feature that is very rarely used, and I have seen it break too many times)
Assuming you have `from foo.mod import ...`, what are you seeing in `pip freeze`? I'd like to see if we can fix this, and better support namespaces.
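To illustrate what I mean by a namespace package (a sketch, all names hypothetical):
```
# PEP 420 implicit namespace package layout:
#
#   foo/            <- no __init__.py here, so "foo" is only a namespace
#       mod/
#           __init__.py
#
# the import works at runtime, but dependency analysis may only detect "foo",
# which is not an installable package on its own
from foo.mod import something  # "something" is a placeholder
```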
LovelyHamster1 what do you mean by "assume the permissions of a specific IAM Role"?
In order to spin up an EC2 instance (AWS autoscaler) you have to have the correct credentials; to pass those credentials you must create a key/secret pair for the autoscaler. There is no direct support for IAM Roles. Make sense?
I use a YAML config for data and model. Each of them would be a nested YAML (could be more than 2 layers), so it won't be a flexible solution and I would need to manually flatten the dictionary.
Yes, you are correct, the recommended option would be to store it with `task.connect_configuration` ;
its goal is to store these types of configuration files/objects.
You can also store the YAML file itself directly, just pass a `Path` object instead of a dict/string.
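A minimal sketch (file name and project/task names are placeholders):
```
from pathlib import Path

import yaml
from clearml import Task

task = Task.init(project_name="examples", task_name="config-demo")
# store the yaml file itself as a configuration object; the returned path
# points at the local (possibly remotely fetched) copy of the file
config_path = task.connect_configuration(Path("model.yaml"), name="model")
model_cfg = yaml.safe_load(Path(config_path).read_text())
```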
Was I right to put the credentials in `clearml.conf` on the machine I am starting the agent on?
AdventurousButterfly15 Yes exactly!
You should be able to see that in the Task's log (the entire configuration appears at the top of the log); can you see the git user there?
Yes EnviousStarfish54, the comparison is line by line, and everything is compared against the left experiment (as in any multi-experiment comparison, you have to set a baseline, which here is always the left column; note that you can reorder the columns and the comparison will update accordingly).
BTW:
```
In [4]: str('\.')
Out[4]: '\\.'

In [5]: str(('\.', ))
Out[5]: "('\\\\.',)"
```
This is just Python str casting.
Hi ResponsiveCamel97
Let me explain how it works: essentially it creates a new venv inside the docker, inheriting all the packages from the main system packages.
This allows it to use the installed packages if the versions match, and to upgrade/change them if you need, all without rebuilding a new container. Make sense?
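The relevant setting, for reference, is in the agent's `clearml.conf` (a sketch; I believe this is already the default when running in docker mode):
```
agent {
    package_manager {
        # let the venv created inside the container see the system packages
        system_site_packages: true
    }
}
```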
Nice! I'll see if we can have better error handling for it, or solve it altogether 🙂
And as far as I can see there is no built-in mechanism to load objects other than the model file inside the Preprocess class, right?
Well, actually this is possible. Let's assume you have another Model that is part of the preprocessing; then you could have:
```
def preprocess(self, body, state, collect_custom_statistics_fn=None):
    # lazily load the second model once, and cache it on the instance
    if not getattr(self, "_preprocess_model", None):
        self._preprocess_model = joblib.load(Model(model_id).get_weights())
    ...
```
Something like that should work.
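(Here `Model` is `clearml.Model` and `model_id` is the ID of that second model; caching it on `self` avoids reloading on every request.)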
PungentLouse55 I'm checking something here, you might have stumbled on a bug in parameter overriding. Will update here soon ...
```
if project_name is None and Task.current_task() is not None:
    project_name = Task.current_task().get_project_name()
```
This should have fixed it, no?
I can then programmatically choose which file to import with importlib. Is there a way to tell clearml programmatically to analyze the files, so it can build up the requirements correctly?
Sadly no 😞
It analyzes the running code; if it decides the script is not self-contained, it will analyze the entire repo ...
I just saw that `Task.create` takes ...

`Task.create` is Not `Task.init`. It is meant to allow you to create new Tasks (think Jobs) from ...
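A minimal sketch (repo/script values are placeholders):
```
from clearml import Task

# Task.create registers a new draft Task pointing at a repo/script,
# without executing anything locally (unlike Task.init)
task = Task.create(
    project_name="examples",
    task_name="remote-job",
    repo="https://github.com/me/my-repo.git",
    script="train.py",
)
```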
GiganticTurtle0 can you please add a github issue with feature request to clearml-agent? I think this is a great use case!
And the clearml-server version?
What do you have in the `.netrc` on the machine, in the `machine` section?
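For reference, a `.netrc` entry looks something like this (host and values are placeholders):
```
machine github.com
  login my-username
  password my-personal-access-token
```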
TrickySheep9 is this a conda package or a wheel you are installing manually ?
Hi UnevenDolphin73
If you "remove" the lock file the agent will default to pip.
You can hack it with uncommitted changes section?
Thanks for the details TroubledJellyfish71 !
So the agent should have automatically resolved this line: `torch == 1.11.0+cu113`
into the correct torch version (based on the CUDA version installed, or the CPU version if no CUDA is installed).
Can you send the Task log (console) as executed by the agent (and failed)?
(you can DM it to me, so it's not public)
Hmm, I just tested on the community version and it seems to work there. Let me check with the frontend guys. Can you verify it works for you on https://app.community.clear.ml/ ?
LOL AlertBlackbird30 had a PR and pulled it 🙂
Major release due next week; after that we will put a roadmap on the main GitHub page.
Anything specific you have in mind ?
I love the new docs layout!
Thank you and thank docusaurus, they rock!
Hi DepressedChimpanzee34
How do I reproduce the issue ?
What are we expecting to get there ?
Is that a Colab issue or a hyper-parameter encoding issue?
And when you retrieve just this file, does it work?
(Maybe the file is corrupted for some reason?)
Is this consistent on the same file? can you provide a code snippet to reproduce (or understand the flow) ?
Could it be two machines are accessing the same cache folder ?