Nice debugging experience
Kudos on the work!
BTW, I feel weird adding an issue on their GitHub, but someone should; this generic setup will break all sorts of things ...
For setting up trains-server I would recommend the docker-compose route; it is very easy to set up, and you just need a single fixed compute instance. Details: https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md
With regards to the "low prio clusters": are you asking how they could be connected with the trains-agent, or whether running code that uses trains will work on them?
Hmm, you either need to run with sudo or make sure the running user has docker run permissions
I'll try to go with this option, I think it's actually perfect for my needs
Great!
I'm just curious how the trains server on different nodes communicates about the task queue
We start manually: we tell the agent to just execute the task (notice we never enqueued it). If all goes well, we will get to the multi-node part 🙂
Interesting, do you think you could PR a "fixed" version?
https://github.com/allegroai/clearml-web/blob/2b6aa6043c3f36e3349c6fe7235b77a3fddd[…]app/webapp-common/shared/single-graph/single-graph.component.ts
- This then looks for a module called `foo`, even though it's just a namespace
I think this is the issue, are you using Python package namespaces?
(this is a PEP feature that is really rarely used, and I have seen it break too many times)
Assuming you have `from foo.mod import ...`, what are you seeing in pip freeze? I'd like to see if we can fix this and better support namespaces
The problem is that the configuration is loaded at import time, so there is no "time" to pass anything other than an environment variable.
That said, if the only difference is the server config, you can use `Task.set_credentials`
Okay this seems correct...
Can you share both yaml files (server & serving) and env file?
Hi RattySeagull0
I'm trying to execute trains-agent in docker mode with conda as package manager, is it supported?
It should. That said, we really do not recommend using conda as a package manager (it is a lot slower than pip, and can create an environment that will be very hard to reproduce due to conda's internal "compatibility matrix", which might change from one conda version to another)
"trains_agent: ERROR: ERROR: package manager "conda" selected, but 'conda' executable...
Merged, is it working for you now?
Yep, everything (both conda and pip)
The only downside is that you cannot see it in the UI (or edit it).
You can now do:
data = {'datatask': 'idhere'}
task.connect(data, 'DataSection')
This will create another section named "DataSection" on the configuration tab; then you will be able to see/edit the input Task.id
JitteryCoyote63 what do you think?
I see TrickyFox41, try the following:
--args overrides="param=value"
Notice this will change the Args/overrides argument that will be parsed by hydra to override its params
Like this. But when I am cloning the pipeline and changing the parameters, it runs with the default parameters given when the pipeline was first run
Just making sure: you are running the cloned pipeline with an agent, correct?
What is the clearml version you are using?
Is this reproducible with the pipeline example ?
RobustRat47 I think you have to use the latest clearml package for that (1.6.0)
SmarmySeaurchin8 it could be a switch; the problem is that when you have automatic stopping flows, they will abort a task, which is legitimate (i.e. it should not be considered failed)
How come you have aborted tasks in the pipeline? If you want to abort the pipeline, you need to first abort the pipeline Task, then the tasks themselves.
is there a way that i can pull all scalars at once?
I guess you mean from multiple Tasks? (If so, then the answer is no; this is on a per-Task basis)
Or, can i get experiments list and pull the data?
Yes, you can use Task.get_tasks to get a list of task objects, then iterate over them. Would that work for you?
https://clear.ml/docs/latest/docs/references/sdk/task/#taskget_tasks
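A minimal sketch of iterating over tasks and collecting their scalars. The nested dict shape used here (title → series → x/y lists) mirrors what `task.get_reported_scalars()` returns, as far as I recall; the `sample` payload below is hand-made for illustration, not real data:

```python
def collect_scalars(scalars):
    """Flatten {title: {series: {'x': [...], 'y': [...]}}} into
    a list of (title, series, x, y) rows."""
    rows = []
    for title, series_map in scalars.items():
        for series, data in series_map.items():
            for x, y in zip(data["x"], data["y"]):
                rows.append((title, series, x, y))
    return rows

# Real usage would look something like:
# from clearml import Task
# for t in Task.get_tasks(project_name="my-project"):  # project name is a placeholder
#     rows = collect_scalars(t.get_reported_scalars())

# Hand-made payload mimicking the get_reported_scalars() shape:
sample = {"loss": {"train": {"x": [0, 1], "y": [0.9, 0.5]}}}
rows = collect_scalars(sample)
```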
Also SoreDragonfly16, could you test whether the issue exists with trains==0.16.2rc0?
Hi @<1801424298548662272:profile|ConvolutedOctopus27>
I am getting errors related to invalid git credentials. How do I make sure that it's using credentials from local machine?
configure the git_user/git_pass (app key) inside your clearml.conf on the machine with the agent:
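For reference, a minimal sketch of the relevant `clearml.conf` section (the values shown are placeholders; use your own git username and an app key / personal access token as the password):

```
agent {
    # Credentials the agent uses when cloning repositories
    git_user: "my-git-user"   # placeholder
    git_pass: "my-app-token"  # placeholder: app key / personal access token
}
```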
Any chance you can test with the latest RC? 1.8.4rc2
@<1523707653782507520:profile|MelancholyElk85> what are you trying to change? Maybe there is a better way?
BTW: if you do step_base_task.export_task() you can take the parts you need from the dict and pass them to the task_overrides argument in add_step (you might need to flatten the nested arguments with '.', and thinking about it, maybe we should do that automatically?!)
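The flattening mentioned above can be sketched in plain Python (independent of the ClearML API; the `exported` dict here is a made-up stand-in for what `export_task()` would return):

```python
def flatten(d, prefix=""):
    """Flatten a nested dict into dot-separated keys,
    e.g. {'script': {'branch': 'main'}} -> {'script.branch': 'main'}."""
    flat = {}
    for key, value in d.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat

# exported = step_base_task.export_task()  # real call from the message above
exported = {"script": {"branch": "main", "repository": "https://example.com/repo.git"}}
overrides = flatten(exported)
# overrides now holds dot-separated keys suitable for task_overrides
```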
I think for it to work you have to have ssh running on the host machine (the socket client itself), no?
Of what task? I'm running lots of them and benchmarking
If you are skipping every installation it should be the same
because if you set CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 it will not install anything at all
This is why it's odd to me...
wdyt?
DilapidatedDucks58 use a full link, without the package name: git+
With pleasure 🙂
However, there is still a delay of approximately 2 minutes between the completion of setup,
Where is that delay in the log?
(btw: it seems your container is missing clearml-agent & git, installing those might add some time)
Hi SmarmyDolphin68
See some details here:
https://allegro.ai/docs/deploying_trains/trains_server_config/#network-and-security
Basically get an Azure load-balancer; it can also add https on top of the http connection.
Check the details on load-balancers here
https://allegro.ai/docs/deploying_trains/trains_server_config/#sub-domains-and-load-balancers
I think this is the one:
https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview
@<1523704157695905792:profile|VivaciousBadger56>
Is the idea here the following? You want to use inversion-of-control such that I provide a function `f` to a component that takes the above dict as an input. Then I can do whatever I like inside the function `f` and return a different dict as output. If the output dict of `f` changes, the component is rerun; otherwise, the old output of the component is used?
Yes exactly! This way you...
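A toy sketch of that rerun-vs-reuse behavior (every name here is hypothetical, purely to illustrate the control flow: the component is rerun only when its input dict, i.e. the upstream output, changes):

```python
def run_component(component, inputs, cache):
    """Re-run `component` only when its input dict changed;
    otherwise reuse the cached output."""
    key = tuple(sorted(inputs.items()))  # hashable fingerprint of the inputs
    if key not in cache:
        cache[key] = component(inputs)   # inputs changed -> rerun
    return cache[key]

calls = []
def f(d):
    """User-provided function: record the call, return a derived dict."""
    calls.append(d)
    return {"out": d["x"] * 2}

cache = {}
a = run_component(f, {"x": 1}, cache)  # runs f
b = run_component(f, {"x": 1}, cache)  # same input -> cached, f not called again
c = run_component(f, {"x": 2}, cache)  # input changed -> reruns f
```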