I am referring to something like what the Ray framework has: https://docs.ray.io/en/latest/ray-core/tasks.html#specifying-required-resources
Do we even have an option to assign an ID to each agent? https://clear.ml/docs/latest/docs/clearml_agent/clearml_agent_daemon
SuccessfulKoala55 We are encountering a strange problem. We are spinning up N agents using a script, in a loop.
But not all agents are visible as workers (we check it both in the UI and by running workers_list = client.workers.get_all()).
Do you think it is possible that too many of them are connecting at once, and that we can solve it by adding a delay between starting subsequent agents?
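To make the "delay between agents" idea concrete, here is a minimal sketch of a launcher loop. It assumes the agents are started with `clearml-agent daemon --queue <queue> --detached` and that the agent honours the `CLEARML_WORKER_ID` environment variable. The queue name, the `agent-NNN` naming scheme, and the 5 second delay are hypothetical values, not something confirmed in this thread.

```python
import os
import subprocess
import time
from typing import Callable, List


def build_agent_command(queue: str) -> List[str]:
    """Command line for one detached agent listening on `queue`."""
    return ["clearml-agent", "daemon", "--queue", queue, "--detached"]


def launch_agents(
    n: int,
    queue: str = "default",      # hypothetical queue name
    delay_sec: float = 5.0,      # assumed delay, tune it for your server
    launcher: Callable[..., object] = subprocess.Popen,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start n agents one by one, pausing between consecutive launches.

    Returns the worker IDs that were requested, so they can be compared
    with what the server reports afterwards.
    """
    worker_ids = []
    for i in range(n):
        worker_id = f"agent-{i:03d}"  # hypothetical naming scheme
        # Assumption: the agent reads its ID from CLEARML_WORKER_ID.
        env = dict(os.environ, CLEARML_WORKER_ID=worker_id)
        launcher(build_agent_command(queue), env=env)
        worker_ids.append(worker_id)
        if i < n - 1:
            sleep(delay_sec)  # no pause needed after the last agent
    return worker_ids
```

Calling `launch_agents(20)` would start 20 agents about 5 seconds apart. The returned IDs can then be compared against the result of `client.workers.get_all()` to see which agents did not register.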
Yes, thank you! 🙂
AgitatedDove14 do I need to have the repo that I am running under my own account? Even if it is a public repo, like the repo with your (clearml) examples?
SOURCE CODE
REPOSITORY: https://github.com/allegroai/clearml.git
BRANCH NAME: Latest in branch master
SCRIPT PATH: pytorch_matplotlib.py
WORKING DIRECTORY: examples/frameworks/pytorch
Hmm, it is hard to describe how it happens, because it does not coincide with any specific actions, or at least I can't identify any.
1. building from code: pipe.add_step()
2. not locally, but also not with the services queue: pipe.set_default_execution_queue(DEFAULT_EXECUTION_QUEUE)
Is there a need to use only the services queue?
The use case was that the server hosting the repo wasn't responding for a while, and I was wondering how to work around that. Thanks for the answer!
Commits that are not pushed to the repo
Yes, it is a good reason 🙂
Do you maybe know a tool that measures that during execution (to avoid watching nvidia-smi during the whole training)?
So, suppose that a task T uses 27% of the GPU. That means we can spawn 3 agents on this GPU (assuming we give them only task T). Does that make sense?
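The arithmetic behind that estimate, plus one way to sample the GPU without watching nvidia-smi by hand, can be sketched as below. It assumes `nvidia-smi` is on the PATH and uses its CSV query mode. The helper names are hypothetical, and capping the count by memory as well as by utilization is my own assumption, since utilization alone does not show whether three copies of T fit in GPU memory.

```python
import math
import subprocess
from typing import Dict, List

NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> List[Dict[str, float]]:
    """Parse 'util, mem_used, mem_total' lines, one line per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem_used, mem_total = (float(x) for x in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": mem_used, "mem_total_mib": mem_total}
        )
    return stats


def read_gpu_stats() -> List[Dict[str, float]]:
    """Take one sample from nvidia-smi; call this periodically during training."""
    out = subprocess.run(
        NVIDIA_SMI_QUERY, capture_output=True, text=True, check=True
    ).stdout
    return parse_gpu_stats(out)


def agents_per_gpu(peak_util_pct: float, peak_mem_mib: float, total_mem_mib: float) -> int:
    """How many copies of a task fit on one GPU, by utilization and by memory."""
    if peak_util_pct <= 0 or peak_mem_mib <= 0:
        raise ValueError("peak utilization and peak memory must be positive")
    by_util = math.floor(100.0 / peak_util_pct)       # 27% -> 3 copies
    by_mem = math.floor(total_mem_mib / peak_mem_mib)
    return min(by_util, by_mem)
```

With the 27% figure, `agents_per_gpu(27, 3000, 16000)` gives 3, where the 3000 and 16000 MiB memory numbers are made up for illustration. Using the peak values observed over a whole run, rather than a single sample, keeps the estimate conservative.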
version 1.8.1
No, there are no error messages. The behaviour is just very strange (or even incorrect)
Suppose that this is a task that is cloned:
` base_task = replacement_task.create_function_task(
func=some_func, # type: Callable
func_name=f'func_id_run_me_remotely_nr', # type:Optional[str]
task_name=f'a func task', # type:Optional[str]
# everything below will be passed directly to our function as arguments
some_argument=message,
some_argument_2=message,
rand...
So it seems like this dictionary works with strings
Can I do this to specify which worker should execute that task?
CLEARML_WORKER_NAME=<worker_name> clearml-agent execute --id <task_id>
SuccessfulKoala55 could we run a server with some verbose logging?
SuccessfulKoala55 hmm, we are trying to do something like that and we are encountering problems. We are doing a big hyperparameter optimization on 200 workers and some tasks are failing (while with fewer workers they do not fail). The UI also has some problems with that. Maybe there are some settings that should be changed compared to the default configuration?
SuccessfulKoala55 How should I pass this variable? Do I need to create a file apiserver.conf in the folder /opt/clearml/config and write just CLEARML_USE_GUNICORN=1 there? Do I need to restart the server after that?
SuccessfulKoala55 we did it through the default docker-compose file.
Is there a way to give more resources to the server to help it somehow?
SuccessfulKoala55 Thank you for the response! Let me elaborate a bit to check if I understand this correctly.
We have a time-consuming task T based on optimization for parameters. We want to run hyperparameter optimization for T, suppose that we want to run it for 100 sets of parameters.
We want to leverage the fact that we have n machines to make the work parallel.
So for that we use https://clear.ml/docs/latest/docs/references/sdk/hpo_optimization_hyperparameteroptimizer/ , we run Agent...
AgitatedDove14 suppose that we are doing some optimization task (parameter search). This is a task where generally we want to minimize some metric m, but it will be enough to have, say, 3 occurrences where m<THRESHOLD. When that happens, we stop the search (and free the resources, which may be needed for some further step).
Regarding this last question: I know that it is possible to set some budget, for example a number of seconds after which the optimization stops. But is it possible to specify a boolean condition for when the work should stop?
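One way to express such a boolean stop condition is a small callback object that counts results below the threshold and triggers a stop function once enough have been seen. This is only a sketch: `ThresholdStopper` is a hypothetical helper, not a ClearML class, and the callback signature assumes the optimizer reports each finished job's objective value in a form like the `job_complete_callback` shown in the ClearML HPO example.

```python
from typing import Any, Callable, Optional


class ThresholdStopper:
    """Call `stop_fn` once `required_hits` jobs have reported m < threshold."""

    def __init__(self, threshold: float, required_hits: int, stop_fn: Callable[[], None]):
        self.threshold = threshold
        self.required_hits = required_hits
        self.stop_fn = stop_fn
        self.hits = 0
        self.stopped = False

    def __call__(
        self,
        job_id: str,
        objective_value: Optional[float],
        objective_iteration: Optional[int],
        job_parameters: Any,
        top_performance_job_id: Optional[str],
    ) -> None:
        # Ignore jobs that reported no metric, and anything after the stop.
        if self.stopped or objective_value is None:
            return
        if objective_value < self.threshold:
            self.hits += 1
        if self.hits >= self.required_hits:
            self.stopped = True
            self.stop_fn()  # e.g. the optimizer's own stop method (assumption)
```

Under those assumptions, it would be wired up roughly as `optimizer.start(job_complete_callback=ThresholdStopper(THRESHOLD, 3, optimizer.stop))`. Whether already-running jobs are aborted immediately depends on the optimizer, which this sketch does not cover.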
SuccessfulKoala55 thank you for the response; what about the second part of question (stopping)?
AgitatedDove14 one more question regarding this issue
Is it possible to change the parameter space dynamically?
(dummy) example:
Our optimization is a task where we sample from [1,2,3] twice. When 3 is chosen twice, eliminate 3 from one sampling range, so that x1 is sampled from [1,2,3] and x2 from [1,2].
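In plain Python, the dummy example would behave like the sketch below. It only illustrates the desired behaviour and does not use any ClearML API. Whether an optimizer lets you change its ranges mid-run is exactly the open question, and all function names here are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def narrow_after_duplicate(
    sample: Tuple[int, int], x2_range: Sequence[int], value: int = 3
) -> List[int]:
    """Drop `value` from the second range once it has been drawn for both x1 and x2."""
    if sample == (value, value):
        return [v for v in x2_range if v != value]
    return list(x2_range)


def run_search(
    n_trials: int,
    choose: Callable[[Sequence[int]], int] = random.choice,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Sample (x1, x2) repeatedly, shrinking the x2 range when (3, 3) occurs."""
    x1_range = [1, 2, 3]
    x2_range = [1, 2, 3]
    history = []
    for _ in range(n_trials):
        sample = (choose(x1_range), choose(x2_range))
        history.append(sample)
        x2_range = narrow_after_duplicate(sample, x2_range)
    return history, x2_range
```

After the first (3, 3) draw, x2 is sampled from [1,2] for all remaining trials, while x1 keeps the full range.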