ok, I'll try 🙂
AgitatedDove14 shouldn't it be `while not an_optimizer.wait(timeout=1.0):` instead of `while an_optimizer.wait(timeout=1.0):` in the first code block?
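To illustrate the pattern I mean, here is a minimal sketch with a hypothetical `FakeOptimizer` standing in for ClearML's `HyperParameterOptimizer` (assuming `wait()` returns True once the optimization has finished and False when the timeout expires first):

```python
# Hypothetical stand-in for the real optimizer object; assumes wait() returns
# True when the optimization is done, False when the timeout expired first.
class FakeOptimizer:
    def __init__(self, rounds):
        self.rounds = rounds

    def wait(self, timeout=None):
        # pretend the optimization finishes after `rounds` calls
        self.rounds -= 1
        return self.rounds <= 0

an_optimizer = FakeOptimizer(rounds=3)
polls = 0
while not an_optimizer.wait(timeout=1.0):
    polls += 1  # still running: one could report progress / top experiments here
print(polls)  # → 2
```

With `while not ...` the loop body runs while the optimization is still in progress, which is what the progress-reporting example seems to intend.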
Can I do this to specify which worker should execute that task? `CLEARML_WORKER_NAME=<worker_name> clearml-agent execute --id <task_id>`
There is a git repo 🙂 my question was to check whether I understand it correctly. Thank you for the response :)
No. However, I see some of the running agents, but not all of them.
AgitatedDove14 do you know if it is possible not to open ports on the machines B_i where the agents reside?
SuccessfulKoala55 hmm, we are trying to do something like that and we are encountering problems. We are running a big hyperparameter optimization on 200 workers, and some tasks are failing (while with fewer workers they do not fail). The UI also has some problems with that. Maybe there are some settings that should be adjusted compared to the classic configuration?
version 1.8.1
No, there are no error messages. The behaviour is just very strange (or even incorrect)
Suppose that this is a task that is cloned:
```
base_task = replacement_task.create_function_task(
    func=some_func,                          # type: Callable
    func_name='func_id_run_me_remotely_nr',  # type: Optional[str]
    task_name='a func task',                 # type: Optional[str]
    # everything below will be passed directly to our function as arguments
    some_argument=message,
    some_argument_2=message,
    rand...
```
AgitatedDove14 suppose that we are doing some optimization task (parameter search). This is a task where, generally, we want to minimize some metric m, but it will be enough to have, say, 3 occurrences where m < THRESHOLD; when that happens, we stop the search (and free the resources, which may be needed for some further step).
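The stopping rule I have in mind can be sketched independently of ClearML (the threshold and the metric values below are made up for illustration):

```python
# Hypothetical early-stopping check: stop once the metric m has dropped below
# THRESHOLD in at least REQUIRED_HITS finished experiments.
THRESHOLD = 0.1
REQUIRED_HITS = 3

def should_stop(metric_values):
    hits = sum(1 for m in metric_values if m < THRESHOLD)
    return hits >= REQUIRED_HITS

print(should_stop([0.5, 0.05, 0.3, 0.2]))         # → False (only 1 hit)
print(should_stop([0.05, 0.3, 0.09, 0.4, 0.02]))  # → True (3 hits)
```

In a polling loop one would call something like this on the metrics of the finished experiments and then stop the optimizer and release the workers when it returns True.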
AgitatedDove14 in fact in our case we want to use simple strategies, RandomSearch is enough, but the problem is that we need to change the ranges dynamically
In fact, as I understand it, we need to write our own custom HyperParameterOptimizer, am I right?
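For the dynamic-range part, the core logic is small enough to sketch in plain Python (this is not the ClearML API, just the idea of drawing from ranges that can be mutated between samples):

```python
import random

# Parameter ranges that can be mutated while the search is running.
ranges = {"lr": (1e-4, 1e-1)}

def sample(ranges, rng):
    """Draw one random configuration from the current ranges."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

rng = random.Random(0)
first = sample(ranges, rng)

# Narrow the range mid-search, e.g. after seeing early results.
ranges["lr"] = (1e-3, 1e-2)
second = sample(ranges, rng)

print(1e-3 <= second["lr"] <= 1e-2)  # → True
```

The question is essentially whether a custom strategy class is the intended place to hook this mutation in.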
SuccessfulKoala55 We are encountering some strange problem. We are spinning N agents using script, in a loop
But not all agents are visible as workers (we check it both in UI, but also running workers_list = client.workers.get_all() ).
Do you think it is possible that too many of them are connecting at once, and that we could solve this by adding a delay between starting subsequent agents?
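To test that hypothesis, a staggered launch would be a small change to our loop; a sketch, where the `echo` is a placeholder for the real `clearml-agent daemon` command:

```shell
# Start agents one at a time with a short delay, so they don't all
# register with the server at once.
N=3       # number of agents (10 in our case)
DELAY=1   # seconds between launches
for i in $(seq 1 "$N"); do
  echo "starting agent $i"  # placeholder for: clearml-agent daemon --queue <queue> --detached
  sleep "$DELAY"
done
```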
Because it does not coincide with any specific actions.
SuccessfulKoala55 So, we have two problems:
Probably a minor one, but strange. We run some number of workers using the given compose file, attached in the .zip. We can do:
```
docker compose -f docker-compose-worker.yaml build
docker compose -f docker-compose-worker.yaml up
```
and in theory there should be 10 agents running, but frequently not all 10 are shown in the UI (for example, on the last run we got 3 of them). When we run `htop`, we can see 10 agents in our system. What is even more strange, those...
Yes, thank you! 🙂
btw. why do I need to give my git username/password to run it if I serve an agent locally?
SuccessfulKoala55 could we run a server with some verbose logging?
SuccessfulKoala55 How should I pass this variable? Do I need to create a file `apiserver.conf` in the folder `/opt/clearml/config` and write just `CLEARML_USE_GUNICORN=1` there? Do I need to restart the server after that?
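My current guess (unverified) is that, since `CLEARML_USE_GUNICORN` looks like an environment variable rather than a config-file option, it would go into the compose file's `environment` section for the apiserver, something like:

```yaml
# Sketch, assuming the standard ClearML server docker-compose layout.
services:
  apiserver:
    environment:
      CLEARML_USE_GUNICORN: "1"
```

Is that the right place, or is the `/opt/clearml/config` folder the intended mechanism?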
Do we even have an option to assign id to each agent? https://clear.ml/docs/latest/docs/clearml_agent/clearml_agent_daemon
I am using the UI and clicking "select all". If that calls the API server, then yes.
1. building from code: `pipe.add_step()`
2. not locally, but also not with the services queue: `pipe.set_default_execution_queue(DEFAULT_EXECUTION_QUEUE)`
Is there a need to use just services queue?
So it seems like this dictionary works with strings.
Yes, it is a good reason 🙂
Do you maybe know a tool that measures that during execution (to avoid watching `nvidia-smi` during the whole training)?
So suppose that a task T uses 27% of the GPU; does that mean we can spawn 3 agents on this GPU (suppose that we will give them only task T)? Does that make sense?
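The arithmetic behind that guess, for the record (assuming utilization adds linearly, which is optimistic since GPU memory and contention also matter):

```python
# Rough capacity estimate: if one run of task T uses 27% of the GPU,
# how many concurrent runs fit? (Assumes utilization adds linearly.)
task_gpu_util_pct = 27
max_concurrent = 100 // task_gpu_util_pct
print(max_concurrent)  # → 3
```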
Hmm, it is hard to specify the way
SuccessfulKoala55 Thank you for the response! Let me elaborate a bit to check if I understand this correctly.
We have a time-consuming task T based on optimization for parameters. We want to run hyperparameter optimization for T, suppose that we want to run it for 100 sets of parameters.
We want to leverage the fact that we have n machines to make the work parallel.
So for that we use https://clear.ml/docs/latest/docs/references/sdk/hpo_optimization_hyperparameteroptimizer/ , we run Agent...