packages are updated, and I don't know which python version I get; plus, changing the python version of the OS is not really recommended
Wait I'm confused, this is inside a container, no?
and the python version running my code should not depend on the python version running the clearml-agent (especially for experiments running in containers)
Generally speaking you are correct, but some packages will not have the same version for all python versions
Specifically in this case I think...
Hmm, check if this one works:

```python
optimizer._get_child_tasks_ids(
    parent_task_id=optimizer._job_parent_id or optimizer._base_task_id,
    order_by=optimizer._objective_metric._get_last_metrics_encode_field(),
    additional_filters={'page_size': int(top_k), 'page': 0}
)
```

If it does, let's PR it as a dedicated function
I figured out the problem...
Nice!
Unfortunately, the hyperparameters in the configuration object seem to take precedence over the hyperparameters in the Hyperparameters section
Hmm, what do you mean by that? How did you construct the code itself? (you should be able to "prioritize" one over the other)
Regarding this, does this work if the task is not running locally and is being executed by the trains agent?
This line: "if task.running_locally():" makes sure that when the code is executed by the agent it will not reset its own requirements (the agent updates the requirements/installed_packages after it installs them from the requirements.txt, so that later you know exactly which packages/versions were used)
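A minimal sketch of that pattern (assuming `task.set_packages()` accepts a requirements file path; the file name here is a placeholder):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="requirements demo")

# Only override the requirements when running locally; when the agent
# executes the task it records the actually-installed packages itself,
# so we must not reset them from here.
if task.running_locally():
    # hypothetical: point the task at a local requirements.txt
    task.set_packages("requirements.txt")
```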
And when you retrieve just this file, does it work?
(Maybe the file is corrupted for some reason?)
By the way, will downloading still happen if the dataset is available in the cache folder?
If it is cached, then there is no need to re-download
Hi @<1576381444509405184:profile|ManiacalLizard2>
Yeah that should work, assuming credentials are set in your clearml.conf
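For reference, the credentials section of clearml.conf looks roughly like this (the server URLs and keys below are placeholders):

```
api {
    web_server: https://app.clear.ml
    api_server: https://api.clear.ml
    files_server: https://files.clear.ml
    credentials {
        "access_key" = "YOUR_ACCESS_KEY"
        "secret_key" = "YOUR_SECRET_KEY"
    }
}
```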
RC is out, SmugSnake6 please try with `pip install clearml==1.7.2rc1`
DepressedChimpanzee34 what would be easier, curl or python?
I think CostlyOstrich36 managed to reproduce?!
WickedGoat98 sure that will not be complicated:
try something along the lines of:

```yaml
agent:
  networks:
    - backend
  container_name: clearml-agent
  image: allegroai/clearml-agent:latest
  restart: unless-stopped
  privileged: true
  environment:
    CLEARML_HOST_IP: ${CLEARML_HOST_IP}
    CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-}
    CLEARML_API_HOST: ${CLEARML_API_HOST:-}
    CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-}
    CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-}
    ...
```
I want to keep the above setup; the remote branch that will track my local will be on `fork`, so it needs to pull from there. Currently it recognizes `origin`, so it doesn't work because the agent then can't find the commit.
So you do not want to push the change set ?
You can basically add the entire change set (uncommitted changes) from the last pushed commit.
In your clearml.conf, set store_code_diff_from_remote: true
https://github.com/allegroai...
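If I remember the conf layout correctly, that would be something like (a sketch; nesting per the default clearml.conf):

```
sdk {
    development {
        # store the uncommitted diff against the last *pushed* commit,
        # instead of against the latest local commit
        store_code_diff_from_remote: true
    }
}
```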
Since pytorch is a special example (the agent will pick the correct pytorch based on the installed CUDA), the agent will first make sure the file is downloaded, and then pass the resolving to pip to decide if it is necessary to install. (Bottom line, we downloaded the torch for no reason, but it is cached so no real harm done.) It might be that the second package needs a specific numpy version... this resolving is done by pip, not the agent specifically. Anyhow --system-site-packages is applicable o...
Hi @<1562973095227035648:profile|ThoughtfulOctopus83>
The host should be just the host name, no https prefix, I'm assuming that's the issue
I think a prefix would be great. It can also make it easier for reporting scalars in general
Actually those are "supposed" to be collected automatically by pytorch and reported by the master node.
currently we need a barrier to sync all nodes before reporting a scalar which makes it slower.
Also "should" be part of pytorch ddp
It's launched with torchrun
I know there is an effort to integrate with torchrun (the under-the-hood infrastructure); I'm not sure where it stands...
Hi UnsightlySeagull42
Basically you can get the agent to always add additional arguments for the docker run, such as -v for mounting:
https://github.com/allegroai/clearml-agent/blob/948fc4c6ce1ecf33a74619ad570d69b8188f6db9/docs/clearml.conf#L133
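For example, something like this in the agent section of clearml.conf (a sketch; the mount path is just an illustration):

```
agent {
    # extra arguments appended to every `docker run` the agent executes
    extra_docker_arguments: ["-v", "/host/data:/data"]
}
```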
Oh, yes, that might be (threshold is 3 minutes if no reports), but you can change that: `task.set_resource_monitor_iteration_timeout(seconds_from_start=10)`
What do you mean, the same env for all components? If they are using/importing exactly the same packages and using the same container, then yes, it could
LazyTurkey38, ohh I think you are correct
it should be:

```python
# patch the Task and actually send it for execution
if Task.running_locally():
    # this will verify all auto repo detection and python is done.
    task.close()
    # so that we can edit the task
    task.reset()
    # update the repo
    task.update_task(task_data={'script': {'branch': 'new_branch', 'repository': 'new_repo'}})
    # now to actually enqueue the Task
    Task.enqueue(task, queue_name='default')
```

wdyt?
It should preserve the order, as the order of the update back (i.e. when executed by the agent) is the same as the order of the keys (obviously py3.7+ only, because it creates a dict, not an OrderedDict)
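A quick illustration of the py3.7+ dict behavior:

```python
# Python 3.7+ guarantees insertion order for plain dicts, so the keys
# come back in the same order they were written:
params = {"lr": 0.01, "batch_size": 32, "epochs": 10}
print(list(params))  # ['lr', 'batch_size', 'epochs']
```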
multiple machines and reporting to the same task.
Out of curiosity, how do you launch it on multiple machines?
reporting to the same task.
So the "funny" thing is, they all report on top of (overwriting) each other...
In order for them to report individually, it might be that you need multiple Tasks (i.e. one per machine)
Maybe we could somehow prefix with the rank on the cpu/network etc.?! Or should it be a different "title", wdyt?
SmarmySeaurchin8 regarding the original question:

```python
task.set_project(project_id)
```

and `Task.get_projects()` to get all the project names/ids
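Something along these lines (a sketch; the project name is a placeholder, and I'm assuming the returned project objects expose `.name`/`.id`):

```python
from clearml import Task

# `task` is an existing Task object (e.g. from Task.init or Task.get_task).
# Look up the target project's id by name, then move the task into it.
projects = Task.get_projects()
project_id = next(p.id for p in projects if p.name == "My Project")
task.set_project(project_id)
```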
OddAlligator72 what you are saying is, take the repository / packages from the runtime, aka the python code calling the "Task.create(start_task_func)" ?
Is that correct ?
BTW: notice that the execution itself will be launched on other remote machines, not on this local machine
You actually have to login/ssh under said user, have another dedicated mountpoint and spin the agent from that user.
This is odd, it says 1.0.0, but then it was updated weeks ago ...
@<1657918706052763648:profile|SillyRobin38> out of curiosity did you compare performance of tensorrt-llm vs vllm ?
(the jury is still out on that, just wondered if you had a chance)
Why does my task execution freeze after pip installation (running agent in foreground mode)?
Hi AdventurousButterfly15
Are you running in agent docker mode or venv mode ?
What do you mean freeze? Do you see anything in the Task console log in the UI? What's the host OS?
yea, does the enterprise version have more functionality like this?
yes, all sorts of bits and pieces for easier DevOps / K8s etc.
I got everything working using the default queue. I can submit an experiment, and a new GPU node is provisioned, all good
Nice!
My next question, how do I add more queues?
You can create new queues in the UI and spin a new glue for the queue (basically think of a queue as an abstraction for a specific type of resource)
Make sense ?
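For example, spinning a k8s glue instance on the new queue might look like this (a sketch, assuming the `k8s_glue_example.py` entry point from the clearml-agent repo; the queue name is a placeholder):

```
# one glue instance per queue / resource type
python k8s_glue_example.py --queue gpu_queue
```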