Regarding this, does this work if the task is not running locally and is being executed by the trains agent?
This line: "if task.running_locally():" makes sure that when the code is executed by the agent it will not reset its own requirements (the agent updates the requirements/installed_packages after it installs them from the requirements.txt, so that later you know exactly which packages/versions were used)
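For illustration, a minimal sketch of that guard (the project/task names are placeholders, and the body of the if is whatever requirements handling you do locally):

from trains import Task

task = Task.init(project_name="examples", task_name="requirements guard")

if task.running_locally():
    # this branch only runs on the local machine, never under trains-agent,
    # so the agent's recorded installed_packages are left untouched
    pass  # e.g. reset/override the task requirements here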
Regarding the missing packages, you might want to test with: force_analyze_entire_repo: false
https://github.com/allegroai/trains/blob/c3fd3ed7c681e92e2fb2c3f6fd3493854803d781/docs/trains.conf#L162
Or if you have a full venv you'd like to store instead:
https://github.com/allegroai/trains/blob/c3fd3ed7c681e92e2fb2c3f6fd3493854803d781/docs/trains.conf#L169
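For context, a sketch of how that flag sits in your local trains.conf (the nesting here is assumed to match the linked default file):

sdk {
    development {
        # when true, scan the whole repository for imports;
        # when false, only the entry point script is analyzed
        force_analyze_entire_repo: false
    }
}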
BTW:
What is the missing package?
CooperativeFox72 btw, are you guys running those 20 experiments manually or through trains-agent ?
CooperativeFox72 of course, anything trains related, this is the place 🙂
Fire away
Sure, ReassuredTiger98 just add them after the docker image in the "Base Docker image" section under the Execution tab. The same applies for setting it from code.
Example: nvcr.io/nvidia/tensorflow:20.11-tf2-py3 -v /mnt/data:/mnt/data
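From code it would look roughly like this (a sketch; the image and mount are just the example values above):

from clearml import Task

task = Task.init(project_name="examples", task_name="base docker example")
# the image plus any extra docker run arguments, passed as a single string
task.set_base_docker("nvcr.io/nvidia/tensorflow:20.11-tf2-py3 -v /mnt/data:/mnt/data")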
You can also always force extra docker run arguments by changing the clearml.conf on the agent itself:
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L121
LOL I see a meme waiting for GrumpyPenguin23 😉
How does it work with k8s?
You need to install the clearml k8s glue, and then on the Task request the container. Notice you need to preconfigure the glue with the correct Job YAML.
This is what I think you should end up with:
DiscreteParameterRange('General/dataset_url', values=["option 1 for url", "option 2 for url"])
If args['dataset_url'] is a list, you should just do values=args['dataset_url']
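Something along these lines (a sketch using the clearml import path; args is your own parsed arguments object):

from clearml.automation import DiscreteParameterRange

# accept either a single url or a list of urls
urls = args['dataset_url'] if isinstance(args['dataset_url'], list) else [args['dataset_url']]
param_range = DiscreteParameterRange('General/dataset_url', values=urls)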
What exactly do you mean by docker run permissions?
Yey! okay let me make sure we add this feature to the Task.init arguments so one can control it from code 🙂
should be the full path, or just the file name?
just file name, this is basically fname matching
So obviously that is the problem
Correct.
ShaggyHare67 how come the "installed packages" are now empty ?
They should be automatically filled when executing locally?!
Any chance someone mistakenly deleted them?
Regarding the python environment, trains-agent creates a new clean venv for every experiment. If you need, you can set in your trains.conf:
agent.package_manager.system_site_packages: true
https://github.com/allegroai/trains-agent/blob/de332b9e6b66a2e7c67...
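As a sketch, that section of the agent's trains.conf would look like:

agent {
    package_manager {
        # let the new venv see the packages already installed in the system python,
        # instead of installing everything from scratch
        system_site_packages: true
    }
}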
Yes, this seems like the problem, you do not have an agent (trains-agent) connected to your server.
The agent is responsible for pulling the experiments and executing them:
pip install trains-agent
trains-agent init
trains-agent daemon --gpus all
you should see your agent there
GrievingTurkey78
maybe since the package is not directly imported in my code it is possible to get a different version to what I have locally (?).
If these are derivative packages (i.e. imported by other packages) they are not automatically logged when executing the Task manually (in order to keep the "installed packages" as lean as possible on the one hand, while still specifying the important packages for you on the other).
That said, when the "trains-agent" executes the task it will store back...
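If you want a derivative package logged explicitly with a pinned version, one option is Task.add_requirements before Task.init (a sketch; the package name and version here are placeholders):

from trains import Task

# explicitly record a package that is only imported indirectly;
# must be called before Task.init()
Task.add_requirements("scikit-learn", "0.23.2")

task = Task.init(project_name="examples", task_name="explicit requirement")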
Hi BattyLizard6
does clearml orchestration have the ability to break gpu devices into virtual ones?
So this is fully supported on A100 with MIG slices. That said dynamic multi-tenant GPU on Kubernetes is a Kubernetes issue... We do support multi agents on the same GPU on bare metal, or over shared GPU instances over k8s with:
https://github.com/nano-gpu/nano-gpu-agent
https://github.com/intel/intel-device-plugins-for-kubernetes/tree/main/cmd/gpu_plugin#fractional-resources
http...
They all want to be ubuntu:gpu0. Any idea how I can randomize it? Setting the CLEARML_WORKER_ID env var somehow does not work
You should not have this entry in the conf file, the "worker_id" should be unique (and is based on the "worker_name" as a prefix). You can control it via env variables: CLEARML_WORKER_ID
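For example, when launching each agent (a sketch; the worker ids, gpu indices and queue name are placeholders):

CLEARML_WORKER_ID=ubuntu:gpu0 clearml-agent daemon --gpus 0 --queue default
CLEARML_WORKER_ID=ubuntu:gpu1 clearml-agent daemon --gpus 1 --queue default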
What sort of data would be stored in the venvs-build folder?
ClumsyElephant70 temporary (lifetime of the task execution) virtual environment, including the code etc. It is deleted and recreated for every new task launched (or restored from cache, if venvs_cache is enabled)
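If you want those venvs cached and restored, the cache is enabled on the agent side, roughly like this (a sketch; the exact keys and defaults live in the agent's clearml.conf):

agent {
    venvs_cache: {
        # uncommenting the path enables caching of the created virtual environments
        path: ~/.clearml/venvs-cache
    }
}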
WobblyCrab70 sure, put a load-balancer in between. AWS has a solution for that; basically use the AMI from the GitHub and ask IT to add https on the 8080/8008/8081 ports.
That should not be complicated to implement. Basically you could run 'clearml-task execute --id taskid' as the SageMaker cmd. Can you manually launch it on SageMaker?
Basic setup:
glue service per "job template" (e.g. k8s resources, for example cpu requirement, or gpu requirement)
queue per glue service, e.g. a cpu_machine queue and a 1xGPU queue
wdyt?
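To route a task to one of those queues from code, something like this (a sketch; the queue name matches the example above):

from clearml import Task

task = Task.init(project_name="examples", task_name="gpu job")
# stop local execution and enqueue the task for the glue/agent serving "1xGPU"
task.execute_remotely(queue_name="1xGPU", exit_process=True)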
clearml python version: 1.9.1
could you upgrade to 1.9.3 and try?
Minio is on the same server and the 9000 and 9001 ports are open for tcp
just to be clear, the machine that runs your clearml code can in fact access the minio on port 9000 ?
I tested with the latest and everything seems to work as expected.
BTW: regarding "bucket-name", make sure it complies with the S3 standard; as a test, try to change it to just "bucket" with no hyphens.
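Once the port is reachable, pointing the task output at the minio bucket looks roughly like this (a sketch; the host, port and bucket name are assumptions, and the matching key/secret go in clearml.conf under sdk.aws.s3):

from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="minio output",
    # non-AWS endpoints use the "s3://host:port/bucket" form
    output_uri="s3://my-minio-host:9000/bucket",
)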
It should be under script.diff:
'script': {'binary': '', 'repository': '', 'tag': '', 'branch': '', 'version_num': '', 'entry_point': '', 'working_dir': '', 'requirements': {'pip': ''}, 'diff': ''}
For some reason this is empty in your case, are you seeing it in the UI?
If you are querying the current task (i.e. running) it might not be there yet.
You can call this internal function that returns only after the repo detection is done: task._wait_for_repo_detection()
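Putting it together, a sketch (using the internal call mentioned above):

from clearml import Task

task = Task.init(project_name="examples", task_name="diff check")
# block until repository / uncommitted-changes detection has finished
task._wait_for_repo_detection()
# the uncommitted changes end up under the task's script section
print(task.export_task()['script']['diff'])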
Could not find a version that satisfies the requirement open3d==0.15.2 .. from versions: 0.10.0.0, 0.11.0, 0.11.1, 0.11.2, 0.12.0, 0.13.0)
This points to the agent installing with a different python version than the one you used to run the original code, I would guess python3.6.
It should have worked....
Can you run the examples from the repo and see if they work?
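If the version mismatch is confirmed, you can also point the agent at a specific interpreter in its conf file (a sketch; the binary path is just an example):

agent {
    # force the agent to create its venvs with this interpreter
    python_binary: /usr/bin/python3.8
}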
GrievingTurkey78 MagnificentSeaurchin79 do you guys want to start a PR branch we can all work on?
Your git execution needs this file, just like your machine does, to know where the server is and how to authenticate. You have to manually pass it to your git action.
Can you let me know if i can override the docker image using template.yaml?
No, you cannot.
But you can pass the OS environment variable "CLEARML_DOCKER_IMAGE" to set a different default one.
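For example, on the machine or pod running the agent (a sketch; the image name is a placeholder):

export CLEARML_DOCKER_IMAGE="nvcr.io/nvidia/pytorch:21.06-py3"
clearml-agent daemon --queue default --docker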