The easiest is to pass an entire trains.conf file.
available agent, i.e. not running anything else.
I mean how long would instance 1 wait until instance 2 of the experiment is up and running?
In other words, what happens if all the nodes/agents are working and we still "need" an additional instance.
This is basically like "pre-allocating" the nodes, only they wait in real-time until the additional node joins them.
Agent A pulls the 3-node Task, the Task clones itself (Task B) and enqueues it on a "very high priority queue". Task A waits until Task B is ru...
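Roughly, that pattern could look like this (a sketch only; the queue name and the status check are illustrative, not an official recipe):

from clearml import Task

# Running inside "Task A" on Agent A
task_a = Task.current_task()
# Clone the running task and enqueue the clone on a high-priority queue
task_b = Task.clone(source_task=task_a)
Task.enqueue(task_b, queue_name="very_high_priority")
# Block until the clone is actually picked up and running on another agent
task_b.wait_for_status(status=(Task.TaskStatusEnum.in_progress,))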
I was trying to do exactly as you mentioned, setting the environment variable before any trains import, but it didn't work
In your entry point script (even if you do not call trains / Task.init), add:
import os
os.environ['TRAINS_CONFIG_FILE'] = '~/my_new_trains.conf'
import trains
Then when you actually import trains, everything is already set and it will not read the configuration again.
Make sense ?
Hi SmallDeer34
The clearml-agent has its own clearml.conf file; there you should put the S3 credentials and they will be passed to any Task the agent executes:
https://github.com/allegroai/clearml-agent/blob/176b4a4cdec9c4303a946a82e22a579ae22c3355/docs/clearml.conf#L234
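For reference, the relevant section of the agent's clearml.conf looks roughly like this (the keys and region here are placeholders):

sdk {
    aws {
        s3 {
            key: "<AWS_ACCESS_KEY_ID>"
            secret: "<AWS_SECRET_ACCESS_KEY>"
            region: "us-east-1"
        }
    }
}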
How so? They are all in one place: the creation of the venv is transparent, and the packages that are there are everything you have in the docker, plus the ability to override them from the UI.
What am I missing here ?
SubstantialElk6 I just executed it, and everything seems okay on my machine.
Could you pull the latest clearml-agent from the github and try again ?
EDIT:
just try to run:
git clone https://github.com/allegroai/clearml-agent.git
cd clearml-agent
python examples/k8s_glue_example.py
ShaggyHare67 I'm just making sure I understand the setup:
First "manual" run of the base experiment. It creates an experiment in the system, you see all the hyper parameters under General section. trains-agent
running on a machine HPO example is executed with the above HP as optimization paamateres HPO creates clones of the original experiment, with different configurations (verified in the UI) trains-agent executes said experiments, aand they are not completed.But it seems the paramete...
In the UI, find the task (just search for the Task ID, it will find it), then right-click it and select "reset"
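If you'd rather reset from code, a rough sketch (the task ID is a placeholder):

from clearml import Task

# Fetch the task by its ID and reset it back to draft state
task = Task.get_task(task_id="<task_id>")
task.reset()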
Let me know if I understand you correctly, the main goal is to control the model serving, and deploy to your K8s cluster, is that correct ?
DeterminedToad86
So based on the log it seems the agent is installing:
torch from https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl
and torchvision from https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
See in the log:
Warning, could not locate PyTorch torch==1.6.0 matching CUDA version 110, best candidate 1.7.0
But torchvision is downloaded from the cuda 11 folder...
I...
The driver script (the one that initializes the models and the training sequence) was not in the git repo; besides that one, everything is.
Yes, there is an issue when you have both a git repo and a totally uncommitted file, since ClearML can store either a standalone script or a git repository; the mix of the two is not actually supported. Does that make sense?
Hi SubstantialElk6
No need for that, you can use the helm chart (or spin them once with kubectl); then they take care of scheduling by themselves.
You can also use the k8s glue (basically spinning kubernetes pods automatically for you, based on the Tasks that you push into the ClearML queue)
https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py
In short, two possible deployments:
1. Static k8s pod running the agent (then the agent runs all the experiments inside t...
Could you amend the original snippet (or verify that it also produces plots in debug samples) ?
(Basically I need something that I can run 🙂)
Go to https://demoapp.trains.allegro.ai/profile
You should see something like 0.16.2-123
Hi ZanyPig66
I used tensorboard as clearml claims to auto-capture tensorboard outputs, but it was a no go.
The auto TB logging should work out of the box, where is it failing ?
Also, task = Task.current_task()
Why aren't you using Task.init in the original script?
The idea is that you run your code on your machine (where the environment works), ClearML auto detects code + python packages + args etc.
Then you clone it in the UI and launch it on a remote machine.
What am I missing ...
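For completeness, a minimal sketch of the intended flow (the project/task names and the TensorBoard usage are illustrative):

from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# Task.init at the top of the script enables auto-detection of the repo,
# packages, args, and TensorBoard reports
task = Task.init(project_name="examples", task_name="tb_auto_logging")

writer = SummaryWriter()
for step in range(10):
    # these scalars should be auto-captured by ClearML
    writer.add_scalar("loss", 1.0 / (step + 1), step)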
AntsyElk37
and when I try to use --output-uri I can't pass true, because obviously I can't pass a boolean, only strings
Hmm, that sounds right. I think we should fix that, so that when using --output-uri true the value that is passed is actually True, not the string "true".
Regarding the issue itself:
Are you saying --skip-task-init is being ignored, and it always adds the Task.init call? You can also pass --output-uri https://files.clear.ml (which is the same as True), ...
...I'm not sure I follow; clearml-task is designed to always be used so that at the end the agent will be running the Task. What am I missing?
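For reference, a typical clearml-task invocation with --output-uri looks roughly like this (the project, queue, and script names are placeholders):

clearml-task --project examples --name remote_run --script train.py --queue default --output-uri https://files.clear.ml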
ChubbyLouse32 could it be the configuration file is not passed to the agent machine itself ?
(Were you able to run anything against this internal server? I mean to connect to it from code, clearml / clearml-agent?)
BTW: for future reference, if you set the ulimit in bash, all processes created after that should have the new ulimit
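For example (a shell sketch; the limit value is just an illustration):

# raise the open-file limit for this shell, then start the agent;
# the agent and everything it spawns inherit the new ulimit
ulimit -n 65535
clearml-agent daemon --queue default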
What's the clearml version? Is this with the latest from GitHub?
... if we have direct access to the Kubernetes worker when we run K8S glue?
Correct, if you have a direct access to the Node (on your k8s cluster) from your laptop (assuming the clearml-session is running from the laptop), everything should work
is there a way that i can pull all scalars at once?
I guess you mean from multiple Tasks? (If so, then the answer is no, this is on a per-Task basis.)
Or, can i get experiments list and pull the data?
Yes, you can use Task.get_tasks to get a list of task objects, then iterate over them. Would that work for you?
https://clear.ml/docs/latest/docs/references/sdk/task/#taskget_tasks
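Something like this sketch (the project name is a placeholder):

from clearml import Task

# Iterate over the tasks in a project and pull the reported scalars per task
tasks = Task.get_tasks(project_name="examples")
for t in tasks:
    scalars = t.get_reported_scalars()
    print(t.id, list(scalars.keys()))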
Error 101 : Inconsistent data encountered in document: document=Output, field=model
Okay, this points to a migration issue from 0.17 to 1.0.
First try to upgrade to 1.0, then to 1.0.2.
(I would also upgrade a single apiserver instance first; once it is done, you can spin the rest.)
Make sense ?
BTW: I think we had a better example, I'll try to look for one