
Oh, that makes sense. This depends on how you set up the ClearML k8s glue (because the resource allocation is done by k8s). A good hack to limit the number of containers per GPU is to set a RAM limitation per pod; then k8s will know to limit the number of pods on the same GPU machine.
wdyt?
Are you suggesting the default "ubuntu:18.04" is somehow contaminated ?
This is an official Ubuntu container (nothing to do with ClearML), this is Very Very odd...
Hi ShallowArcticwolf27
First of all:
If the answer to number 2 is no, I'd loveee to write a plugin.
Always appreciated ❤
Now actually answering the Q:
Any torch.save (or any other framework save) will either register or automatically upload the file (or folder) in the system. If this is a folder, it will be zipped and uploaded; if a file, it is just uploaded to the assigned storage output (the clearml-server, any object storage service, or a shared folder). I'm not actually sure I...
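For example, a minimal sketch (project/task names and the output_uri destination are placeholders):
` # Any torch.save after Task.init is auto-captured; output_uri decides where
# the file is uploaded (clearml-server, object storage, or a shared folder)
import torch
from clearml import Task

task = Task.init(project_name='examples', task_name='save demo',
                 output_uri='s3://my-bucket/models')  # placeholder destination
model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), 'model.pt')  # registered + uploaded automatically `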
ExcitedFish86 this is a general "dummy agent" that pulls Tasks and executes them (no env created, no code cloned, as you suggested)
How does this work with HPO?
The HPO clones Tasks, changes arguments, pushes them into a queue, and monitors the metrics in real time. The missing part (from my understanding) was that the execution of the Tasks themselves required setup, and that you wanted multiple-machine support. To overcome that, I posted a dummy agent that just runs the Tasks.
(Notice...
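For context, a rough sketch of what such a dummy agent could look like (the queue name is a placeholder, and the actual in-process execution of the Task's script is elided; treat this as an approximation, not the exact snippet):
` # Rough sketch: pop Tasks from a queue and run them in-process
# (no env creation, no code cloning). The 'dummy' queue name is a placeholder.
import time
from clearml import Task
from clearml.backend_api.session.client import APIClient

client = APIClient()
queue_id = client.queues.get_all(name='dummy')[0].id

while True:
    response = client.queues.get_next_task(queue=queue_id)
    entry = getattr(response, 'entry', None)
    if not entry:
        time.sleep(5.0)
        continue
    task = Task.get_task(task_id=entry.task)
    print('running task', task.id)
    # ... execute the task's script here (e.g. a subprocess running
    # task.data.script.entry_point), then report status back `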
The easiest would be as an artifact (I think).
Let's assume you put it into a csv file (with pandas or manually)
To upload (from the pipeline Task itself):
`task.upload_artifact(name='summary', artifact_object='~/my/summary.csv')`
Then if you want to grab it from anywhere else:
`task = Task.get_task(task_id='HPO controller Task id here')
my_csv = task.artifacts['summary'].get_local_copy()`
If you want to store as dict it might be even easier:
` task.upload_artifact(name='summary', artifa...
ReassuredTiger98 yes this is odd:
also:
Warning, could not locate PyTorch torch==1.12 matching CUDA version 115, best candidate 1.12.0.dev20220407
Seems like it found a matching version and did not use it...
Let me check that
Or is this a feature of hyperdatasets and I just mixed them up.
Ohh yes, this is it. Hyper-Datasets are part of the UI (i.e. there is a tab with the HyperDataset query); Dataset usage is currently listed on the Task. Make sense?
because it should have detected it...
Did you see "Repository and package analysis timed out ..."
The issue only arises upon sending Images. (Both numpy, mpl and PIL)
BTW: they should appear under the Debug Samples tab in the Results section
Okay I think I found the confusion here (and it is confusing, but also very cool)
This line:
`metrics_names = {"metrics": ["name", "bias", "r2"]}
task.connect(metrics_names)`
When running in "manual mode" (i.e. not by an agent), it will take the dict metrics_names
and put it in the Task's Hyperparameters section.
But when executed by the Agent, it will do the opposite! It will take the data stored in the Task's Hyperparameters section and put it back into the `metrics_names` variable...
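To make the two directions concrete, a small sketch (project/task names are placeholders):
` # task.connect is two-way: manual runs write the dict to the server,
# agent runs read the stored values back into the dict
from clearml import Task

task = Task.init(project_name='examples', task_name='connect demo')
metrics_names = {"metrics": ["name", "bias", "r2"]}
task.connect(metrics_names)
# Manual run: the dict is stored in the Task's Hyperparameters section.
# Agent run: the stored Hyperparameters overwrite the dict, so UI edits
# flow back into the variable here.
print(metrics_names) `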
Hi ConvolutedBee40
If we deploy a task to clearml-server, will it automatically scale?
The way it works is with agents and the agent glue: basically using k8s as a resource allocator and the clearml-agent as orchestrator. Did that answer the question?
Hi @<1526371965655322624:profile|NuttyCamel41>
I think that the only way to actually get a huge number of API calls is with a lot of machines.
For example, regardless of the amount of console logs you print, it will only be a single call, as these are packaged every 2-10 seconds. The same goes for metric reporting etc.
On the free tier you can already test the amount of API calls; I think the mechanism is exactly the same
fyi: I would put this question in the channel
Have to get glue setup, which I couldn’t understand fully, so that’s a different topic
I suggest using the apply-template setup (basically you provide a Job/Service template, and it uses that to set up k8s jobs based on the Tasks coming in from the specific queue)
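Roughly, wiring the template into the glue looks like this (a sketch based on the clearml-agent k8s glue example; the queue name and template path are placeholders):
` # Sketch: the k8s glue pulls Tasks from a queue and creates one k8s Job per
# Task, based on the provided template. The template is also where per-pod
# resource limits (e.g. memory) would go.
from clearml_agent.glue.k8s import K8sIntegration

k8s = K8sIntegration(template_yaml='template.yaml')  # your Job/Service template
k8s.k8s_daemon('my_k8s_queue')  # poll the queue and schedule k8s Jobs `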
SubstantialElk6
The ~<package name with first letter dropped> == a.b.c
is a known conda/pip temporary install issue (a leftover from a previous package install).
The easiest way is to find the site-packages folder and delete the package, or create a new virtual environment.
BTW:
pip freeze will also list these broken packages
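If you are not sure where that folder is, a quick way to locate it:
` # Print the site-packages folder(s) so you can delete the leftover
# "~..." package directory manually
import site
print(site.getsitepackages()) `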
RoughTiger69
1. Move the files locally (i.e. based on the example, move folder b into folder a)
2. Create a new version with two parents ('a' and 'b')
3. Sync the local root folder ('a' in your case). Only the meta-data should change (because the referenced files are already in one of the datasets)
wdyt?
Hi JitteryCoyote63
The NVIDIA_VISIBLE_DEVICES environment variable is set automatically for the process the trains-agent spins, so from your code it is transparent: you can only "see" GPU 0.
(Obviously, when not using docker you can forcefully change the OS environment at runtime, but you should avoid that ;))
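You can verify it from inside the running task:
` # Inside a process spun by the agent, only the allocated GPU is visible
import os
print(os.environ.get('NVIDIA_VISIBLE_DEVICES'))  # e.g. '0' `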
Hi MinuteGiraffe30
Thank you so much for your awesome product! 😍
s address 10.68.167.10. I am able to send requests from all other virtual machines on the server to the address 10.68.167.10:8008. However, when I try to do this from my own computer connected to the corporate network via VPN, it fails to connect to 8008.
I'm assuming there is a firewall on the VPN connection itself (i.e. the VPN gateway) that blocks 8008 port, as you already tried curl to 8008 is...
The -m src.train is just the entry script for the execution; all the rest is taken care of by the Configuration section (whatever you pass after it will be ignored if you are using Argparse, as it auto-connects with ClearML).
Make sense?
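For instance, a minimal src/train.py sketch (argument names are placeholders):
` # src/train.py - argparse is auto-connected, so values edited in the UI
# override these defaults when the script is executed by an agent
import argparse
from clearml import Task

task = Task.init(project_name='examples', task_name='train')
parser = argparse.ArgumentParser()
parser.add_argument('--lr', type=float, default=0.001)
parser.add_argument('--epochs', type=int, default=10)
args = parser.parse_args()
print(args.lr, args.epochs) `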
with PipelineController, is there any way to avoid creating a new development environment for each step of the pipeline?
You are in luck, we are expanding the PipelineController to support functions, basically allowing you to run the step on the node running the entire pipeline. But I'm not sure this covers all angles of the problem.
My main question here is, who/how is the initial setup created by clearml-agent?
I would like to be more efficient and re-use that ...
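As for the function support mentioned above, a sketch of the direction (names are placeholders and the exact API may differ):
` # Sketch: pipeline steps defined as functions can run on the node executing
# the whole pipeline, avoiding a fresh environment per step
from clearml import PipelineController

def preprocess(source_url):
    return source_url.upper()

pipe = PipelineController(name='demo pipeline', project='examples', version='1.0')
pipe.add_function_step(
    name='preprocess',
    function=preprocess,
    function_kwargs=dict(source_url='hello'),
    function_return=['data'],
)
pipe.start_locally(run_pipeline_steps_locally=True)  # steps run in-process `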
Okay, this seems like a broken pip install (python3.6)
Can you verify it fails on another folder? (maybe it's a permissions thing; for example, if you run in docker mode, the permissions will be root, as the docker is creating those folders)
Ok, but when nvcc is not available, the agent uses the output from nvidia-smi, right? On one of my machines, nvcc is not installed, and in the experiment logs of the agent running there, agent.cuda = is the version shown with nvidia-smi
Already added to the next agent's version 😉
Hi DeliciousBluewhale87
My theory is that the clearml-agent is configured correctly (which means you see it in the clearml-server). The issue (I think) is that the Task itself (running inside the docker) is missing the configuration. The way the agent passes the configuration into the docker is by mapping a temporary configuration file into the docker itself. If the agent is running bare-metal, this is quite straightforward. If the agent is running on k8s (or basically inside a docker) th...
EnviousPanda91 the host checks if you have a .ssh folder on the machine; if you do, it will copy+mount it into the container, and delete the copy when the container is down.
Specifically, /tmp/clearml_agent.ssh.rbw8o0t7 is the copy of the .ssh folder that the agent created, and it is now mounting it into the container
- Set hashed passwords with `pass_hashed: true`
- Generate passwords using `python3 -c 'import bcrypt,base64; print(base64.b64encode(bcrypt.hashpw("password".encode(), bcrypt.gensalt())))'` (obviously, replace "password" with the actual password). The resulting b64 string should be placed in the password field for each user.
For example, assuming your password is "123456":
- bash: `python3 -c 'import bcrypt,base64; print(base64.b64encode(bcrypt.hashpw("123456".encode(), bcrypt.gensal...
Sure, this is basically a REST query 🙂
` from clearml.backend_api.session.client import APIClient
client = APIClient()
models = client.models.get_all(name='regexp', tags=['demo'], project=['project_id'])
print(models) `
RobustSnake79 let's assume that the trace figure above is probably too much to get into the WebUI; which simpler figures might still have value in your scenario?
Can I run it on an agent that doesn't have a GPU?
Sure, this is fully supported
When I run clearml-serving it throws an error: "please provide specific config.pbtxt definition"
Yes, this is a small file that tells the Triton server how to load the model:
Here is an example:
https://github.com/triton-inference-server/server/blob/main/docs/examples/model_repository/inception_graphdef/config.pbtxt