Hi @<1634001100262608896:profile|LazyAlligator31>
Is this because the code repo is being recreated in this directory?
Yes, this is correct.
Basically the entire code base + venv is installed there, to make sure it does not interfere with the "system" preinstalled environment
(it also allows for caching on the host machine)
Hi OutrageousSheep60
Do you mean something like:
https://github.com/allegroai/clearml/tree/master/examples/datasets
?
Hi @<1556812486840160256:profile|SuccessfulRaven86>
Every clearml-serving session (you can have multiple different "sessions") is assumed to be homogeneous, meaning it will serve the same models on as many nodes as possible, supporting multiple models per pod.
In your example I think the easiest is to create two serving sessions: one with a node selector for the 24GB node and another for the 16GB node, wdyt?
FreshReindeer51
Could you provide some logs ?
Is the clearml server a worker I can serve models on?
The serving is done by one of the clearml-agents.
Basically you spin an agent, then this agent is spinning the model serving engine container (fully managed).
(1) install and run clearml-agent, (2) run the clearml-session CLI to configure and spin up the serving engine
Correct, the serving Task ID is the clearml-serving session. It is the instance that holds all the information of this specific setup and its models.
Hi @<1523711619815706624:profile|StrangePelican34>
Hmm, I think this is missing from the docs, let me ping the guys about that
OutrageousSheep60
I found the task in the UI - and in the UNCOMMITTED CHANGES execution section there is "No changes logged"
This is the issue.
and then run the session via docker:
clearml-session --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 \
    --packages "clearml" "tensorflow>=2.2" "keras" \
    --queue MY_QUEUE \
    --verbose
Are you running clearml-session from your machine (i.e. not from inside a docker)?...
clearml_agent: ERROR: Can not run task without repository or literal script in script.diff
This is odd ...
OutrageousSheep60 when you launch clearml-session it tells you the session ID (which is also a Task ID). Can you look for it in the UI and check there is something in the repo / uncommitted-changes section?
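If it is easier to check programmatically, here is a minimal sketch (the task ID is a placeholder, and the assumption is that the exported task dict exposes the same script/diff section the UI shows):
from clearml import Task

# placeholder: use the session ID printed by clearml-session
session_task = Task.get_task(task_id="<session_task_id>")

# assumption: export_task() returns a dict whose "script" section holds the
# repository URL and the uncommitted changes ("diff"), mirroring the UI
script_section = session_task.export_task().get("script", {})
print("repository:", script_section.get("repository"))
print("has uncommitted changes:", bool(script_section.get("diff")))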
OK - the issue was the firewall rules that we had.
Nice!
But now there is an issue with the "Setting up connection to remote session" step
OutrageousSheep60 this is just a warning, basically saying we are using the default signed SSH server key (has nothing to do with the random password, just the identifying key being used for the remote ssh session)
Bottom line, I think you have everything working
Good question
https://clear.ml/docs/latest/docs/clearml_agent#dynamic-gpu-allocation
The latest updated help will always be here as well:
clearml-agent daemon --help
But I am considering just failing the task.
This will of course work: just raise an exception in the Task itself, and protect the call in the pipeline logic function with try/except.
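For example, a minimal sketch assuming a decorator-based pipeline (all names here are made up for illustration):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def risky_step(x):
    # failing the component Task is just raising inside it
    if x < 0:
        raise ValueError("negative input, failing this step")
    return x * 2

@PipelineDecorator.pipeline(name="guarded pipeline", project="examples", version="1.0")
def pipeline_logic():
    try:
        result = risky_step(-1)
    except Exception as ex:
        # protect the pipeline logic from the failed component
        print("step failed, falling back:", ex)
        result = 0
    return result

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # run pipeline + components locally for testing
    pipeline_logic()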
Regarding the second option, try to nullify the hash on the Component Task:
# running inside the component Task:
# clear the cache hash so other pipeline runs will not reuse this execution
Task.current_task()._set_runtime_properties({"pipeline_job_hash": None})
try to break it into parts and understand what produces the error
for example:
increase(test12_model_custom:Glucose_bucket[1m])
increase(test12_model_custom:Glucose_sum[1m])
increase(test12_model_custom:Glucose_bucket[1m]) / increase(test12_model_custom:Glucose_sum[1m])
and so on
Hi DilapidatedDucks58 ,
Are you running in docker or venv mode?
Do the workers share a folder on the host machine?
It might be a syncing issue (not directly related to the trains-agent, but to the fact that you have 4 processes trying to simultaneously access the same resource)
BTW: the next trains-agent RC will have a flag (default off) for torch-nightly repository support
CloudyHamster42
RC probably in a few days, but notice that it will just remove the warnings, I still can't reproduce the double axis issue.
It would be helpful if you could send a small script that reproduces the problem.
Maybe this example code can help ? https://github.com/allegroai/trains/blob/master/examples/manual_reporting.py
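For reference, a minimal manual-reporting sketch in that spirit (using the current clearml package name rather than the old trains one; project/task names are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="scalar reporting repro")
logger = task.get_logger()

for i in range(10):
    # two series reported under the same title should share a single plot
    logger.report_scalar(title="loss", series="train", value=1.0 / (i + 1), iteration=i)
    logger.report_scalar(title="loss", series="validation", value=1.2 / (i + 1), iteration=i)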
should I update nodejs in centos image ?
I think so, it might have been forgotten
Thanks for answering, Yes, this is exactly what I wanted
Hmm, should be possible. How slow is the update that we want to save time on?
Hi OutrageousGrasshopper93
I think what you are looking for is Task.import_task and Task.export_task
https://allegro.ai/docs/task.html#trains.task.Task.import_task
https://allegro.ai/docs/task.html#trains.task.Task.export_task
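A minimal sketch of how the two calls could be combined (shown with the current clearml package name; the linked docs use the older trains name but the calls are the same; the task ID is a placeholder):
from clearml import Task

# export the full definition of an existing task as a plain dict
exported = Task.get_task(task_id="<source_task_id>").export_task()

# ... optionally tweak the exported definition here ...

# recreate a task from the exported definition (e.g. on another server)
new_task = Task.import_task(exported)
print("new task id:", new_task.id)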
Do you mean to spin a pod with the agent inside it (daemon in services mode)?
Or connect the services queue to the k8s cluster (i.e. define a pod template that uses CPU with not a lot of RAM)?
Hi ColossalDeer61 ,
Xxx is the module where my main experiment script resides.
So I think there are two options,
1. Assuming you have a similar folder structure:
- main_folder
-- package_folder
-- script_folder
--- script.py
Then if you set the "working directory" in the execution section to "." and the entry point to "script_folder/script.py", your code could do (see the sketch after this list):
from package_folder import ABC
2. After cloning the original experiment, you can edit the "installed packages", and ad...
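For illustration of option 1, a minimal script_folder/script.py matching the layout above (ABC is just a placeholder name from the example):
# script_folder/script.py
# assumes the working directory is the repo root (".") so that
# package_folder/ is importable as a top-level package
from package_folder import ABC  # placeholder object from the example above

if __name__ == "__main__":
    print("imported:", ABC)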
I want to use services queue for running services, and I want to do it on k8s
So yes, as a standalone pod with the agent in venv mode (as opposed to docker mode)
Does that make sense to you?
I guess it won't due to the nature of services?
Correct, the k8s glue works differently. That said, I would actually use the helm chart to spin a pod with the agent in services mode and venv mode.
Makes sense to add it to docker run by default if GPUs are mentioned in agent.
I think this is an Arch thing; --privileged is not needed on the Ubuntu flavor. That said, you can always have it if you add it here:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L149
clearml-agent daemon --gpus 0 --queue default --docker
But docker still sees all GPUs.
Yes, --gpus should be enough. Are you sure regarding the --privileged flag?
SmugLizard25 are you saying that with the latest version it does not work?
one can containerise the whole pipeline and run it pretty much anywhere.
Does that mean the entire pipeline will be running on the instance spinning the container ?
From here: this is what I understand:
https://kedro.readthedocs.io/en/stable/10_deployment/06_kubeflow.html
My thinking was I can use one command and run all steps locally while still registering all "nodes/functions/inputs/outputs etc" with clearml such that I could also then later go into the interface and clone an...