This means that if something happens with the k8s node the pod runs on,
Actually if the pod crashed (the pod, not the Task) k8s should re-spin it, no?
I also experience that if a worker pod running a task is terminated, clearml does not fail/abort the task.
From the k8s perspective, if the task ended (failed/completed) it always returns with exit code 0, i.e. success, because the agent was able to spin the Task. We do not want Tasks with exceptions to litter the k8s with endless r...
Hi GentleSwallow91
I am very much concerned with docker container spin up time.
To accelerate spin-up time (mostly pip install) use the venv caching (basically it will store a cache of the entire installed venv so it does not need to reinstall it)
Uncomment this line:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L116
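For reference, a minimal sketch of the relevant clearml.conf section (key names follow the linked file; exact defaults may differ between agent versions):
agent {
    venvs_cache: {
        # maximum number of cached venvs kept on disk
        max_entries: 10
        # minimum free space (GB) required to add a new cache entry
        free_space_threshold_gb: 2.0
        # uncommenting the path is what actually enables the cache
        path: ~/.clearml/venvs-cache
    }
}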
The problem above could be that I used a non-root user to train a model and all packages are installed for ...
Hi TrickyRaccoon92
... would any running experiment keep a cache of to-be-sent-data, fail the experiment, or continue the run, skipping the recordings until the server is back up?
Basically they will keep trying to send data to the server until it is up again (you should not lose any of the logs)
Are there any clever functionality for dumping experiment data to external storage to avoid filling up the server?
You mean artifacts or the database?
If i have an alternative location for the vscode, where should i indicate in the configuration?
We might need to add support for that, but it should not be a problem to override (e.g. downloadable link like http/s3/ etc.)
Is this something that is doable ?
SubstantialElk6 could you add a GitHub issue to set the direct url for the vscode as a parameter to the clearml-session?
We already have --vscode-version
we could either extend it to include a direct url, or add a new argument.
wdyt ?
and what is --storage s3//:inference?
if you are using minio it should be something like s3://<minio-ip>:<port>/inference
Notice you have to specify the IP:port otherwise it thinks it is an AWS endpoint
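For example, a sketch of the matching minio section in clearml.conf (the host, keys and bucket below are placeholders):
sdk {
    aws {
        s3 {
            credentials: [
                {
                    # IP:port of the minio server, not an AWS endpoint
                    host: "10.0.0.1:9000"
                    key: "minio-access-key"
                    secret: "minio-secret-key"
                    multipart: false
                    secure: false
                }
            ]
        }
    }
}
With that in place the storage URI would look like s3://10.0.0.1:9000/inference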
Notice that the new pip syntax: packagename @ <some_link_here>
is actually interpreted by pip as: install "packagename"; only if it is not already installed, use "<some_link_here>" to install it.
DilapidatedDucks58 use a full link, without the package name, i.e. just git+<some_link_here>
Hi DilapidatedDucks58
how to force-reinstall package from github in Installed Packages
You mean make sure that the agent installs it from github?
The "Installed packages" section is equivalent to "requirements.txt" anything you can put in requirements.txt, you can put there.
For example adding to "Installed Packages":
git+https://github.com/allegroai/clearml.git
Will make sure you install the latest clearml from GitHub.
Notice that you cannot have two packages with the same name (just like with regular requirements.txt)...
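To make this concrete, a couple of example entries for "Installed packages" (the URL is the public clearml repository; swap in your own fork/branch as needed):
git+https://github.com/allegroai/clearml.git
or, with the PEP 508 direct-reference syntax from above (remember pip will only use the link if "clearml" is not already installed):
clearml @ git+https://github.com/allegroai/clearml.git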
This, however, requires that I slightly modify the clearml helm chart with the aws-autoscaler deployment, right?
Correct 🙂
So this is very odd, it looks like a pip bug:
The agent is trying to install torch==2.1.0.*
because by default it ignores the 4th+ version parts (they are unstable and torch has a tendency to remove them), and for some reason pip will not match 2.1.0.*
with, for example, "2.1.0.dev20230306+cu118"
but based on the pip docs it should work.
As a workaround you can always edit it and change it to the final URL, for example: so ...
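To illustrate the workaround (the wheel URL below is hypothetical, substitute the real link for your nightly build and CUDA version), the "Installed packages" entry
torch==2.1.0.dev20230306+cu118
would be replaced with a direct reference such as:
torch @ https://download.pytorch.org/whl/nightly/cu118/torch-2.1.0.dev20230306%2Bcu118-cp310-cp310-linux_x86_64.whl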
For visibility: after close inspection of the API calls it turns out there was no work against the SaaS server, hence no data
Thanks CynicalBee90, I appreciate the discussion! Since I'm assuming you will actually amend the misrepresentation in your table, let me follow up here.
1.
SSPL license may be a significant consideration for some, and so we thought it was important to point this out clearly.
SSPL is fully open-source compliant unless you have the intention of selling it as a service; I hardly think this is any user's consideration, just like anyone would be using MongoDB or Elasticsearch without think...
Hi CynicalBee90
Always great to have people joining the conversation, especially if they are the decision makers, a.k.a. can amend mistakes 🙂
If I can summarize a few points here (and feel free to fill in / edit any mistake or leftovers)
Open-Source license: This is basically the MongoDB license, which is as open as possible with the ability to, at the end, offer some protection against Amazon giants stealing APIs (like they did for both MongoDB and Elasticsearch)
Platform & language agno...
Hi @<1556450111259676672:profile|PlainSeaurchin97>
While testing the migration, we found that all of our models had their MODEL URL set to the IP of the old server.
Yes, all the artifacts/models/debug-samples are stored "as is", meaning that if you configured your original setup with an IP, it is kind of stuck there; this is why it is always preferred to use a host-name ...
you apparently also need to rename all model URLs
Yes 🙂
That might be me, let me check...
JitteryCoyote63 are you calling:
my_task.output_uri = "s3://my-bucket"
in the code itself?
Why not pass it with Task.init(output_uri=...)?
Also, since this is running remotely, there is no need for that: use Execution -> Output -> Destination and put it there, it will do everything for you 🙂
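A short sketch of the Task.init route (project/task names here are placeholders):
from clearml import Task

# setting the destination at init time replaces the later
# "my_task.output_uri = ..." assignment in the code
task = Task.init(
    project_name="examples",
    task_name="s3-output",
    output_uri="s3://my-bucket",
)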
Yes I suspect it is too large 🙂
Notice that most parts have default values so there is no need to specify them
Can you share the log?
Regarding the agent - No particular reason. Can you point me on how to do it?
This is a good place to start
https://clear.ml/docs/latest/docs/getting_started/mlops/mlops_first_steps
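If it helps, the first steps from that page boil down to something like this (queue name is just an example):
pip install clearml-agent
clearml-agent init                    # one-time setup: connect the agent to your server
clearml-agent daemon --queue default  # start pulling tasks from the "default" queue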
We need the automagic...
This is one of the great benefits of using clearml 🙂
Sure, try this one:
Task.debug_simulate_remote_task('reused_task_id')
task = Task.init(...)
Notice it will take the arguments from the clearml-task itself (e.g. override argparse arguments with what ...
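A fuller usage sketch (the task ID is a placeholder for an existing task you want to impersonate):
from clearml import Task
import argparse

# must run BEFORE Task.init so this local process impersonates the existing task
Task.debug_simulate_remote_task('reused_task_id')

task = Task.init(project_name="examples", task_name="debug")

# as noted above, argparse defaults are overridden with the values
# stored on the simulated task, just like in a real remote run
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.1)
args = parser.parse_args()
print(args.lr)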
Hmmm, why don't you use "series"?
(Notice that with iterations, there is a limit to the number of images stored per title/series, which is configurable in trains.conf, in order to avoid debug sample explosion)
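For example, a minimal sketch of reporting debug images under one title with multiple series (project, title and series names are illustrative):
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="debug-samples")
logger = task.get_logger()

for split in ("train", "val"):
    img = np.random.randint(0, 255, size=(64, 64, 3), dtype=np.uint8)
    # each (title, series) pair keeps its own rolling history of images
    logger.report_image(title="predictions", series=split, iteration=0, image=img)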
Are you running the agent in docker mode?
Is there a mount to the host machine?
Hi JumpyDragonfly13
Let's assume we have two machines, one we call remote, one we call laptop (at least for this discussion)
On the Remote machine we need to run (notice we must have docker preinstalled on the remote machine; it can work without docker, let me know if this is the case for you):
clearml-agent daemon --queue interactive --create-queue --docker
On the Laptop we run:
clearml-session --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
What clearml-session will do is crea...
Basically run the agent in virtual environment mode. JumpyDragonfly13 try this one (notice no --docker flag):
clearml-agent daemon --queue interactive --create-queue
Then from the "laptop" try to get a remote session with:
clearml-session
Hi DeliciousKoala34
This means the pycharm plugin was not able to run git on your local machine.
What's your OS?
Could it be that if you open cmd / shell, "git" is not in the path?
Hi RoughTiger69
One quirk I found was that even with this flag on, the agent decides to install whatever is in the requirements.txt
What's the clearml-agent version you are using?
I just noticed that even when I clear the list of installed packages in the UI, upon startup, clearml agent still picks up the requirements.txt (after checking out the code) and tries to install it.
It can also just skip the entire Python installation with:
CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
My bad, you have to pass it to the container itself:
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/docs/clearml.conf#L149
extra_docker_arguments: ["-e", "CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1"]
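For completeness, a small sketch of the non-docker (venv) alternative, assuming the agent picks the variable up from its environment (the queue name is just an example):
export CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
clearml-agent daemon --queue default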