I also discovered https://h2oai.github.io/wave/ last week, it would be awesome to be able to deploy it in the same manner
That was also my feeling! But I thought that spawning the trains-agent from a conda env would isolate me from the cuda drivers on the system
Alright SuccessfulKoala55 I was able to make it work by downgrading clearml-agent to 0.17.2
AppetizingMouse58 the events_plot.json template misses the `plot_len` declaration, could you please give me the definition of this field? (reindexing with `dynamic: strict` fails with: "mapping set to strict, dynamic introduction of [plot_len] within [_doc] is not allowed")
ExcitedFish86 I have several machines with different cuda driver/runtime versions, that is why you might be confused, as I am referring to one or another
Hi AgitatedDove14, I want to circle back on this issue. It is still relevant, and I could collect the following on an ec2 instance where a clearml-agent is running a stuck task:
- There seems to be a problem with multiprocessing: although I stopped the task, there are still many processes forked from the main training process. I guess these are zombies. Please check the htop tree.
- There is a memory leak somewhere, please see the screenshot of datadog mem...
I cannot share the file itself, but here are some potentially helpful points:
- Multiple lines are empty
- One line is empty but has spaces (6 to be exact)
- The last line of the file is empty
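The three observations above can be checked on any text file with a few lines of Python. This is only a minimal sketch: the helper name `describe_blank_lines` and the sample text are made up for illustration, since the real file is not shared here.

```python
from typing import Dict, List


def describe_blank_lines(text: str) -> Dict[str, object]:
    """Report empty lines, whitespace-only lines, and a trailing empty line."""
    lines = text.split("\n")
    empty: List[int] = []            # 1-based numbers of lines with no characters
    whitespace_only: List[int] = []  # 1-based numbers of lines holding only spaces/tabs
    for number, line in enumerate(lines, start=1):
        if line == "":
            empty.append(number)
        elif line.strip() == "":
            whitespace_only.append(number)
    return {
        "empty": empty,
        "whitespace_only": whitespace_only,
        "ends_with_empty_line": lines[-1] == "",
    }


# Hypothetical sample: two package names, blank lines, and one line of 6 spaces.
sample = "numpy\n\n      \ntorch\n\n"
report = describe_blank_lines(sample)
print(report)
```

To inspect a real file, read it with `open(path).read()` and pass the result to the helper.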
is there a command / file for that?
Thanks! With this I'll probably be able to reduce the cluster size to be on the safe side for a couple of months at least :)
The only thing that changed is the new auth.fixed_users.pass_hashed field, which I don't have in my config file
ha wait, I removed the http:// in the host and it worked
I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird
Sure! Here are the relevant parts:
` ...
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
yes that makes sense, I will do that. Thanks!
I execute the clearml-agent this way: /home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached
Hi AnxiousSeal95, I hope you had nice holidays! Thanks for the update! I discovered h2o when looking for ways to deploy dashboards with apps like streamlit. Most likely I will use either streamlit deployed through clearml or h2o as standalone if ClearML won't support deploying apps (which is totally fine, no offense there)
nothing wrong from ClearML side
There is an example in the https://github.com/allegroai/clearml/blob/master/docs/datasets.md#workflow section of the link I shared above
Sure, just sent you a screenshot in PM
This https://stackoverflow.com/questions/65109764/wildcard-search-issue-with-long-datatype-in-elasticsearch says long types can be converted to string to do the search
But I would need to reindex everything, right? Is that an expensive operation?
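For reference, converting a field from long to a string type generally means creating a new index with the new mapping and copying the documents over with the `_reindex` API. Every document gets rewritten, so the cost grows with the size of the index. Below is a rough sketch of the two request bodies, assuming Elasticsearch 7-style typeless mappings; the index names and the field name are hypothetical placeholders, not the ones used by clearml-server.

```python
import json

# Hypothetical names, for illustration only.
SOURCE_INDEX = "events-old"
DEST_INDEX = "events-new"
FIELD = "my_long_field"


def build_dest_mapping(field: str) -> dict:
    """Body for creating the destination index with `field` stored as a keyword."""
    return {"mappings": {"properties": {field: {"type": "keyword"}}}}


def build_reindex_body(source: str, dest: str) -> dict:
    """Body for POST _reindex: copy every document from `source` into `dest`."""
    return {"source": {"index": source}, "dest": {"index": dest}}


# Print the requests instead of sending them, so nothing is modified by accident.
print("PUT /%s" % DEST_INDEX)
print(json.dumps(build_dest_mapping(FIELD), indent=2))
print("POST /_reindex")
print(json.dumps(build_reindex_body(SOURCE_INDEX, DEST_INDEX), indent=2))
```

It is probably worth trying this on a small copy of the index first to get a feeling for how long the full reindex would take.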
AgitatedDove14 Should I create an issue for this to keep track of it?
From my experience, I only installed cuda drivers on my machines. I didn't use conda to install torch or cudatoolkit, I just let clearml-agent download the torch wheel file and install it
and with this setup I can use GPU without any problem, meaning that the wheel does contain the cuda runtime
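A quick way to check this on a given machine is to ask torch which cuda runtime it was built with and whether it can see a GPU. This is just a small sketch; the helper name `cuda_report` is made up, and it returns early if torch is not installed in the current environment.

```python
def cuda_report() -> dict:
    """Collect what the installed torch build reports about cuda, if torch is present."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # cuda runtime version the wheel was built with; None for CPU-only builds
        "bundled_cuda_runtime": torch.version.cuda,
        # True only if a usable GPU and a compatible driver are found
        "gpu_available": torch.cuda.is_available(),
    }


print(cuda_report())
```

If `bundled_cuda_runtime` is set and `gpu_available` is true on a machine where only the drivers were installed, that is consistent with the runtime coming from the wheel.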
Not of the ES cluster, I only created a backup of the clearml-server instance disk, I didn't think there could be a problem with ES…
Yes, it works now! Yay!
AgitatedDove14 The first time it installs and creates the cache for the env, the second time it fails with: Applying uncommitted changes ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found. clearml_agent: ERROR: Command '['/home/user/.clearml/venvs-builds.1/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsmncaxx45.txt']' returned non-zero exit status 1.