This https://stackoverflow.com/questions/65109764/wildcard-search-issue-with-long-datatype-in-elasticsearch says long types can be converted to strings to do the search
But I would need to reindex everything, right? Is that an expensive operation?
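If I do have to reindex, I imagine it would look roughly like this (just a sketch using the REST API through `requests`; the index names, the `task_id` field and the mapping are placeholders, not the actual ClearML schema, and whether a keyword sub-field on a long works should be verified for the ES version in use):
```python
import requests

ES = "http://localhost:9200"  # placeholder ES endpoint

# Create a new index where the numeric field is also indexed as a keyword,
# so wildcard queries can run against the string representation.
requests.put(f"{ES}/events_v2", json={
    "mappings": {
        "properties": {
            "task_id": {
                "type": "long",
                "fields": {"as_string": {"type": "keyword"}}
            }
        }
    }
}).raise_for_status()

# Copy all documents from the old index into the new one.
requests.post(f"{ES}/_reindex?wait_for_completion=false", json={
    "source": {"index": "events_v1"},
    "dest": {"index": "events_v2"}
}).raise_for_status()
```
So yes, it copies every document, which is why I'm asking how expensive that would be on a large cluster.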
AgitatedDove14 Should I create an issue for this to keep track of it?
From my experience, I only installed the CUDA drivers on my machines. I didn't use conda to install torch or cudatoolkit, I just let clearml-agent download the torch wheel file and install it
and with this setup I can use the GPU without any problem, meaning that the wheel does contain the CUDA runtime
Not of the ES cluster, I only created a backup of the clearml-server instance disk, I didn’t think there could be a problem with ES…
Yes, it works now! Yay!
AgitatedDove14 The first time it installs and creates the cache for the env, the second time it fails with:
```
Applying uncommitted changes
ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
clearml_agent: ERROR: Command '['/home/user/.clearml/venvs-builds.1/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsmncaxx45.txt']' returned non-zero exit status 1.
```
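For reference, pip only treats '.' as installable when there is a `setup.py` (or `pyproject.toml`) at the repo root; a minimal sketch of what it expects (package name and dependencies are placeholders, not my actual project):
```python
# setup.py -- minimal example so that `pip install .` works from the repo root
from setuptools import setup, find_packages

setup(
    name="my-project",      # placeholder name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[],    # runtime dependencies go here
)
```
My repo does have one, so the error seems to come from the working directory pip is run in on the second (cached) run.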
it actually looks like I don’t need such a high number of files opened at the same time
yes, the only thing I changed is:
```
install_requires=[
    ...
    "my-dep @ git+
]
```
to:
```
install_requires=[
    ...
    "git+
"]
```
I am looking for a way to gracefully stop the task (clean up artifacts, shut down the backend service) on the agent
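Something like this is what I have in mind (just a sketch, assuming the agent stops the task by sending SIGTERM/SIGINT to the process; the cleanup function is a placeholder):
```python
import signal
import sys

def cleanup():
    # Placeholder: upload/remove artifacts, shut down the backend service, etc.
    print("Cleaning up before exit...")

def handle_stop(signum, frame):
    cleanup()
    sys.exit(0)

# Catch termination signals so cleanup runs before the process exits.
signal.signal(signal.SIGTERM, handle_stop)
signal.signal(signal.SIGINT, handle_stop)
```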
(Just to know if I should wait a bit or go with the first solution)
Are you planning to add a server-backup service task in the near future?
ok, thanks SuccessfulKoala55 !
Hi CostlyOstrich36! No, I am running in venv mode
Hi AgitatedDove14 , I investigated further and got rid of a separate bug. I was able to get ignite’s events fired, but still no scalars logged 😞
There is definitely something wrong going on with the reporting of scalars using multiple processes, because if my ignite callback is the following:
```
def log_loss(engine):
    idist.barrier()  # Sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().r...
```
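For context, the full callback plus the way it gets attached looks roughly like this (a sketch; the toy `train_step`, the event choice and the `report_scalar` arguments are placeholders for my actual setup):
```python
import ignite.distributed as idist
from ignite.engine import Engine, Events
from clearml import Task

# Toy training step so the sketch is self-contained.
def train_step(engine, batch):
    return 0.0  # placeholder loss

trainer = Engine(train_step)

def log_loss(engine):
    idist.barrier()          # sync all processes
    device = idist.device()
    print("IDIST", device)
    # report_scalar(title, series, value, iteration)
    Task.current_task().get_logger().report_scalar(
        "loss", "train", engine.state.output, engine.state.iteration
    )

trainer.add_event_handler(Events.ITERATION_COMPLETED, log_loss)
```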
For the moment this is what I would be inclined to believe
Is it because I did not specify --gpu 0
that the agent, by default, pulls one experiment per available GPU?
yes, in the code, I do:
```
task._wait_for_repo_detection()
REQS_TASK = ["torch==1.3.1", "pytorch-ignite @ git+
", "."]
task._update_requirements(REQS_TASK)
task.execute_remotely(queue_name=args.queue, clone=False, exit_process=True)
```
I am still confused though - from the Get Started page of the PyTorch website, when choosing "conda", the generated installation command includes cudatoolkit, while when choosing "pip" it only uses a wheel file.
Does that mean the wheel file contains cudatoolkit (cuda runtime)?
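One quick way to check (a small sketch, assuming the pip wheel of torch is installed):
```python
import torch

# The CUDA runtime version the wheel was built against (bundled with the pip wheel),
# independent of any system-wide cudatoolkit install.
print("torch CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
```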
now I can do nvcc --version
and I get: Cuda compilation tools, release 10.1, V10.1.243
in clearml.conf:
```
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
```
Interestingly, I do see the 100GB volume in the AWS console:
Thanks for your answer! I am in the process of adding subnet_id/security_groups_id/key_name to the config to be able to ssh in the machine, will keep you informed 😄
Ha nice, makes perfect sense thanks AgitatedDove14 !
Yes, not sure it is connected either actually - To make it work, I had to disable both venv caching and set use_system_packages to off, so that it reinstalls the full env. I remember that we discussed this problem already, but I don't remember what the outcome was; I was never able to make it update the private dependencies based on the version. But this is most likely a problem with pip, which is not clever enough to parse the tag as a semantic version and check whether the installed package ma...
can it be that the merge op uses so much filesystem cache that the rest of the system becomes unresponsive?