Not of the ES cluster, I only created a backup of the clearml-server instance disk, I didn’t think there could be a problem with ES…
Yes, it works now! Yay!
AgitatedDove14 The first time it installs and creates the cache for the env; the second time it fails with:
Applying uncommitted changes
ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
clearml_agent: ERROR: Command '['/home/user/.clearml/venvs-builds.1/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsmncaxx45.txt']' returned non-zero exit status 1.
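(For my own notes, a minimal sketch of what would make "." installable again, assuming the repo root is the working directory; the package name and contents below are placeholders:)
```python
# setup.py -- minimal packaging metadata so that `pip install .` succeeds
# (package name and dependency list are placeholders)
from setuptools import setup, find_packages

setup(
    name="my-project",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # runtime dependencies go here
    ],
)
```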
it actually looks like I don’t need such a high number of files opened at the same time
yes, the only thing I changed is:
install_requires=[ ... "my-dep @ git+
]
to:
install_requires=[ ... "git+
"]
torch==1.7.1
git+
.
I am looking for a way to gracefully stop the task (clean up artifacts, shut down the backend service) on the agent
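(The direction I have in mind, as a minimal sketch: it assumes the agent delivers SIGTERM/SIGINT to the task process when it is stopped, and the cleanup hooks are placeholders:)
```python
# Minimal sketch: trap termination signals in the task script so that
# artifacts can be flushed and the backend service shut down before exit.
import signal
import sys

def _graceful_shutdown(signum, frame):
    print(f"Received signal {signum}, cleaning up...")
    # placeholder hooks -- replace with real artifact upload / service shutdown
    # upload_artifacts()
    # stop_backend_service()
    sys.exit(0)

signal.signal(signal.SIGTERM, _graceful_shutdown)
signal.signal(signal.SIGINT, _graceful_shutdown)
```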
(Just to know if I should wait a bit or go with the first solution)
Are you planning to add a server-backup service task in the near future?
ok, thanks SuccessfulKoala55 !
Hi CostlyOstrich36! No, I am running in venv mode
Hi AgitatedDove14 , I investigated further and got rid of a separate bug. I was able to get ignite’s events fired, but still no scalars logged 😞
There is definitely something wrong going on with the reporting of scalars using multi processes, because if my ignite callback is the following:
def log_loss(engine):
    idist.barrier()  # Sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().r...
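(The direction I am testing, as a sketch: it assumes scalars should only be reported from rank 0; the `trainer` engine and the reported values are placeholders:)
```python
import ignite.distributed as idist
from ignite.engine import Events
from clearml import Task

def log_loss(engine):
    idist.barrier()  # sync all processes before reporting
    if idist.get_rank() == 0:  # report from the main process only
        Task.current_task().get_logger().report_scalar(
            title="loss",
            series="train",
            value=engine.state.output,
            iteration=engine.state.iteration,
        )

# attach to a (placeholder) trainer engine, e.g. every 100 iterations:
# trainer.add_event_handler(Events.ITERATION_COMPLETED(every=100), log_loss)
```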
For the moment this is what I would be inclined to believe
Is it because I did not specify --gpu 0 that the agent, by default, pulls one experiment per available GPU?
yes, in the code, I do:
task._wait_for_repo_detection()
REQS_TASK = ["torch==1.3.1", "pytorch-ignite @ git+
", "."]
task._update_requirements(REQS_TASK)
task.execute_remotely(queue_name=args.queue, clone=False, exit_process=True)
I am still confused though: on the Get Started page of the PyTorch website, when choosing "conda" the generated installation command includes cudatoolkit, while when choosing "pip" it only references a wheel file.
Does that mean the wheel file contains cudatoolkit (cuda runtime)?
now I can do nvcc --version and I get:
Cuda compilation tools, release 10.1, V10.1.243
in clearml.conf:
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
Interestingly, I do see the 100 GB volume in the AWS console:
Thanks for your answer! I am in the process of adding subnet_id/security_groups_id/key_name to the config to be able to SSH into the machine, will keep you informed 😄
Ha nice, makes perfect sense thanks AgitatedDove14 !
Yes, not sure it is connected either, actually. To make it work, I had to disable both venv caching and set use_system_packages to off, so that it reinstalls the full env. I remember we discussed this problem already but I don't remember what the outcome was; I was never able to make it update the private dependencies based on the version. But this is most likely a problem with pip, which is not clever enough to parse the tag as a semantic version and check whether the installed package ma...
can it be that the merge op uses so much of the filesystem cache that the rest of the system becomes unresponsive?
but according to the disk graphs, the OS disk is being used, but not the data disk
AgitatedDove14 I do continue an aborted Task, yes, so I shouldn't even need to call the task.set_initial_iteration function, interesting! Do you have any ideas what could be the reason for the behavior I am observing? I am trying to find ways to debug it
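(What I am using to debug it, as a sketch; it assumes the continued task is the current one, and the offset value is only for illustration:)
```python
from clearml import Task

task = Task.current_task()
# When continuing an aborted task, reporting normally resumes from the last
# reported iteration; printing/forcing the offset here is only for debugging.
print("initial iteration offset:", task.get_initial_iteration())
task.set_initial_iteration(0)  # placeholder offset, illustration only
```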
with the CLI, on a conda env located in /data
Would adding an ILM (index lifecycle management) policy be an appropriate solution?