But I see in the agent logs:
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...
And if you need a very small change, you can also simply monkey-patch it ( https://www.geeksforgeeks.org/monkey-patching-in-python-dynamic-behavior/ )
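For reference, a minimal monkey-patching sketch in Python; the class and method names here are made up for illustration:
```python
# Minimal monkey-patching sketch: swap a method on an existing class at
# runtime without touching the library's source. `SomeClass` and `greet`
# are hypothetical names used only for illustration.

class SomeClass:
    def greet(self):
        return "original"

def patched_greet(self):
    # New behavior injected at runtime
    return "patched"

# Replace the implementation; all existing and future instances pick it up
SomeClass.greet = patched_greet

print(SomeClass().greet())  # -> "patched"
```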
It failed as well
The task I cloned from is not the one I thought
Hi AgitatedDove14, here is the full log.
Both python versions (local and remote) are Python 3.6. Locally (macOS), I get pytorch3d== (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0, 0.4.0, 0.5.0)
Remotely (Ubuntu), I get (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0)
So I guess it’s not really related to clearml-agent, but rather to pip, which cannot find a proper Ubuntu wheel for the latest versions of pytorch3d, right? If yes, is there a way to build the wheel on the remote machine...
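If not, one workaround could be to make pip build from source instead of searching for a pre-built wheel; a minimal sketch (the v0.5.0 tag and the subprocess approach are assumptions, adjust to the release you need):
```python
import subprocess
import sys

# Sketch: install pytorch3d from its git repo so pip builds the wheel on
# the remote machine instead of searching PyPI for a matching binary.
# Note: building pytorch3d from source assumes torch is already installed.
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "git+https://github.com/facebookresearch/pytorch3d.git@v0.5.0",
])
```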
Here is the data disk (/opt/clearml) on the left, and the OS disk on the right
well I still see some ES errors in the logs:
```
clearml-apiserver | [2021-07-07 14:02:17,009] [9] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 65750ms, msg=General data error: err=('500 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': 'c2068648d2fe5da975665985f44c20b6', 'status':..., extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not...
```
But I would need to reindex everything, right? Is that an expensive operation?
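For context, reindexing goes through Elasticsearch's _reindex API and copies every document, so its cost grows with index size; a minimal sketch (the host and index names below are placeholders):
```python
import requests

# Sketch of Elasticsearch's _reindex API: copies every document from the
# source index into the destination index, so cost scales with index size.
# Host and index names are hypothetical placeholders.
body = {
    "source": {"index": "events-training_stats_scalar-old"},
    "dest": {"index": "events-training_stats_scalar-new"},
}
resp = requests.post(
    "http://localhost:9200/_reindex",
    json=body,
    params={"wait_for_completion": "false"},  # run as a background task
)
print(resp.json())
```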
SuccessfulKoala55 For the last two hours I have been getting 504 errors and I cannot ssh into the machine. AWS reports that instance health checks fail. Is it safe to restart the instance?
AgitatedDove14 I eventually found a different way of achieving what I needed
I have 11.0 installed, but on another machine that also has 11.0 installed, trains downloads torch for cuda 10.1; I guess this is because no wheel exists for torch==1.3.1 and cuda 11.0
yes, in the code, I do:
```python
task._wait_for_repo_detection()
REQS_TASK = ["torch==1.3.1", "pytorch-ignite @ git+", "."]
task._update_requirements(REQS_TASK)
task.execute_remotely(queue_name=args.queue, clone=False, exit_process=True)
```
AgitatedDove14 in the docs ( https://clear.ml/docs/latest/docs/apps/clearml_session/#running-in-docker ) there is a --docker option; that’s what confuses me, since the agent should always run in docker mode
Just tried, still the same issue
How would it interact with the clearml-server api service? Would it be completely transparent?
"Can only use wildcard queries on keyword and text fields - not on [iter] which is of type [long]"
ok, now I actually remember why I used _update_requirements instead of add_requirements: the former overwrites all the others, the latter only adds to the already detected packages. Since my deps are listed in the dependencies of my setup.py, I don't want clearml to list the dependencies of the current environment
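For illustration, a minimal sketch of the two approaches (project/task names are placeholders, and _update_requirements is a private API whose behavior may differ across versions):
```python
from clearml import Task

# add_requirements appends to the auto-detected package list;
# it must be called before Task.init().
Task.add_requirements("torch", "1.3.1")

task = Task.init(project_name="example", task_name="requirements-demo")

# _update_requirements (private API) replaces the detected list entirely --
# here with just torch and ".", i.e. the deps declared in setup.py.
task._update_requirements(["torch==1.3.1", "."])
```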
AgitatedDove14 I have a machine with two gpus and one agent per GPU. I provide the same trains.conf to both agents, so they use the same directory for caching venvs. Can it be problematic?
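If the shared cache does turn out to be a problem, a possible workaround (a sketch, assuming the agent.venvs_dir setting; the path is a placeholder) is to give each agent its own venvs directory in its trains.conf:
```
agent {
    # separate venv build/cache directory for this agent (e.g. the GPU-0 agent)
    venvs_dir: ~/.trains/venvs-builds-gpu0
}
```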
I have two controller tasks running in parallel in the trains-agent services queue
So the controller task finished, and now only the second trains-agent services-mode process shows up as registered. So this is definitely something linked to the switching back to the main process.
The region is empty; I never entered it and it worked
So most likely trains was masking the original error; it might be worth investigating to help other users in the future