SuccessfulKoala55 so you say deleting other old indices that I don't need could help?
I want to cache as much as possible, and /clearml-cache/venvs-cache (on the host) does contain cached venvs. But /clearml-cache/venvs-builds is empty. My question was how to also cache venvs_builds.
Yes, this happened when the disk got filled up to 100%
The agents also share the clearml.conf
file which causes some issue with the worker_id/worker_name. They all want to be ubuntu:gpu0. Any idea how I can randomize it? Setting the CLEARML_WORKER_ID env var somehow does not work
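In case it helps, a sketch of what I would try (assuming the agent reads CLEARML_WORKER_ID from the environment of the process that actually launches it, which is where this often goes wrong when the agent runs as a service; flags below are just an example invocation):
```
# give each agent a unique worker id derived from hostname + GPU index (sketch)
# GPU_INDEX is a hypothetical variable set per agent instance
export CLEARML_WORKER_ID="$(hostname):gpu${GPU_INDEX:-0}"
clearml-agent daemon --queue default --gpus "${GPU_INDEX:-0}" --detached
```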
My code now produces an error inside one of the threads, but that should be an issue on my side. Still, this error inside a child thread was not detected as a failure and the training task ended up "completed". This error now happens with the Task.init inside the if __name__ == "__main__": as seen above in the code snippet.
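A minimal sketch of how the parent process could surface a child failure so the task does not end up "completed" (this assumes plain multiprocessing children; spawn_dist may behave differently):
```
import multiprocessing as mp
import sys

def worker(rank):
    # placeholder for the per-process training code that may raise
    raise RuntimeError(f"failure in child {rank}")

if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(rank,)) for rank in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # propagate child failures, otherwise the main process (and the task) exits cleanly
    if any(p.exitcode != 0 for p in procs):
        sys.exit("a child process failed")
```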
I'm now running the code shown above and will let you know if there is still an issue
That's it? no apparent error?
After the logs at the top there were only logs at "info" level from PluginsService
The output seen above indicates that the index is corrupt and probably lost, but that is not necessarily the case
using this code in https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py
```
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run...
```
1. ssh into the elasticsearch container
2. identify the id of the index that seems to be broken
3. run `/usr/share/elasticsearch/jdk/bin/java -cp lucene-core*.jar -ea:org.apache.lucene… org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/your-id/0/index/ -verbose -exorcise`
This can be dangerous but is the only option if you assume that the data is lost anyway. Either running step 3 repairs the broken segments, or it shows, as in my case, "No problems were detected with this i...
Did you wait for all the other indices to reach yellow status?
yes I waited until everything was yellow
Solving the replica issue now allowed me to get better insights into why the one index is red.
```
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-11-09T22:30:47.018Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a...
```
The cache on the host is mounted via NFS, and the NFS server was configured to not allow the clients to do root operations.
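If root squashing is indeed the blocker, one possible fix (a sketch only; the export path and client subnet below are made up, and disabling root squash has security implications) is to export the share with no_root_squash on the NFS server and re-export:
```
# /etc/exports on the NFS server (hypothetical path and client subnet)
/clearml-cache  192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
# then apply with: exportfs -ra
```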
What version of clearml is your server?
the docker-compose uses clearml:latest
so you say deleting other old indices that I don't need could help?
This did not help, I still have the same issue
So I don't need docker_internal_mounts at all?
hm... Now with commenting it out I have the following problem:
docker_pip_cache = /clearml-cache/pip-cache
On host: drwxrwxrwx 5 root root 5 Mar 10 17:17 pip-cache
In task logs: chown: changing ownership of '/root/.cache/pip': Operation not permitted
Try to restart ES and see if it helps
docker-compose down / up does not help
```
root@ubuntu:/opt/clearml# sudo docker logs clearml-elastic
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
{"type": "server", "timestamp": "2021-11-09T12:49:13,403Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (//some_ip/clearml-server-data)]], net usable_space [3.4tb]...
```
AgitatedDove14 one more thing regarding the initial question: apt-cache, pip-cache, pip-download-cache, vcs-cache and venvs-cache contain data on the shared clearml-cache, but venvs-build does not? What sort of data would be stored in the venvs-build folder? I do have venvs_dir = /clearml-cache/venvs-builds specified in the clearml.conf
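For reference, what I would expect to control this is the venvs_cache section of clearml.conf in recent clearml-agent versions; something along these lines, but please double-check against the default config since the exact keys here are my assumption:
```
agent {
    venvs_dir: /clearml-cache/venvs-builds
    # cache of fully built venvs, reused across tasks (keys/paths are assumptions)
    venvs_cache: {
        max_entries: 10
        free_space_threshold_gb: 2.0
        path: /clearml-cache/venvs-cache
    }
}
```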
so now there is the user conflict between the host and the agent inside the container
```
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args)
```
I added it to this script and use it as a starting point https://github.com/facebookresearch/fastMRI/bl...
are they in conflict?
Hey Natan, good point! But I have actually set both
I will try to recover it, but anyway the learning is to fully separate the fileserver and any output location from mongo, redis and elastic. Also, maybe it makes sense to improve the ES setup to have replicas.
This happens inside the agent, since I use task.execute_remotely() I guess. The agent runs on ubuntu 18.04 and not in docker mode
```
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
```
Hi AgitatedDove14 one more question about efficient caching, is it possible to cache/share docker images between agents?
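Not ClearML-specific, but one generic way to share image pulls between agent hosts is a pull-through registry cache (sketch; the mirror host and port below are hypothetical):
```
# run a pull-through cache of Docker Hub on some shared host
docker run -d --restart=always --name registry-mirror -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io registry:2

# on each agent host, point the docker daemon at the mirror and restart it:
#   /etc/docker/daemon.json -> { "registry-mirrors": ["http://<mirror-host>:5000"] }
#   sudo systemctl restart docker
```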
Also, how much memory is allocated for ES? (it's in the docker-compose file)
I increased already the memory to 8GB after reading similar issues here on the slack
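(For anyone searching later: the ES heap is set via ES_JAVA_OPTS on the elasticsearch service in the docker-compose file, roughly as in the snippet below; the exact service name may differ between clearml-server versions.)
```
  elasticsearch:
    environment:
      ES_JAVA_OPTS: "-Xms8g -Xmx8g"
```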