
We run a lot of pipelines that are CPU-only with some parallel steps. It's just about improving the execution time.
This happens inside the agent, since I use task.execute_remotely(), I guess. The agent runs on Ubuntu 18.04 and not in Docker mode.
```
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args)
```
I added it to this script and use it as a starting point: https://github.com/facebookresearch/fastMRI/bl...
Did you wait for all the other indices to reach yellow status?
yes I waited until everything was yellow
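For reference, a minimal sketch (assuming the Elasticsearch port of the clearml-server compose stack is reachable on localhost:9200) of how the wait for yellow status can be checked programmatically:
```python
import requests

# Block until the cluster reaches at least yellow status
# (all primary shards assigned), or the timeout expires.
health = requests.get(
    "http://localhost:9200/_cluster/health",
    params={"wait_for_status": "yellow", "timeout": "60s"},
).json()
print(health["status"], "unassigned shards:", health["unassigned_shards"])
```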
Actually I saw that the `RuntimeError: context has already been set` appears when the task is initialised outside `if __name__ == "__main__":`.
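For context, a minimal sketch of the guarded pattern, assuming a torch.multiprocessing-style spawn like the fastMRI script uses (the worker and args here are illustrative, not the actual spawn_dist code):
```python
from clearml import Task
import torch.multiprocessing as mp


def worker(rank, args):
    # placeholder for the per-process training code
    print(f"worker {rank} started with {args}")


if __name__ == "__main__":
    args = {"lr": 1e-3}  # illustrative; the real script parses argparse args
    # With the "spawn" start method the children re-import this module, so
    # any module-level code runs again; keeping Task.init and the spawn call
    # inside the main guard avoids re-running it, which is what tends to
    # trigger "RuntimeError: context has already been set".
    task = Task.init(project_name="dummy", task_name="pretraining")
    task.execute_remotely()
    mp.spawn(worker, args=(args,), nprocs=2)
```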
I like this approach more, but it still requires resolved environment variables inside the clearml.conf.
I can figure out a way to resolve it, but is there any other way to get env vars / any value or secret from the host into the Docker container of a task?
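One hedged option (not verified against this exact setup): instead of baking values into clearml.conf, let the agent forward selected host environment variables into the task container via its extra Docker arguments. MY_SECRET and AWS_ACCESS_KEY_ID below are just illustrative variable names:
```
agent {
    # `-e NAME` with no value makes `docker run` copy NAME from the
    # agent's own environment into the task container
    extra_docker_arguments: ["-e", "MY_SECRET", "-e", "AWS_ACCESS_KEY_ID"]
}
```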
Can you send a more comprehensive log? Perhaps there are other related messages.
Which logs do you need?
docker-compose with an entrypoint.sh that runs:
```
python3 -m clearml_agent daemon --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" \
    --force-current-version ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS} \
    --queue office
```
I already increased the memory to 8 GB after reading similar issues here on the Slack.
Just making sure, how exactly did you do that?
docker-compose down, set in docker-compose.yml:
```
elasticsearch:
  networks:
    - backend
  container_name: clearml-elastic
  environment:
    ES_JAVA_OPTS: -Xms8g -Xmx8g
```
then docker-compose up -d
Since it is a single node, I guess it will not be possible to recover or partially recover the index, right?
```
Process failed, exit code 1
task ab1a90dacb9042eea8e4a6a16640d7f4 pulled from 8f06b6b160c14a3591d791c1885b309e by worker test:gpu1
Running task 'ab1a90dacb9042eea8e4a6a16640d7f4'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.kbkz1n40.txt', '/tmp/.clearml_agent_out.kbkz1n40.txt'
Current configuration (clearml_agent v1.0.0, location: /tmp/.clearml_agent.3e6l7juj.cfg):
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes ...
```
Solving the replica issue allowed me to get better insight into why that one index is red:
```
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-11-09T22:30:47.018Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a...
```
The output seen above indicates that the index is corrupt and probably lost, but that is not necessarily the case.
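For context, the JSON above is the output of the allocation-explain API. A hedged sketch of how to query it and, strictly as a last resort, ask Elasticsearch to promote a stale shard copy if one still exists on disk (NODE_NAME below is a placeholder; take the real node name from GET /_cat/nodes):
```python
import requests

ES = "http://localhost:9200"  # assuming the ES port is reachable from the host
INDEX = "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"

# Reproduce the diagnosis shown above for the red index.
explain = requests.get(
    f"{ES}/_cluster/allocation/explain",
    json={"index": INDEX, "shard": 0, "primary": True},
).json()
print(explain.get("allocate_explanation"))

# Last resort: if a stale copy of the shard is still present on the data
# path, this forces it to be allocated as primary, accepting possible loss
# of the most recent documents.
requests.post(
    f"{ES}/_cluster/reroute",
    json={"commands": [{"allocate_stale_primary": {
        "index": INDEX, "shard": 0, "node": "NODE_NAME",
        "accept_data_loss": True,
    }}]},
)
```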
So I don't need `docker_internal_mounts` at all?
The agents also share the clearml.conf file, which causes some issues with the worker_id/worker_name. They all want to be ubuntu:gpu0. Any idea how I can randomize it? Setting the CLEARML_WORKER_ID env var somehow does not work.
OK, it is more of a Docker issue; reading the thread, I guess it is not feasible.
Exactly, all agents should share the cache that is mounted via nfs. I think it is working now 🙂
```
2021-05-06 13:46:34.032391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:a1:00.0 name: NVIDIA Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-05-06 13:46:34.032496: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: ...
```
So you say deleting other old indices that I don't need could help?
This did not help, I still have the same issue
```
Executing: ['docker', 'run',......]
chown: changing ownership of '/root/.cache/pip': Operation not permitted
Get:1 focal-security InRelease [114 kB]
Get:2 focal InRelease [265 kB]
Get:3 focal-updates InRelease [114 kB
```
It is at the top of the logs
AgitatedDove14 I created a new clean venv and freshly installed the clearml-agent under Python/pip 3.8, and now it is working again. Still don't know what caused this issue. Thank you very much for helping!
My code now produces an error inside one of the threads, but that should be an issue on my side. Still, this error inside a child thread was not detected as a failure and the training task ended up as "completed". This happens now with Task.init inside the if __name__ == "__main__": block, as seen in the code snippet above.
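For context (a generic Python sketch, not the fastMRI code): an exception raised inside a worker thread dies with that thread, so the main process still exits with code 0 and the agent reports a completed task; collecting and re-raising it in the main thread (or explicitly marking the task as failed) makes the failure visible. run_training_step here is a hypothetical function:
```python
import threading

errors = []

def worker():
    try:
        run_training_step()  # hypothetical function that raises
    except Exception as exc:
        # Exceptions never propagate out of a Thread on their own;
        # record them so the main thread can act on the failure.
        errors.append(exc)

t = threading.Thread(target=worker)
t.start()
t.join()

if errors:
    # Re-raising here makes the process exit non-zero, so the agent
    # marks the task as failed instead of completed.
    raise errors[0]
```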
That's it? No apparent error?
After the logs at the top there were only "info"-level logs from PluginsService.
I do have this setting in my clearml.conf file:
```
venvs_cache: {
    free_space_threshold_gb: 50.0
    path: /clearml-cache/venvs-cache
}
```
So it should cache the venvs, right? I also see content in the /clearml-cache/venvs-cache folder. Because I have venvs_cache configured, there is nothing in venvs-build, since it uses the cache?
But this would still be part of the clearml.conf, right? I would prefer a way around clearml.conf to avoid resolving the variables.