==> 2021-03-11 13:54:59 <==
# cmd: /home/tim/miniconda3/condabin/conda create --yes --mkdir --prefix /home/tim/.clearml/venvs-builds/3.8 python=3.8
# conda version: 4.9.2
+defaults/linux-64::_libgcc_mutex-0.1-main
+defaults/linux-64::ca-certificates-2021.1.19-h06a4308_1
+defaults/linux-64::certifi-2020.12.5-py38h06a4308_0
+defaults/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7
+defaults/linux-64::libedit-3.1.20191231-h14c3975_1
+defaults/linux-64::libffi-3.3-he6710b0_2
+defaults/linux-64...
I got the error again. Seems to happen only when I try to delete "large" experiments.
Okay, I see. Unfortunately, I don't get how clearml tasks are intended to be used. Could you help me with that? (see code)
```python
def start_carla_factory():
    task = ...  # How do I create this task?
    long_blocking_call_to_start_carla()
    return task

pipe = PipelineController(
    name="carla-autostart",
    project="rlad/carla-servers",
    version="0.0.1",
    add_pipeline_tags=False,
)
pipe.add_step(name="start-carla", base_task_factory=start_carla_factory)
pipe.start()
```
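(For reference, a minimal sketch of what the factory body could look like, assuming `base_task_factory` is handed the pipeline node and must return a `clearml.Task`; the names below are placeholders, not a confirmed answer from the thread:)
```python
from clearml import Task

def start_carla_factory(node):
    # Task.create() builds a new draft task without taking over the
    # current process; project and task names are placeholders.
    task = Task.create(
        project_name="rlad/carla-servers",
        task_name="start-carla",
    )
    long_blocking_call_to_start_carla()  # user-defined helper, not shown
    return task
```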
SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with a much larger swap, so the server only slows down but does not kill anything. Unfortunately, the kernel logs also do not show much (maybe I have my server logs misconfigured; I am no expert).
What is interesting, though, is that docker showed only my nginx, minio and docker-registry containers as exited, while all the clearml containers were still running. I restarted ...
Quick question: Where again does clearml place the venv? I want to take a look at it after the task has failed.
Sounds good. I think it is obvious that immutability then has to be managed by the user, but this is no different from not using clearml-data, so it is not a disadvantage in my opinion.
And clearml-agent should pull these datasets from network storage...
AgitatedDove14 SuccessfulKoala55 Could you briefly explain whether clearml supports no-copy add for datasets?
Yeah, the real problem is that I have very large datasets in network storage. I am looking for a way to add the datasets on the network storage as a clearml-dataset.
Yeah, the clearml-data entry is immutable, but the underlying data is not if I just store a pointer to some location.
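(A hedged sketch of what such a no-copy add could look like, assuming `Dataset.add_external_files` registers links to files that stay on the network storage instead of copying them; the bucket URL is a placeholder:)
```python
from clearml import Dataset

ds = Dataset.create(dataset_name="large-dataset", dataset_project="rlad")
# Register links only; the files stay at the source URL (placeholder).
ds.add_external_files(source_url="s3://my-bucket/datasets/train/")
ds.upload()    # uploads metadata/links, not the file contents
ds.finalize()
```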
Maybe a related question: has anyone ever worked with datasets larger than the clearml-agent cache? A colleague of mine has a dataset of ~1 terabyte...
I'll add creating an issue to my todo list.
Okay, thanks for the info! I am currently not using k8s, but may be good to know for the future.
Hey, thank you for answering.
I know this issue and I have it sometimes, but my current issue is a direct result of me trying to make SSL work. So I am not asking for help in solving my problem, but only for help on how to debug it: finding out which step leads to the artifact not being deleted (e.g. the fileserver cannot be reached from wherever the delete request is sent).
For now I can tell you that with `conda_freeze: true` it fails, but with `conda_freeze: false` it works!
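(For reference, a sketch of where that flag lives in clearml.conf, assuming the usual `agent.package_manager` section layout:)
```
agent {
    package_manager {
        type: conda
        conda_freeze: false  # true fails for me, false works
    }
}
```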
@<1523701205467926528:profile|AgitatedDove14> Thank you very much for your guidance. Setting these manually works for me!
Good idea. No, clearml-agent does not crash and works fine afterwards. Then it is probably some other problem with my machine. Thank you!
No reason in particular. How many people work at http://allegro.ai?
Hey Martin, thank you for answering!
I see your point; however, in my opinion this is really unexpected behavior. Sure, I can do some work to make it "safe", but shouldn't that be the default? I.e., throw an error when there is no clearml.conf and expect `CLEARML_USE_DEFAULT_SERVER=1`.
Thanks for researching this issue. If you have time, you can create the issue since you are way more knowledgeable, but I can also open it if you do not have time 🙂
I just manually went into the docker container, ran `python -m venv env --system-site-packages`, and activated the virtual env.
When I then run `pip list`, it correctly shows the preinstalled packages, including torch 1.12.0a0+2c916ef.
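(Spelled out, the commands inside the container were roughly the following; the activation line is the standard one and is assumed here:)
```
python -m venv env --system-site-packages
source env/bin/activate
pip list   # shows the preinstalled packages, e.g. torch 1.12.0a0+2c916ef
```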
AgitatedDove14 Could you elaborate?
I have a carla.egg file on my local machine and on the worker that I include with `sys.path.append` before I can do `import carla`. It is the same procedure on my local machine and on the clearml-agent worker.
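(Roughly, the import setup on both machines looks like this; the egg path is a placeholder:)
```python
import sys

# Make the packaged CARLA client importable; the egg path is a placeholder.
sys.path.append("/path/to/carla.egg")
import carla
```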
Hi TimelyMouse69 Thank you for your answer.
I use 3.10.8 locally and 3.10.6 remotely. Everything is run in a docker container, locally and remotely on the docker-agent (exactly the same docker image).
Thank you for looking into the disappearing `dev`. It seems like this should be the reason why pip tries to install a stable version of 1.14, which only exists as a nightly build.
Unfortunately, not. Quick question: is there caching happening somewhere besides `.clearml`? Does the boto3 driver create a cache?