Hi SuccessfulKoala55
I meant that in the WebUI deletion should only be allowed for artifacts for which deletion actually works.
For example, I now have a lot of lingering artifacts that exist on the fileservers, but not on the clearml-api-server (I think).
Another example: I delete a task via the WebUI. The ClearML server tries to delete the task and the artifacts belonging to it. However, it will show that the task has been deleted successfully even though some artifacts have not been. Now there is no way...
@<1523701994743664640:profile|AppetizingMouse58> Thank you very much. I forgot the volume mapping.
So can I just add the config to the async_delete container and mirror the directory structure from GitHub?
volumes:
- /opt/clearml/config:/opt/clearml/config
- /opt/clearml/logs:/var/log/clearml
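Something like this is what I had in mind for the compose file (just a sketch; the service name and image are what I see in my local docker-compose.yml, so treat them as assumptions):
  async_delete:
    image: allegroai/clearml:latest
    volumes:
      - /opt/clearml/config:/opt/clearml/config
      - /opt/clearml/logs:/var/log/clearml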
Perfect, thank you 🙂
Alright, that's unfortunate. But thank you very much!
If you think the explanation takes too much time, no worries! I do not want to waste your time on my confusion 😄
SweetBadger76 I am using the Cleanup Service
Thank you very much. I am going to try that.
Is sdk.development.default_output_uri used with s3://ip:9000/clearml or with ip:9000/clearml?
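To make the question concrete, this is the kind of entry I mean (only a sketch; <minio-ip> is a placeholder and I am not sure whether the s3:// scheme belongs there, hence the question):
sdk {
    development {
        default_output_uri: "s3://<minio-ip>:9000/clearml"
    }
}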
==> 2021-03-11 13:54:59 <==
# cmd: /home/tim/miniconda3/condabin/conda create --yes --mkdir --prefix /home/tim/.clearml/venvs-builds/3.8 python=3.8
# conda version: 4.9.2
+defaults/linux-64::_libgcc_mutex-0.1-main
+defaults/linux-64::ca-certificates-2021.1.19-h06a4308_1
+defaults/linux-64::certifi-2020.12.5-py38h06a4308_0
+defaults/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7
+defaults/linux-64::libedit-3.1.20191231-h14c3975_1
+defaults/linux-64::libffi-3.3-he6710b0_2
+defaults/linux-64...
I got the error again. Seems to happen only when I try to delete "large" experiments.
Okay, I see. Unfortunately, I don't get how clearml tasks are intended to be used. Could you help me with that? (see code)
def start_carla_factory():
    task = ...  # How do I create this task?
    long_blocking_call_to_start_carla()
    return task

pipe = PipelineController(
    name="carla-autostart",
    project="rlad/carla-servers",
    version="0.0.1",
    add_pipeline_tags=False,
)
pipe.add_step(name="start-carla", base_task_factory=start_carla_factory)
pipe.start()
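This is roughly what I imagine the factory would need to look like, but I am guessing at both the factory signature and whether Task.create is the right call here:
from clearml import Task

def start_carla_factory(node):
    # assumption: the factory receives a pipeline node and must return a Task;
    # long_blocking_call_to_start_carla() is my own placeholder
    task = Task.create(
        project_name="rlad/carla-servers",
        task_name="start-carla",
    )
    long_blocking_call_to_start_carla()
    return task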
SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with a much larger swap, so the server only slows down but does not kill anything. Unfortunately, the kernel logs also do not show much (maybe my server logs are misconfigured; I am no expert).
What is interesting, though, is that docker showed only my nginx, minio and docker-registry containers as exited, while all the clearml containers were still running. I restarted ...
But here is the funny thing:
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0
This installs the GPU version.
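For completeness, the full environment file I am testing with looks roughly like this (the name is arbitrary):
name: torch-gpu-test
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit=11.1.1
  - pytorch=1.8.0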
Quick question: where again does clearml place the venv? I want to take a look at it after the task has failed.
Sounds good. I think it is obvious that immutability has to be managed by the user then, but this is no different from not using clearml-data, so it's not a disadvantage in my opinion.
And clearml-agent should pull these datasets from network storage...
AgitatedDove14 SuccessfulKoala55 Could you briefly explain whether clearml supports no-copy add for datasets?
Yea, the real problem is that I have very large datasets in network storage. I am looking for a way to add the datasets on the network storage as clearml-datasets.
Yea, the clearml-data is immutable, but not the underlying data if I just store a pointer to some location.
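To make it concrete, this is the kind of no-copy flow I am hoping exists (the project/dataset names and paths are made up, and add_external_files is my guess at how link-only registration might work):
from clearml import Dataset

# sketch: register links to files that stay on the network storage (no copy)
ds = Dataset.create(dataset_name="my-large-dataset", dataset_project="rlad/datasets")
ds.add_external_files(source_url="file:///mnt/nas/my-large-dataset")
ds.upload()    # I assume this only records the links/metadata here
ds.finalize()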
Maybe a related question: has anyone ever worked with datasets larger than the clearml-agent cache? A colleague of mine has a dataset of ~1 terabyte...
I'll add creating an issue to my to-do list.
Okay, thanks for the info! I am currently not using k8s, but may be good to know for the future.
Hey, thank you for answering.
I know this issue and I have it sometimes, but my current issue is a direct result of me trying to make SSL work. So I am not asking for help in solving my problem, but only for help with how to debug it: finding out which step leads to the artifact not being deleted (e.g. the fileserver cannot be reached from wherever the delete request is sent).
For now I can tell you that with conda_freeze: true it fails, but with conda_freeze: false it works!
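For reference, this is the toggle I am flipping (as far as I understand it sits under the agent's package_manager section in clearml.conf, but that location is from memory):
agent {
    package_manager {
        conda_freeze: false  # works for me; with true it fails
    }
}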
@<1523701205467926528:profile|AgitatedDove14> Thank you very much for your guidance. Setting these manually works for me!
Good idea. No, clearml-agent does not crash and works fine afterwards. Then it is probably some other problem with my machine. Thank you!