With clearml==1.4.1 it works, but with the current version it aborts. Here is a log with the latest clearml:
Okay, I found something out: when I use the docker image ubuntu:22.04, it does not spin up a service agent and aborts the task. When I use python:latest, everything works fine!
@<1523701435869433856:profile|SmugDolphin23> Good catch. I have a good but unsatisfying message for you guys: I restarted the whole machine (server and agent) and now it works fine ...
Thank you for clearing that up 🙂
Just multiple users who do not share their repositories. So sharing with the agent is also not possible.
Well, I guess no hurdles vs. safety is inherently not solvable. I am all for hurdles, as long as it is clear how to overcome them. And in my opinion, referring to clearml-init makes sense from both a developer and a user perspective.
Thank you very much for the quick answer. Still so confusing to me that so many things are configured client side 😄
Here is some context on what I am currently trying to do (pseudocode):
```
def run_experiment(args):
    ...

def get_task_experiment():
    task = Task.init(...)
    task.bind_run(run_experiment)
    return task

def run_with_pipeline(task):
    pipe = PipelineController(...)
    pipe.add_step(prepare_something...)
    pipe.add_step(task)
    pipe.add_step(postprocess_something...)
    return pipe

if __name__ == "__main__":
    task = get_task_experiment()
    # Run without Pipeline
    if ...
```
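For reference, the actual PipelineController API wires task steps together by task id or project/task name rather than by binding a callable. A minimal sketch of roughly the same flow under that assumption (project, task, and queue names below are placeholders):

```python
from clearml import PipelineController

# Rough sketch of the pseudocode above using the real API; step tasks are
# referenced by project/name instead of being bound as callables.
pipe = PipelineController(name="experiment-pipeline", project="my_project", version="1.0")
pipe.add_step(name="prepare", base_task_project="my_project",
              base_task_name="prepare_something")
pipe.add_step(name="experiment", parents=["prepare"],
              base_task_project="my_project", base_task_name="run_experiment")
pipe.add_step(name="postprocess", parents=["experiment"],
              base_task_project="my_project", base_task_name="postprocess_something")
pipe.start(queue="services")  # or pipe.start_locally() while debugging
```

There is also `add_function_step` for wrapping a Python function as a step directly, which is closer to the `bind_run` idea in the pseudocode.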
Same here with most recent clearml-server 😞
Yea. Not using the config file does not seem like a good long-term solution for me. However, I still have no idea why this error happens. But enough for today. Thank you a lot for your help!
Type "help", "copyright", "credits" or "license" for more information.
>>> from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
>>> get_driver_cuda_version()
'110'
The script is intended to be used something like this: `script.py train my_model --steps 10000 --checkpoint-every 10000` or `script.py test my_model --steps 1000`
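For illustration only, a minimal argparse skeleton that would accept those command lines; the subcommands and flag names come from the example above, everything else is assumed:

```python
import argparse

# Hypothetical CLI skeleton matching the usage shown above.
parser = argparse.ArgumentParser(prog="script.py")
subparsers = parser.add_subparsers(dest="command", required=True)

train = subparsers.add_parser("train")
train.add_argument("model")                            # e.g. my_model
train.add_argument("--steps", type=int, default=10000)
train.add_argument("--checkpoint-every", type=int, default=10000)

test = subparsers.add_parser("test")
test.add_argument("model")
test.add_argument("--steps", type=int, default=1000)

args = parser.parse_args()
print(args)  # e.g. Namespace(command='train', model='my_model', steps=10000, ...)
```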
You mean I should have opencv/ffmpeg available on the clearml-server machine?
Thanks, I will consider that!
Yes, I do not want to rely on the clearml-agent. Afaik the clearml-sdk in the container does the downloading and since a host directory is mounted, it is mirrored there. If it was possible to not mount the host directory, everything would be contained 🙂
CostlyOstrich36 Actually no container exits, so I guess if it's because of OOM like SuccessfulKoala55 implies, then maybe a process inside the container gets killed and the container hangs? Is this possible?
SuccessfulKoala55 I did not observe elastic using much RAM (at least right after starting). Doesn't this line in the docker-compose control the RAM usage? `ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true`
Hey, that is unfortunately not possible as there are multiple projects from different users.
Yes, that looks alright. Similar to before. Local execution works.
Hey AgitatedDove14 is there any update on this?
For example, I run a task remotely. Now I decide I want to rerun it, but with a slightly changed parameter. So I clone the task, edit the parameter in the WebUI, and submit the task to a queue. When the clearml-agent pulls the task and tries to install the requirements, it fails because the task requirements now contain packages that had been preinstalled in the original environment (e.g. in the nvidia docker image). These packages may not be available via pip, so the run fails.
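The same clone-edit-enqueue flow can also be done through the SDK instead of the WebUI; a minimal sketch, assuming the parameter lives in the default General section and that the project, task, and queue names exist (all placeholders here):

```python
from clearml import Task

# Clone an existing task, tweak one parameter, and enqueue the clone.
source = Task.get_task(project_name="my_project", task_name="my_experiment")
cloned = Task.clone(source_task=source, name="my_experiment (param tweak)")
cloned.set_parameter("General/learning_rate", 0.001)  # hypothetical parameter
Task.enqueue(cloned, queue_name="default")
```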
It seems to work when I enable `conda_freeze`.
Yea, I am still trying to get docker to work with clearml. I do not have much experience with docker besides creating Dockerfiles, and it seems like the ~/.ssh/config ownership is broken when it is mounted into the container on my workstations.
Locally it works fine.
Could you guide me to the documentation for using the docker file? I am not able to find it. I only found `task.set_base_docker`, and I am not sure what it does.
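As far as I understand it, `task.set_base_docker` records the default docker image (and optional docker arguments) that an agent running in docker mode should use for the task. A small sketch, assuming a recent clearml version where the keyword form is available; the image and flags are placeholders:

```python
from clearml import Task

task = Task.init(project_name="my_project", task_name="docker-example")  # placeholders
# When an agent later runs this task with --docker, it should pick up this image.
task.set_base_docker(
    docker_image="python:3.9",
    docker_arguments="--ipc=host",  # extra `docker run` flags (assumed example)
)
```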