Seems more like a bug or something is not properly configured on my side.
I think doing all that work is not worth it right now, I am just trying to understand why ClearML does not seem to be designed to allow something like this:
```
task_name = args.task_name
task = Task()
task.load_statedict(await Task.load_or_create(task_name))
task.requirements.add(...)
await task.synchronize()
task.execute_remotely(queue_name, exit=True)
```
You mean I should have opencv/ffmpeg available on the clearml-server machine?
Both, actually. So what I personally would find intuitive is something like this:
```
class Task:
    def load_statedict(self, state_dict):
        pass

    async def synchronize(self):
        ...

    async def task_execute_remotely(self):
        await self.synchronize()
        ...

    def add_requirement(self, requirement):
        ...

    @classmethod
    async def init(cls, task_name):
        task = Task()
        task.load_statedict(await Task.load_or_create(task_name))
        await tas...
```
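For reference, the closest I can get to this flow with the current synchronous API looks roughly like the sketch below (project, package, and queue names are just placeholders, and `add_requirements` has to be called before `Task.init` as far as I understand):

```python
from clearml import Task

# Placeholder names -- this is only a sketch of the current synchronous flow,
# not a drop-in replacement for the async design above.
task_name = "my_experiment"
queue_name = "default"

# Extra requirements have to be registered before Task.init() creates/reuses the task
Task.add_requirements("opencv-python")

task = Task.init(project_name="demo", task_name=task_name)

# Stop local execution and enqueue the task for a remote agent
task.execute_remotely(queue_name=queue_name, exit_process=True)
```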
CostlyOstrich36 Actually no container exits, so I guess if it's because of OOM like SuccessfulKoala55 implies, then maybe a process inside the container gets killed and the container will hang? Is this possible?
SuccessfulKoala55 I did not observe elastic using much RAM (at least right after starting). Doesn't this line in the docker-compose control the RAM usage? `ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true`
SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with a much larger swap, so the server only slows down instead of killing something. Unfortunately, the kernel logs also do not show much (maybe my server logs are misconfigured, I am no expert).
What is interesting, though, is that docker only showed my nginx, minio, and docker-registry containers as exited, while all the clearml containers were still running. I restarted ...
It could be that the log is clean because of the restart. Unfortunately, I restarted the server right away 😞 I will post the appropriate logs if it happens again.
Thank you very much, I didn't know about that 🙂
Yea, is there a guarantee that the clearml-agent will not crash because it did not clean the cache in time?
But the problems seem to be recurring.
torch.utils.tensorboard is the same as tensorboardX: https://github.com/pytorch/pytorch/blob/6d45d7a6c331ddb856ac34a76bcd3613aa05185b/torch/utils/tensorboard/summary.py#L461
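In case it is useful, a minimal sketch showing the two writers are interchangeable; `add_video` is the call whose summary code (linked above) pulls in moviepy/ffmpeg, and the log dir and tensor shapes here are just for illustration:

```python
import torch

# Both packages expose the same SummaryWriter interface
from torch.utils.tensorboard import SummaryWriter   # bundled with PyTorch
# from tensorboardX import SummaryWriter            # drop-in equivalent

writer = SummaryWriter(log_dir="./tb_logs")

# (N, T, C, H, W) batch with one short dummy clip -- add_video() is the path
# that relies on moviepy (and thus ffmpeg) to encode the video summary
video = torch.randint(0, 255, (1, 8, 3, 64, 64), dtype=torch.uint8)
writer.add_video("dummy_clip", video, global_step=0, fps=4)
writer.close()
```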
Good idea. No, clearml-agent does not crash and works fine afterwards. Then it is probably some other problem with my machine. Thank you!
Ah, perfect. Did not know this. Will try! Thanks again! 🙂
AgitatedDove14 Thank you, that explains it.
I can put anything there: `s3://my_minio_instance:9000/bucket_that_does_not_exist` and it will work.
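To make that concrete, and assuming the setting in question is the task's `output_uri` (my assumption), this is the kind of thing I mean:

```python
from clearml import Task

# Assumption: "there" above refers to the task's output_uri.
# The endpoint and bucket below do not exist, which is exactly the point being made.
task = Task.init(
    project_name="demo",             # placeholder
    task_name="minio-output-check",  # placeholder
    output_uri="s3://my_minio_instance:9000/bucket_that_does_not_exist",
)
```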
To summarize: the scheduler should assign tasks first to the agent that gives that queue the highest priority.
What I get for `args` when I print it locally is not the same as what ClearML extracts from `args`.
Then, if the first agent is assigned a task from queue B and the next task is from queue A, that task will have to wait, even though in theory there would have been capacity for it if the first task had been executed on the second agent initially.
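Purely to illustrate that point (this is not how the ClearML scheduler is actually implemented, and the agent/queue names are made up), a toy sketch of the two assignment strategies:

```python
# Two agents: agent1 serves queues A and B (A has priority), agent2 serves only B.
agents = {
    "agent1": ["A", "B"],
    "agent2": ["B"],
}

def naive_assign(queue, free_agents):
    """Assign to the first free agent that serves the queue."""
    for name in free_agents:
        if queue in agents[name]:
            return name
    return None

def priority_aware_assign(queue, free_agents):
    """Prefer the free agent for which this queue has the highest priority,
    i.e. appears earliest in that agent's queue list."""
    candidates = [n for n in free_agents if queue in agents[n]]
    return min(candidates, key=lambda n: agents[n].index(queue)) if candidates else None

for assign in (naive_assign, priority_aware_assign):
    free = ["agent1", "agent2"]
    print(assign.__name__)
    for queue in ["B", "A"]:  # a B task arrives first, then an A task
        chosen = assign(queue, free)
        if chosen:
            free.remove(chosen)
        print(f"  task from queue {queue} -> {chosen or 'has to wait'}")
# naive_assign sends B to agent1, so the A task has to wait;
# priority_aware_assign sends B to agent2, and agent1 stays free for A.
```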
btw: I am pretty sure this used to work, but then stopped working some time ago.
Yea, but it could also be for other reasons. I'll try to find out somehow.
How can I get the agent log?
It is not explained there, but do you mean `CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-}` and `CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY:-}`?
However, to use conda as the package manager I need a docker image that provides conda.
I see. Thanks a lot!
Yea, tensorboardX is using moviepy.
It didn't revert. Just one of my colleagues that I wanted to introduce to clearml put his clearml.conf in the wrong directory and pushed his experiments to the public server.
So I do not blame clearml for this mistake, but generally, designing the system to be fail-safe is better than hoping that everything is used the way it was designed 🙂