I guess this is the current way to do it: https://github.com/tensorflow/tensorboard/issues/39#issuecomment-568917607 so I would say: yes, it supports GIF.
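From the PyTorch side, one way to get an animation into TensorBoard is add_video (a sketch; the tag and tensor are illustrative, and it needs moviepy installed):
` import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
# (N, T, C, H, W) batch of video frames; shown as an animation in TensorBoard
frames = torch.randint(0, 255, (1, 16, 3, 64, 64), dtype=torch.uint8)
writer.add_video("rollout", frames, global_step=0, fps=8)
writer.close() `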
Is this working in the latest version? clearml-agent falls back to /usr/bin/python3.8
no matter how I configure clearml.conf.
Just want to make sure, so that I can investigate what is wrong with my machine if it works for you.
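For reference, this is roughly what I am setting (a sketch of my clearml.conf; I am assuming agent.python_binary is still the relevant key, and the path is just an example):
` agent {
    # explicit interpreter the agent should use instead of the auto-detected one
    python_binary: "/usr/bin/python3.7"
} `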
Installed packages:
` # Python 3.7.10 (default, Feb 26 2021, 18:47:35) [GCC 7.3.0]
absl-py==0.12.0
aiostream==0.4.2
attrs==20.3.0
cached-property==1.5.2
cffi==1.14.5
chardet==4.0.0
clearml==0.17.5
cython==0.29.22
dm-control==0.0.364896371
dm-env==1.4
dm-tree==0.1.5
fasteners==0.16
furl==2.1.0
future==0.18.2
glfw==2.1.0
gym==0.18.0
h5py==3.2.1
humanfriendly==9.1
idna==2.10
imageio-ffmpeg==0.4.3
importlib-metadata==3.7.3
jsonschema==3.2.0
labmaze==1.0.4
lxml==4.6.3
moviepy==1.0.3
mujoco-py==...
Maybe this opens up another question, which is more about how clearml-agent is supposed to be used. The "pure" way would be to make the docker image provide everything, so that clearml-agent does no setup at all.
What I currently do instead is let the docker image provide all the system dependencies and have clearml-agent set up all the python dependencies, as sketched below. This lets me reuse one docker image across different experiments. However, then it would make sense to have as many configs as possib...
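That split looks roughly like this (a sketch; the queue and image names are placeholders):
` # image provides CUDA, ffmpeg, etc.; the agent installs the python packages per task
clearml-agent daemon --queue default --docker my_base_image:latest `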
Hi CostlyOstrich36, thank you for answering so quickly. I don't think that's how it works, because if it were, one would always have to match the local machine to the server's. AFAIK ClearML finds the correct PyTorch version, but I was not sure how (custom logic vs. pip doing it).
An upload of 11 GB took around 20 hours, which cannot be right. Do you have any idea whether ClearML could have something to do with this slow upload speed? If not, I am going to start debugging the hardware/network.
Thanks for your help again. I will just use detect_with_conda_freeze: true. Seems like a perfect solution for me!
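For anyone finding this later, this is where I put it (a sketch of my clearml.conf; I am assuming the key lives under sdk.development):
` sdk {
    development {
        # capture the environment with a conda-freeze style listing
        detect_with_conda_freeze: true
    }
} `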
Ah, it actually is also a string with remote_execution, but still not what it should be.
Seems more like a bug, or something not properly configured on my side.
I think doing all that work is not worth it right now; I am just trying to understand why ClearML does not seem to be designed for something like this:
` task_name = args.task_name
task = Task()
task = task.load_statedict(await Task.load_or_create(task_name))
task.requirements.add(...)
await task.synchronize()
task.execute_remotely(queue_name, exit=True) `
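For comparison, the closest I get with the actual API is something like this (a sketch; task_name and queue_name are placeholders, and as far as I understand add_requirements has to be called before Task.init):
` from clearml import Task

Task.add_requirements("opencv-python")  # queued up before init, applied to the remote run
task = Task.init(project_name="examples", task_name=task_name)
task.execute_remotely(queue_name=queue_name, exit_process=True) `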
You mean I should have opencv/ffmpeg available on the clearml-server machine?
Both, actually. So what I personally would find intuitive is something like this:
` class Task:
    def load_statedict(self, state_dict):
        pass

    async def synchronize(self):
        ...

    async def task_execute_remotely(self):
        await self.synchronize()
        ...

    def add_requirement(self, requirement):
        ...

    @classmethod
    async def init(cls, task_name):
        task = Task()
        task.load_statedict(await Task.load_or_create(task_name))
        await tas...
CostlyOstrich36 Actually, no container exits, so I guess if it's because of OOM, like SuccessfulKoala55 implies, then maybe a process inside the container gets killed and the container hangs? Is this possible?
SuccessfulKoala55 I did not observe elastic using much RAM (at least right after starting). Doesn't this line in the docker-compose control the RAM usage?
ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
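That line sits in the elasticsearch service block, roughly like this (a sketch of the stock clearml-server docker-compose.yml, other keys omitted):
` services:
  elasticsearch:
    environment:
      # caps the JVM heap at 2 GB; the process can still use more off-heap memory
      ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true `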
SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with a much larger swap, so the server only slows down instead of killing something. Unfortunately, the kernel logs also do not show much (maybe my server logs are misconfigured; I am no expert).
What is interesting, though, is that docker showed only my nginx, minio and docker-registry containers as having exited, while all the clearml containers were still running. I restarted ...
Could be a clean log after the restart. Unfortunately, I restarted the server right away 😞 I am going to post the appropriate logs if it happens again.
Thank you very much, I didn't know about that 🙂
Yea, is there a guarantee that the clearml-agent will not crash because it did not clean the cache in time?
But the problems seem to be recurring.
torch.utils.tensorboard is the same as tensorboardX: https://github.com/pytorch/pytorch/blob/6d45d7a6c331ddb856ac34a76bcd3613aa05185b/torch/utils/tensorboard/summary.py#L461
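i.e. the writers are drop-in equivalents (a minimal sketch):
` from torch.utils.tensorboard import SummaryWriter  # bundled with pytorch
# from tensorboardX import SummaryWriter            # same API, separate package

writer = SummaryWriter("runs/demo")
writer.add_scalar("loss", 0.5, global_step=0)
writer.close() `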
Good idea. No, clearml-agent does not crash and works fine afterwards. Then it is probably some other problem with my machine. Thank you!
Ah, perfect. Did not know this. Will try! Thanks again! 🙂
SweetBadger76 I am using the Cleanup Service
AgitatedDove14 Thank you, that explains it.
I can put anything there: s3://my_minio_instance:9000/bucket_that_does_not_exist and it will work.
To summarize: the scheduler should first assign tasks to the agent that gives a queue the highest priority.
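i.e. with something like this, where the queue order is the priority (queue names are placeholders):
` # the agent polls high_priority first and only falls back to low_priority when it is empty
clearml-agent daemon --queue high_priority low_priority `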
What I get for args when I print it locally is not the same as what ClearML extracts from args.
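A minimal way to see the difference (a sketch, assuming an argparse-based script):
` import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--task_name", default="test")
args = parser.parse_args()

task = Task.init(project_name="examples", task_name="args-check")
print(args)                           # what I see locally
print(task.get_parameters_as_dict())  # what ClearML extracted `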