and saved locally, which is why the second task, which is not executed on the same machine, cannot access the file
That's how I would do it, maybe the folks from allegro-ai can come up with a better approach 👍
Ok thanks! And for this?
Would it be possible to support such a use case? (i.e. have the clearml-agent set up a different Python version when a task needs it?)
Sure, it’s because of a very annoying bug that I shared in this https://clearml.slack.com/archives/CTK20V944/p1648647503942759 , which I haven’t been able to solve so far.
I’m not sure you can downgrade that easily ...
Yeah, that’s what I thought. That’s a bit of a pain for me now; I hope I can find a way to fix the bug somehow
Actually I think I am approaching the problem from the wrong angle
That’s why I said “not really” 😄
Thanks for your input TenseOstrich47 , I'm considering using a secret manager now, I guess that's the best option. I can move the secrets wherever they need to be to make it work 🙂
Hi SmugDolphin23 , thanks for the input! Will try now, but that seems hacky: to get it working I have to specify python3.8 in two places:
once in the agent config file (agent.default_python is already python3.8, but it seems to be ignored), and once to make sure that interpreter is actually available (using the python:3.8 docker image). Is there a way to avoid this redundancy? I.e. if I want to change the Python version, can I control it from a single place?
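For reference, a minimal sketch of the two places involved, assuming the usual clearml.conf agent section and a docker-mode agent (exact keys and value formats may differ between versions):
` # clearml.conf on the agent machine (assumed layout)
agent {
    # interpreter the agent uses when building the task environment
    default_python: "3.8"

    # image used when running in docker mode; it must itself ship that interpreter
    default_docker {
        image: "python:3.8"
    }
} `
The interpreter the agent selects and the interpreter available inside the docker image are configured independently, which is where the redundancy comes from.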
SuccessfulKoala55 They do have the right filepath, e.g.:
https://***.com:8081/my-project-name/experiment_name.b1fd9df5f4d7488f96d928e9a3ab7ad4/metrics/metric_name/predictions/sample_00000001.png
This works well when I run the agent in virtualenv mode (remove --docker )
Hi AgitatedDove14 , so I ran 3 experiments:
One with my current implementation (using "fork")
One using "forkserver"
One using "forkserver" + the DataLoader optimization
I sent you the results via DM; here are the outcomes:
fork -> 101 mins, low RAM usage (5 GB constant), almost no IO
forkserver -> 123 mins, high RAM usage (16 GB, fluctuating), high IO
forkserver + DataLoader optimization -> 105 mins, high RAM usage (from 28 GB down to 16 GB), high IO
CPU/GPU curves are the same for the 3 experiments...
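For context, a rough sketch of the kind of setup being compared; the actual "DataLoader optimization" isn't spelled out above, so the persistent_workers / pin_memory flags below are only an assumption:
` import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # switch the worker start method ("fork" vs "forkserver")
    mp.set_start_method("forkserver", force=True)

    dataset = TensorDataset(torch.randn(256, 10))  # stand-in dataset

    # hypothetical DataLoader tweaks -- the real optimization may differ
    loader = DataLoader(
        dataset,
        batch_size=32,
        num_workers=4,
        persistent_workers=True,  # keep workers alive between epochs
        pin_memory=True,
    )

    for (batch,) in loader:
        pass  # training step would go here `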
Try to spin up the instance of that type manually in that region to see if it is available
Thanks! Unfortunately still not working, here is the log file:
Add carriage return flush support using the sdk.development.worker.console_cr_flush_period configuration setting (GitHub trains Issue 181)
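In case it helps anyone, a sketch of where that setting would go; I'm assuming it sits under the sdk section of clearml.conf and that the value is a flush period in seconds:
` sdk {
    development {
        worker {
            # flush console lines that use carriage returns every N seconds (value here is an assumption)
            console_cr_flush_period: 10
        }
    }
} `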
Hi AgitatedDove14 , that’s super exciting news! 🤩 🚀
Regarding the two outstanding points:
In my case, I’d maintain a client python package that takes care of the pre/post processing of each request, so that I only send the raw data to the inference service and I post process the raw output of the model returned by the inference service. But I understand why it might be desirable for the users to have these steps happening on the server. What is challenging in this context? Defining how t...
I am already trying with the latest pip 😞
AgitatedDove14 I eventually found a different way of achieving what I needed
And now that I restarted the server and went back into the project where I initially deleted the archived experiments, some of them are still there - I will leave them alone, too scared to do anything now 😄
I mean, when sending data from the clearml-agents, does it block the training while sending metrics, or is it done in parallel to the main thread?
But I see in the agent logs:
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...
AgitatedDove14 So what you are saying is that since I have trains-server 0.16.1, I should use trains>=0.16.1? And what about trains-agent? Only version 0.16 is released atm, which is the one I use
AgitatedDove14 yes, but I don't see in the docs how to attach it to the logger of the EarlyStopping handler
/data/shared/miniconda3/bin/python /data/shared/miniconda3/bin/clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only
super, thanks SuccessfulKoala55 !
I didn’t use ignite callbacks, for future reference:
` from ignite.engine import Events
from ignite.handlers import EarlyStopping

early_stopping_handler = EarlyStopping(...)

def log_patience(_):
    # report the handler's current patience counter to ClearML
    # (clearml_logger and engine are defined earlier in the training script)
    clearml_logger.report_scalar("patience", "early_stopping", early_stopping_handler.counter, engine.state.epoch)

engine.add_event_handler(Events.EPOCH_COMPLETED, early_stopping_handler)
engine.add_event_handler(Events.EPOCH_COMPLETED, log_patience) `
Hi TimelyPenguin76 ,
trains-server: 0.16.1-320
trains: 0.15.1
trains-agent: 0.16