Reputation
Badges 1
981 × Eureka!I’ve reindexed the data for the logs, now the mappings are correct but I am missing one month of data, I have literally no idea where this data is/how it disappeared
Here are the logs of the agent :)
` (base) user@worker:~$ tail -f /tmp/.clearml_agent_daemon_outjdups8t2.txt
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
+----------------------------------+--------+-------+
| id | name | tags |
+----------------------------------+--------+-------+
| 54e4a62a402d5135612ba7b12cfe4e57 | docker | |
+----------------------------------+--------+-------+
Starting infinite tas...
btw task._get_task_property('hyperparams') also gives me ValueError: Task has no hyperparams section defined
(I didn't have this problem so far because I was using ssh keys globaly, but I want know to switch to git auth using Personal Access Token for security reasons)
btw I see in the pytorch_distributed_example I see that you average_gradients , but from pytorch https://pytorch.org/tutorials/beginner/dist_overview.html it says:DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.
Hi PompousParrot44 , you could have a Controller task running in the services queue that periodically schedules the task you want to run
Stopping the server Editing the docker-compose.yml file, adding the logging section to all services Restarting the serverDocker-compose freed 10Go of logs
The parent task is a data_processing task, therefore I retrieve it so that I can then data_processed = parent_task.artifacts["data_processed"]
There is no way to filter on long types? I can’t believe it
Yes, it works now! Yay!
and saved locally, which is why the second task, not executed in the same machine, cannot access the file
Thats how I would do it, maybe guys from allegro-ai can come up with a better approach 👍
Ok thanks! And for this?
Would it be possible to support such use case? (have the clearml-agent setting-up a different python version when a task needs it?)
Sure, it’s because of a very annoying bug that I shared in this https://clearml.slack.com/archives/CTK20V944/p1648647503942759 , that I couldn’t solve so far.
I’m not sure you can downgrade that easily ...
Yea that’s what I thought, that’s a bit of pain for me now, I hope I can find a way to fix the bug somehow
That’s why I said “not really” 😄
Thanks for your input TenseOstrich47 , I was considering using a secret manager now, I guess that's the best option. I can move the secrets wherever I need them to be to make it work 🙂
Hi SmugDolphin23 thanks for the input! Will try now but that seems hacky: to have it working I have to specify python3.8 two times:
one in the agent config file (agent.default_python is already python3.8, but seems to be ignored) + make sure it is available (using python:3.8 docker image)Is there a way to prevent this redundancy? Ie. If I want to change the python version, I can control it from a single place?
SuccessfulKoala55 They do have the right filepath, eg:https://***.com:8081/my-project-name/experiment_name.b1fd9df5f4d7488f96d928e9a3ab7ad4/metrics/metric_name/predictions/sample_00000001.png
Try to spin up the instance of that type manually in that region to see if it is available
Thanks! Unfortunately still not working, here is the log file:
Add carriage return flush support using the sdk.development.worker.console_cr_flush_period configuration setting (GitHub trains Issue 181)
Hi AgitatedDove14 , that’s super exciting news! 🤩 🚀
Regarding the two outstanding points:
In my case, I’d maintain a client python package that takes care of the pre/post processing of each request, so that I only send the raw data to the inference service and I post process the raw output of the model returned by the inference service. But I understand why it might be desirable for the users to have these steps happening on the server. What is challenging in this context? Defining how t...
I am already trying with latest of pip 😞
AgitatedDove14 I eventually found a different way of achieving what I needed
And now that I restarted the server and went back into the project where I initially deleted the archived experiments, some of them are still there - I will leave them alone, too scared to do anything now 😄
AgitatedDove14 So what you are saying is that since I have trains-server 0.16.1, I should use trains>=0.16.1? And what about trains-agent? Only version 0.16 is released atm, this is the one I use
AgitatedDove14 yes but I don't see in the docs how to attach it to the logger of the earlystopping handler