Hi CostlyOstrich36 , this weekend I took a look at the diffs with the previous version ( https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0# ) and I saw several changes related to the scrolling/logging:
apiserver/bll/event/log_events_iterator.py
apiserver/bll/event/events_iterator.py
apiserver/config/default/services/_mongo.conf
apiserver/database/model/base.py
apiserver/services/events.py
I suspect that one of these changes might be responsible ...
Awesome! (Broken link in migration guide, step 3: https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/ )
No, they have different names - I will try to update both agents to the latest versions
I think it comes from the web UI of clearml-server version 1.2.0, because I didn't change anything else
Hi CostlyOstrich36 , one more observation: it looks like when I don't open the experiment in the web UI before it is finished, I get all the logs correctly. It is when I open the experiment in the web UI while it is running that I don't see all the logs.
So it looks like there is a caching effect: the logs are retrieved only once, when I open the experiment for the first time, and not (or only rarely) afterwards. Is that possible?
I don't think there is an example for this use case in the repo currently, but the code should be fairly simple (below is a rough draft of what it could look like)
` import time

from clearml import Task

# template_task_id and TRIGGER_TASK_INTERVAL_SECS are placeholders
controller_task = Task.init(...)
controller_task.execute_remotely(queue_name="services", clone=False, exit_process=True)

while True:
    periodic_task = Task.clone(source_task=template_task_id)
    # Change parameters of periodic_task if necessary
    Task.enqueue(periodic_task, queue_name="default")
    time.sleep(TRIGGER_TASK_INTERVAL_SECS) `
I opened an issue ( https://github.com/pytorch/ignite/issues/2343 ) in ignite's repo and a PR ( https://github.com/pytorch/ignite/pull/2344 ), could you please have a look? There might be a bug in clearml Task.init in distributed envs
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file to limit the size of the logs; since there is no limit by default, their size will grow forever, which doesn't sound ideal https://docs.docker.com/compose/compose-file/#logging
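A minimal sketch of what that could look like with the json-file logging driver (the service name and size limits below are just examples):
` services:
  apiserver:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"   # rotate each log file once it reaches 10 MB
        max-file: "3"     # keep at most 3 rotated files per container `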
There is no way to filter on long types? I can't believe it
Mmmh, unfortunately not easily… I will try to debug deeper today. Is there a way to resume a task from code, to debug locally?
Something like replacing Task.init with Task.get_task so that Task.current_task is the same task as the output of Task.get_task
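Purely illustrative, what I have in mind (a sketch; existing_task_id is a placeholder, and I don't know whether the API actually supports this):
` from clearml import Task

# Hypothetical: attach to an existing task instead of creating a new one
task = Task.get_task(task_id=existing_task_id)  # placeholder id

# ... so that the rest of the script behaves as if it ran inside that task:
assert Task.current_task() is task  # this is the behavior I am after `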
Thanks a lot, I will play with that!
I still don't see why you would change the type of the cloned Task, I'm assuming the original Task had the correct type, no?
Because it is easier for me to create the training task by cloning the controller task (so that the parameters are prefilled and I can set the parent task id)
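Roughly like this (a sketch, assuming the controller is the current task; the task name and queue are examples):
` from clearml import Task

controller_task = Task.current_task()
# Clone the controller so its parameters come prefilled, and link it back as parent
training_task = Task.clone(
    source_task=controller_task,
    name="training",
    parent=controller_task.id,
)
Task.enqueue(training_task, queue_name="default") `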
So the migration from one server to another + adding new accounts with password worked, thanks for your help!
I don't have a registry to push my image to. I think I can get around it actually - will it work if I just build the image locally once, then start the agent? Docker would recognise that image locally and just use it, right? I won't need to update that image often anyway
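Something along these lines (a sketch; the image tag and queue name are examples):
` # build the image once on the agent machine
docker build -t my-image:latest .
# start the agent in docker mode, using that local image as the default
clearml-agent daemon --queue default --docker my-image:latest `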
it worked for the other folder, so I assume yes --> I archived /opt/trains/data/mongo, sent the archive via scp, unarchived it, updated the permissions, and now it works
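For the record, the steps looked roughly like this (hostnames are placeholders, and the 1000:1000 owner is an assumption based on the server docs):
` # on the old machine
sudo tar czf mongo.tar.gz -C /opt/trains/data mongo
scp mongo.tar.gz user@new-server:/tmp/
# on the new machine
sudo tar xzf /tmp/mongo.tar.gz -C /opt/trains/data
sudo chown -R 1000:1000 /opt/trains/data/mongo `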
but most likely I need to update the perms of /data as well
You are right, thanks! I was trying to move /opt/trains/data to an external disk, mounted at /data
Ok, now I would like to copy from one machine to another via scp, so I copied the whole /opt/trains/data folder, but I got the following errors:
AgitatedDove14 So I copy-pasted locally the https://github.com/pytorch-ignite/examples/blob/main/tutorials/intermediate/cifar10-distributed.py from the examples of pytorch-ignite. Then I added a requirements.txt and called clearml-task to run it on one of my agents. I adapted the script a bit (removed python-fire since it's not yet supported by clearml).
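The invocation was something like this (project, name and queue are examples):
` clearml-task --project examples --name cifar10-distributed \
  --script cifar10-distributed.py --requirements requirements.txt \
  --queue default `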
So I created a symlink: /opt/trains/data -> /data
Oh nice, thanks for pointing this out!
Yes, but I am not certain how: I just deleted the /data folder and restarted the server
Here I have to do it for each task, is there a way to do it for all tasks at once?
Here is the minimal reproducible example.
Run test_task_a.py - it will register a dummy artifact, create a new task, set a parameter in that task, and enqueue it. test_task_b will then try to retrieve the parameter from the parent task and fail.
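In rough strokes, the two scripts do this (a sketch only; project, parameter and artifact names here are placeholders, not the actual repro):
` # test_task_a.py
from clearml import Task

task_a = Task.init(project_name="debug", task_name="task_a")
task_a.upload_artifact("dummy", artifact_object={"foo": "bar"})  # dummy artifact

task_b = Task.create(project_name="debug", task_name="task_b", script="test_task_b.py")
task_b.set_parent(task_a)                       # make task_a the parent
task_b.set_parameter("General/my_param", "42")  # parameter to read back later
Task.enqueue(task_b, queue_name="default")

# test_task_b.py
from clearml import Task

task = Task.init(project_name="debug", task_name="task_b")
parent = Task.get_task(task_id=task.parent)
print(parent.get_parameter("General/my_param"))  # this lookup is what fails `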
I fixed it, will push a fix in pytorch-ignite 🙂
` --- /data ----------
   48.4 GiB [##########] /elastic_7
    1.8 GiB [          ] /shared
  879.1 MiB [          ] /fileserver
  163.5 MiB [          ] /clearml_cache
   38.6 MiB [          ] /mongo
    8.0 KiB [          ] /redis `