Downloading the artifacts is done only when actually calling get()/get_local_copy()
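For example (rough sketch only; the task id and artifact name are placeholders):
` from clearml import Task

# Placeholder task id and artifact name, for illustration only
task = Task.get_task(task_id="<TASK_ID>")
artifact = task.artifacts["my_artifact"]

# Reading the metadata does not download anything
print(artifact.url, artifact.size, artifact.metadata)

# The actual file is only fetched here
local_path = artifact.get_local_copy() `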
Yes, I rather meant: reproduce this behavior even for getting metadata on the artifacts 🙂
I execute the clearml-agent this way:
` /home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached `
same as the first one described
I am already trying with the latest version of pip 😞
I am sorry to give information that is not very precise, but it’s the best I can do - is this bug happening only to me?
So in my minimal reproducible example, it does work 🤣 very frustrating, I will continue searching for that nasty bug
Oof, now I cannot start the second controller in the services queue on that same second machine, it fails with:
` Processing /tmp/build/80754af9/cffi_1605538068321/work
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1605538068321/work'
clearml_agent: ERROR: Could not install task requirements!
Command '['/home/machine/.clearml/venvs-builds.1.3/3.6/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r'...
CostlyOstrich36 I updated both agents to 1.1.2 and still got the same problem unfortunately. Since I can download the full log file from the Web UI, I guess the agents are reporting correctly?
Could it be that Elasticsearch does not return all the requested logs when it is queried by the WebUI to display them in the console?
Now that I think about it, I remember that on the changelog of the clearml-server 1.2.0 the following is listed:
` Fix UI Workers & Queues and Experiment Table pages ...
Ok, now I get ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))
Hi CostlyOstrich36 , this weekend I took a look at the diffs with the previous version ( https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0# ) and I saw several changes related to the scrolling/logging:
apiserver/bll/event/log_events_iterator.py
apiserver/bll/event/events_iterator.py
apiserver/config/default/services/_mongo.conf
apiserver/database/model/base.py
apiserver/services/events.py
I suspect that one of these changes might be responsible ...
Awesome! (Broken link in migration guide, step 3: https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/ )
No, they have different names - I will try to update both agents to the latest versions
I think it comes from the web UI of clearml-server 1.2.0, because I didn’t change anything else
Hi CostlyOstrich36 , one more observation: it looks like when I don’t open the experiment in the webUI before it is finished, then I get all the logs correctly. It is when I open the experiment in the webUI while it is running that I don’t see all the logs.
So it looks like there is some caching effect: the logs are retrieved only once, when I open the experiment for the first time, and not (or only rarely) afterwards. Is that possible?
I don't think there is an example for this use case in the repo currently, but the code should be fairly simple (below is a rough draft of what it could look like)
` import time

from clearml import Task

controller_task = Task.init(...)
controller_task.execute_remotely(queue_name="services", clone=False, exit_process=True)

while True:
    # Clone the template task and change its parameters if necessary
    periodic_task = Task.clone(template_task_id)
    Task.enqueue(periodic_task, queue_name="default")
    time.sleep(TRIGGER_TASK_INTERVAL_SECS) `
Here I have to do it for each task, is there a way to do it for all tasks at once?
Here is the minimal reproducible example.
Run test_task_a.py - it will register a dummy artifact, create a new task, set a parameter in that task and enqueue it. test_task_b will then try to retrieve the parameter from the parent task and fail.
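Roughly, the two scripts do this (a simplified sketch of the description above with placeholder names, not the exact code):
` # test_task_a.py (simplified sketch)
from clearml import Task

task_a = Task.init(project_name="repro", task_name="test_task_a")
task_a.upload_artifact("dummy", artifact_object={"foo": "bar"})

# Create the second task, set a parameter on it and enqueue it
new_task = Task.create(project_name="repro", task_name="test_task_b", script="test_task_b.py")
new_task.set_parameter("General/my_param", "some_value")
Task.enqueue(new_task, queue_name="default")

# test_task_b.py (simplified sketch)
from clearml import Task

task = Task.init(project_name="repro", task_name="test_task_b")
parent = Task.get_task(task_id=task.parent)
print(parent.get_parameters())  # this is where it fails for me `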
I fixed it, will push a fix in pytorch-ignite 🙂
--- /data ----------
   48.4 GiB [##########] /elastic_7
    1.8 GiB [          ] /shared
  879.1 MiB [          ] /fileserver
  163.5 MiB [          ] /clearml_cache
   38.6 MiB [          ] /mongo
    8.0 KiB [          ] /redis
Hi, /opt/clearml is ~40 MB, /opt/clearml/data is ~50 GB
In all the steps I want to store them as artifacts in S3 because it’s very convenient.
The last step should merge them all, i.e. it needs to know about all the artifacts produced by the previous steps
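Conceptually, the merge step would look something like this (just a sketch; the step task ids and the artifact name are placeholders):
` from clearml import Task

# Placeholder ids of the previous pipeline steps and a placeholder artifact name
step_task_ids = ["<STEP_1_TASK_ID>", "<STEP_2_TASK_ID>", "<STEP_3_TASK_ID>"]

merge_task = Task.init(project_name="pipeline", task_name="merge_step")

partial_paths = []
for task_id in step_task_ids:
    step_task = Task.get_task(task_id=task_id)
    # The artifact lives on S3; this downloads a local copy only when needed
    partial_paths.append(step_task.artifacts["partial_result"].get_local_copy())

# ... merge everything in partial_paths here ... `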
Hi CostlyOstrich36 , there was no DB migration necessary since 1.6, right?
mmmh it fails, but if I connect to the instance and execute ulimit -n, I do see 65535, while the tasks I send to this agent fail with:
OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'
and from the task itself, I run:
` import subprocess
print(subprocess.check_output("ulimit -n", shell=True)) `
which gives me b'1024' in the logs of the task. So nofile is still 1024, the default value, but not when I ssh, damn. Maybe rebooting would help.
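In case it’s useful, the same check without a subprocess (resource is in the standard library, so this is just another way to read the limit from inside the task process):
` import resource

# Soft and hard limits on open file descriptors for the current process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"nofile soft={soft}, hard={hard}") `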
Interesting - I can reproduce easily
Also, what is the benefit of having index.number_of_shards = 1 by default for the metrics and logs indices? Having more shards would allow scaling and later moving them to separate nodes if needed - with the default heap size being 2 GB, that should be possible, no?
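Just to illustrate what I mean by having more shards, something like overriding the setting with an index template (rough sketch only; the endpoint and the events-* pattern are assumptions about my local setup, not something from the clearml docs):
` import requests

# Assumption: Elasticsearch is reachable locally and the event indices match "events-*"
template = {
    "index_patterns": ["events-*"],
    "settings": {"number_of_shards": 2},
}
resp = requests.put("http://localhost:9200/_template/clearml-events-shards", json=template)
resp.raise_for_status() `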
how would it interact with the clearml-server api service? would it be completely transparent?
it would be nice if Task.connect_configuration could support custom yaml file readers for me
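I guess a possible workaround is to load the YAML myself and pass the resulting dict, since connect_configuration also accepts a dict (rough sketch, with plain yaml.safe_load as a stand-in for the custom reader):
` import yaml
from clearml import Task

task = Task.init(project_name="example", task_name="custom-yaml-config")

# Load the file with whatever custom reader is needed (safe_load as a stand-in)
with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Pass the already-parsed dict instead of the file path
config = task.connect_configuration(configuration=config, name="config.yaml") `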
I just checked if something changed in https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_server_config.html#web-login-authentication