Would adding an ILM (index lifecycle management) policy be an appropriate solution?
mmmh it fails. But if I connect to the instance and execute ulimit -n , I do see 65535 , while the tasks I send to this agent fail with:
OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'
and from the task itself, I run:
import subprocess
print(subprocess.check_output("ulimit -n", shell=True))
which gives me in the task logs:
b'1024'
So nofile is still 1024, the default value, inside the task but not when I ssh, damn. Maybe rebooting would work
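For reference, a minimal sketch (not from the thread) of checking and raising the open-file limit from inside the task process itself, using only Python's standard resource module; this only adjusts the soft limit of the current process and cannot exceed the hard limit configured for the agent:
` import resource

# Inspect the open-file limits as seen by this process (what the task actually gets),
# which may differ from what an interactive SSH shell reports.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"nofile: soft={soft} hard={hard}")

# Raise the soft limit to the hard limit, for this process only.
try:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
except ValueError:
    # Cannot exceed the hard limit without elevated privileges.
    pass `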
Could also be related to https://allegroai-trains.slack.com/archives/CTK20V944/p1597928652031300
ok, thanks SuccessfulKoala55 !
here is the function used to create the task:
` def schedule_task(parent_task: Task,
                    task_type: str = None,
                    entry_point: str = None,
                    force_requirements: List[str] = None,
                    queue_name="default",
                    working_dir: str = ".",
                    extra_params=None,
                    wait_for_status: bool = False,
                    raise_on_status: Iterable[Task.TaskStatusEnum] = (Task.TaskStatusEnum.failed, Task.Ta...
And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
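For illustration, a minimal sketch of the rank-0-only option, assuming torch.distributed is already initialized and that ClearMLLogger here refers to the pytorch-ignite contrib handler; the project/task names and the logging interval are placeholders:
` import torch.distributed as dist
from ignite.contrib.handlers.clearml_logger import ClearMLLogger, OutputHandler
from ignite.engine import Events

def attach_logger(trainer):
    # Only the main process creates and attaches the logger; the other ranks skip it,
    # so they do not each open a reporting connection and duplicate the scalars.
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    if rank != 0:
        return None
    clearml_logger = ClearMLLogger(project_name="my-project", task_name="ddp-training")
    clearml_logger.attach(
        trainer,
        log_handler=OutputHandler(tag="training", output_transform=lambda loss: {"loss": loss}),
        event_name=Events.ITERATION_COMPLETED(every=100),
    )
    return clearml_logger `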
Very good job! One note: in this version of the web-server, the experiment logo types are all blank. What was the reason for changing them? Having a color code in the logos helps a lot to quickly check the nature of the different experiment tasks, doesn't it?
Hi NonchalantHedgehong19, thanks for the hint! What should the content of the requirements file be then? Can I specify my local package inside? How?
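For illustration, a hypothetical requirements file could pin the external packages and reference the local package with standard pip syntax (relative path or git URL); the package names below are placeholders, and how the agent resolves relative paths is an assumption here:
` # requirements.txt (hypothetical)
torch==1.7.1
# local package referenced by a relative path, installed from the repo checkout
./my_local_package
# or, alternatively, from a git repository
# git+https://github.com/<org>/<repo>.git#egg=my_local_package `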
I did change the replica setting on the same index, yes; I reverted it back from 1 to 0 afterwards
In the comparison the problem will be the same, right? If I choose last/min/max values, it won't tell me the corresponding values for the other metrics. I could switch to graphs, group by metric and look manually for the corresponding values, but that quickly becomes cumbersome as the number of compared experiments grows
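One possible workaround (a sketch, not an existing comparison feature) is to pull the reported scalars through the SDK and line them up in code; the task ids and the "validation"/"loss" names below are placeholders:
` from clearml import Task

# Fetch the last/min/max scalar values reported by each experiment and print them
# side by side, so the corresponding values of other metrics can be looked up programmatically.
task_ids = ["<task-id-1>", "<task-id-2>"]  # placeholders
for task_id in task_ids:
    task = Task.get_task(task_id=task_id)
    metrics = task.get_last_scalar_metrics()  # {title: {series: {"last": ..., "min": ..., "max": ...}}}
    loss = metrics.get("validation", {}).get("loss", {})
    print(task.name, loss.get("last"), loss.get("min"), loss.get("max")) `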
yes, what happens in the case of installation from pip wheel files?
Sure! Opened https://github.com/allegroai/clearml/issues/568
Alright, I have a follow-up question then: I used the param --user-folder "~/projects/my-project", but any change I make is not reflected in this folder. I guess I am in the docker space, but this folder is not linked to the folder on my machine. Is it possible to do so?
I managed to do it by using logger.report_scalar, thanks!
basically:
` from trains import Task

# Create the "controller" task and attach a small dict as an artifact
task = Task.init("test", "test", "controller")
task.upload_artifact("test-artifact", dict(foo="bar"))

# Clone it, point the clone at a different entry point, and pass the artifact name as a parameter
cloned_task = Task.clone(task, name="test", parent=task.task_id)
cloned_task.data.script.entry_point = "test_task_b.py"
cloned_task._update_script(cloned_task.data.script)
cloned_task.set_parameters(**{"artifact_name": "test-artifact"})

# Send the clone to the default queue
Task.enqueue(cloned_task, queue_name="default") `
I have CUDA 11.0 installed, but on another machine with 11.0 installed as well, trains downloads torch for CUDA 10.1. I guess this is because no wheel exists for torch==1.3.1 and CUDA 11.0
AgitatedDove14 Yes that might work, also the first one (with conda) might work as well, I will give it a try, thanks!
AgitatedDove14 After investigation, another program on the machine consumed all the available memory, most likely causing the OS to kill the agent/task
With a large enough number of iterations in the for loop, you should see the memory grow over time
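As a standalone illustration (not the exact script from this thread), a loop like the following shows that kind of growth when figures are created repeatedly without being closed; psutil is only used here to print the process RSS:
` import os

import matplotlib
matplotlib.use("Agg")  # headless backend, as on an agent
import matplotlib.pyplot as plt
import numpy as np
import psutil

process = psutil.Process(os.getpid())
for i in range(1000):
    fig, ax = plt.subplots()
    ax.plot(np.random.rand(1000))
    # Without an explicit plt.close(fig) here, resident memory keeps growing across iterations.
    if i % 100 == 0:
        print(i, process.memory_info().rss // (1024 * 1024), "MiB") `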
Yes, that was my assumption as well. There could be several causes to be honest, now that I see that matplotlib itself is also leaking
Hi SuccessfulKoala55 , there it is > https://github.com/allegroai/clearml-server/issues/100
Restarting the server ( docker-compose down then docker-compose up ) solved the problem. All experiments are back
So it looks like the agent, from time to time, thinks it is not running an experiment
torch==1.7.1 git+ .
wow if this works that's amazing
I would also like to avoid any copy of these artifacts on S3 (to avoid double costs, since some folders might be big)