
Reputation
Badges 1
970 × Eureka!Or even better: would it be possible to have a support for HTML files as artifacts?
Is it because I did not specify --gpu 0
that the agent, by default pulls one experiment per available GPU?
When an experiment on trains-agent-1 is finished, I see randomly no experiment/long experiment and when two experiments are running, I see randomly one of the two experiments
Some more context: the second experiment finished and now, in the UI, in workers&queues tab, I see randomlytrains-agent-1 | - | - | - | ... (refresh page) trains-agent-1 | long-experiment | 12h | 72000 |
So it looks like the agent, from time to time thinks it is not running an experiment
So two possible cases for trains-agent-1: either:
It picks a new experiment -> show randomly one of the two experiments in the "workers" tab no new experiment in default queue to start -> show randomly no experiment or the one that it is running
Is there a typo in your message? I don't see the difference between what I wrote and what you suggested: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
Note: I can verify that post_packages is well picked up by the trains-agent, since in the experiment log I see:agent.package_manager.type = pip agent.package_manager.pip_version = \=\=20.2.3 agent.package_manager.system_site_packages = true agent.package_manager.force_upgrade = false agent.package_manager.post_packages.0 = PyJWT\=\=1.7.1
so most likely one hard requirement installs the version 2 of pyjwt while setting up the experiment
but the post_packages does not reinstalls the version 1.7.1
Ok, by setting PyJWT==1.7.1
in the setup.py of the experiment pip did not enforced the update
yes -> but I still don't understand why the post_packages didn't work, could be worth investigating
hooo now I understand, thanks for clarifying AgitatedDove14 !
AgitatedDove14 So in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping class I see that some infos are logged (in the __call__
function), and I would like to have these infos logged by clearml
AgitatedDove14 WOW, thanks a lot! I will dig into that 🚀
I can ssh into the agent and:source /trains-agent-venv/bin/activate (trains_agent_venv) pip show pyjwt Version: 1.7.1
here is the function used to create the task:
` def schedule_task(parent_task: Task,
task_type: str = None,
entry_point: str = None,
force_requirements: List[str] = None,
queue_name="default",
working_dir: str = ".",
extra_params=None,
wait_for_status: bool = False,
raise_on_status: Iterable[Task.TaskStatusEnum] = (Task.TaskStatusEnum.failed, Task.Ta...
ho wait, actually I am wrong
The task I cloned from is not the one I though
as for disk space: I have 21Gb available (8Gb used), /opt/trains/data folder is about 600Mo
AppetizingMouse58 After some thoughts, we decided to install from scratch 0.16, with no data migration, because we believe this was an edge case not worth spending efforts on. Thank you very much for your help there, very appreciated. You guys rock! 🙂
I actually need to be able to overwrite files, so in my case it makes sense to give the Deleteobject permission in s3. But for other cases, why not simply catch this error, display a warning to the user and store internally that delete is not possible?