even if I move the GitHub workers internally where they could have access to the prod server, I am not sure I would like that, because it would pile up unnecessary test data in the prod server
Hi SuccessfulKoala55 , How can I know if I am logged in in this free access mode? I assume I am, since on the login page I only see a login field, not a password field
wow if this works that’s amazing
so that any error that could arise from communication with the server could be tested
In my CI tests, I want to reproduce a run in an agent, because the environment changes and some things break in agents but not locally
This is not the case, I downloaded it and I got a CUDA error at runtime
I think the best-case scenario would be for ClearML to maintain a GitHub action that sets up a dummy clearml-server, so that anyone can use it as a basis to run their tests: they would just have to change the URL of the server to the local one started by the GitHub action and could seamlessly test all their code, wdyt?
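To make the idea concrete, here is a minimal sketch of what a test could look like once such a local server is running in the job, assuming the default clearml-server ports and that the SDK picks up the CLEARML_API_HOST / CLEARML_WEB_HOST / CLEARML_FILES_HOST environment variables; the hosts, ports and project/task names are placeholders:

import os

# Point the SDK at the server started inside the CI job instead of the prod one.
# Hosts/ports are placeholders for whatever the GitHub action would expose;
# if the server requires credentials, CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
# would have to be set as well.
os.environ["CLEARML_API_HOST"] = "http://localhost:8008"
os.environ["CLEARML_WEB_HOST"] = "http://localhost:8080"
os.environ["CLEARML_FILES_HOST"] = "http://localhost:8081"

from clearml import Task

def test_task_roundtrip():
    # Create a task against the dummy server and check that basic reporting works.
    task = Task.init(project_name="ci-tests", task_name="smoke-test")
    task.get_logger().report_scalar("loss", "train", value=0.5, iteration=0)
    task.close()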
and I didn't have this problem before because when cu117 wheels were not available, the agent was trying to get the wheel with the closest cu version and was falling back to 1.11.0+cu115, and this one was working
Also I can simply delete the /elastic_7 folder, I don’t use it anymore (I have a remote ES cluster). In that case, I guess I would have enough space?
Hey @<1523701205467926528:profile|AgitatedDove14> , Actually I just realised that I was confused by the fact that when the task is reset, because of the sorting it disappears, making it seem like it was deleted. I think it's a UX issue: when I click on reset,
- The popup shows "Deleting 100%"
- The task disappears from the list of tasks because of the sorting
This led me to think that there was a bug and the task was deleted
Ok AgitatedDove14 SuccessfulKoala55 I made some progress in my investigation:
I can exactly pinpoint the change that introduced the bug: it is the one changing the endpoint "events.get_task_log" to min_version="2.9"
In the Firefox console > Network, I can edit an events.get_task_log request
and change the URL from …/api/v2.9/events.get_task_log
to …/api/v2.8/events.get_task_log
(to use the endpoint "events.get_task_log" with min_version="1.7")
and then all the logs are ...
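For reference, the same endpoint can also be exercised from Python through the SDK's APIClient; this is only a sketch of the call shape (the task ID is a placeholder, and I am assuming the server and credentials come from the local clearml.conf):

from clearml.backend_api.session.client import APIClient

# Uses the server and credentials configured in the local clearml.conf.
client = APIClient()

# Placeholder task ID; the response's "events" field holds the log lines
# (exact fields may vary with the server/API version).
res = client.events.get_task_log(task="0123456789abcdef0123456789abcdef")
print(res.events[:5])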
ok, so even if that guy is attached, it doesn’t report the scalars
ExcitedFish86 I have several machines with different CUDA driver/runtime versions, that is why you might be confused, as I am referring to one or another 🙂
I also ran sudo apt install nvidia-cuda-toolkit
AgitatedDove14 After investigation, another program on the machine consumed all the available memory, most likely causing the OS to kill the agent/task
Downloading the artifacts is done only when actually calling get()/get_local_copy()
Yes, I rather meant: reproduce this behavior even for getting metadata on the artifacts 🙂
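To make the distinction concrete, here is a minimal sketch of the difference between reading artifact metadata and actually downloading the file (the task ID and artifact name are placeholders):

from clearml import Task

# Placeholder task ID; replace with a real one.
task = Task.get_task(task_id="0123456789abcdef0123456789abcdef")

# Looking at the artifacts dict and its metadata does not download anything yet.
artifact = task.artifacts["my_dataframe"]  # placeholder artifact name
print(artifact.url, artifact.metadata)

# The actual download only happens on this call.
local_path = artifact.get_local_copy()
print(local_path)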
I specified a torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl and it didn't detect the link; it tried to install the latest version: 1.6.0
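If it helps to reproduce, a minimal sketch of declaring such a direct-URL requirement from code; whether Task.add_requirements handles a full PEP 508 "name @ url" line this way is an assumption on my part, not something I verified:

from clearml import Task

# Assumption: the full "name @ url" line is passed as the package name.
# Must be called before Task.init() so it ends up in the task's requirements.
Task.add_requirements(
    "torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl"
)

task = Task.init(project_name="demo", task_name="direct-wheel-requirement")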
you mean “docker” was not installed and it did not throw an error?
Yes, docker was not installed on the machine
Yes, you must make sure docker can mount a persistent folder for you to work in.
Ok, it would be nice to have a --user-folder-mounted option that does the linking automatically
I actually need to be able to overwrite files, so in my case it makes sense to grant the DeleteObject permission in S3. But for other cases, why not simply catch this error, display a warning to the user and store internally that delete is not possible?
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached --gpus 1 > ~/trains-agent.startup.log 2>&1
UnevenDolphin73 , task = clearml.Task.get_task(clearml.config.get_remote_task_id())
worked, thanks
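For anyone finding this later, a short sketch of how that line can sit in a script; the fallback to Task.current_task() and the assumption that get_remote_task_id() returns None outside an agent are mine:

import clearml

# When executed by an agent, the remote task ID comes from the config/environment;
# assumed to be None when the script runs locally.
remote_task_id = clearml.config.get_remote_task_id()

if remote_task_id:
    task = clearml.Task.get_task(task_id=remote_task_id)
else:
    # Fall back to the currently initialized task (if any) when running locally.
    task = clearml.Task.current_task()

print(task)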