AgitatedDove14 SuccessfulKoala55 I just saw that clearml-server 1.4.0 was released, congrats! Was this bug fixed in this new version?
clearml doesn't change the matplotlib backend under the hood, right? Just making sure
Disclaimer: I didn't check that this reproduces the bug, but these are all the components that should reproduce it: a for loop creating figures and clearml logging them
Yes, I would like to update all references to the old bucket unfortunately… I think I'll simply delete the old S3 bucket, wait for its name to become available again, recreate it on the other AWS account and move the data there. This way I don't have to mess with clearml data - I am afraid to do something wrong and lose data
For me it is definitely reproducible. But the codebase is quite large, I cannot share it. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm
task = Task.init("Debug memory leak", "reproduce")
def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    ...
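For context, a minimal guess at the elided loop body; the report_matplotlib_figure call and its arguments are my assumption, since the original code isn't shared:
    # hypothetical loop body - not the original code
    task.get_logger().report_matplotlib_figure(
        title="debug plot", series="repro", iteration=i, figure=fig
    )
    plt.close(fig)  # closing the figure here is also an assumption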
Hi SuccessfulKoala55, will I be able to update all references to the old S3 bucket using this command?
If I remove security_group_ids and only keep subnet_id in the configuration, it is not taken into account (the instances are created in the default subnet) - see the boto3 sketch after the log excerpt below
and in the logs:
agent.worker_name = worker1
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.torch_nightly = false
agent.venvs_dir = /...
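To illustrate what "taken into account" means above: a minimal boto3 sketch, assuming the autoscaler ultimately calls EC2's run_instances; every value is a placeholder, and the parameter names follow the EC2 API rather than the clearml configuration keys.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is illustrative
ec2.run_instances(
    ImageId="ami-...",            # placeholder AMI
    InstanceType="g5.xlarge",     # placeholder instance type
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-...",        # if omitted, EC2 falls back to the default subnet
    SecurityGroupIds=["sg-..."],  # if omitted, the default security group is used
)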
With a large enough number of iterations in the for loop, you should see the memory grow over time
Is there one?
No, I rather wanted to understand how it works behind the scenes
The latest RC (0.17.5rc6) moved all logging into a separate subprocess to improve speed with PyTorch dataloaders
That's awesome!
Oh, seems like it is not synced, thank you for noticing (it will be taken care of immediately)
Thank you!
does not contain a specific wheel for cu117 on x86, so they use the default pip one
Yes, so indeed they don't provide support for earlier CUDA versions on the latest torch versions. But I should still be able to install torch==1.11.0+cu115 even if I have cu117. Before, that is what the clearml-agent was doing
I think we should switch back, and have a configuration option to control which mechanism the agent uses, wdyt?
That sounds great!
RuntimeError: CUDA error: no kernel image is available for execution on the device
So the wheel that was working for me was this one: [torch-1.11.0+cu115-cp38-cp38-linux_x86_64.whl](https://download.pytorch.org/whl/cu115/torch-1.11.0%2Bcu115-cp38-cp38-linux_x86_64.whl)
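For reference, that kind of wheel can also be installed by pointing pip at PyTorch's cu115 index (an illustrative command, assuming Python 3.8 on Linux x86_64 as in the wheel name):
pip install torch==1.11.0+cu115 --extra-index-url https://download.pytorch.org/whl/cu115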
So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117. It just happens that, surprisingly, this wheel doesn't work on EC2 g5 instances. Either I'll hardcode the correct wheel or I'll upgrade torch to 1.13.0
This is not the case, I downloaded it and I got a cuda error at runtime
Ha I just saw in the logs:
WARNING:py.warnings:/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A10G with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A10G GPU with PyTorch, please check the instructions at
I wouldn't do it: it is less code to maintain on your side, and honestly too much auto-magic makes it difficult for the user to control the environment (i.e. to understand what happens behind the scenes). I am not sure what switching back would solve; here the wheel should have been correct, it's just that the architecture of the card is incompatible
@<1537605940121964544:profile|EnthusiasticShrimp49> I'll try setting the CUDA version in clearml.conf, thanks for the tip!
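I assume this means something like the following in the agent section of clearml.conf (the exact key and value format is my guess, based on the commented-out example in the default config template; if I recall correctly it can also be set via the CUDA_VERSION environment variable):
agent.cuda_version = 11.5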
@<1523701205467926528:profile|AgitatedDove14> Could you please push the code for that version on github?
Hoo I found:
user@trains-agent-1: ps -ax
5199 ? Sl 29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ? Sl 30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
I see 3 agents in the "Workers" tab