SuccessfulKoala55 For the last 2 hours I have been getting 504 errors and I cannot ssh into the machine. AWS reports that the instance health checks fail. Is it safe to restart the instance?
(Btw, the instance listed in the console has no name, is that normal?)
Still getting the same error, it is not taken into account 🤔
and this works. However, without the trick from UnevenDolphin73, the following won't work (Task.current_task() returns None):
if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()

from clearml import Task
Task.init()
What I mean is that I don't need to have cudatoolkit installed in the current conda env, right?
I edited aws_auto_scaler.py; actually I think it's just a typo, I just need to double the brackets
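To illustrate what I mean by doubling the brackets (this is a hypothetical snippet, not the actual aws_auto_scaler.py code; I'm assuming the startup script is a template rendered with Python's str.format):

# Hypothetical illustration: literal braces in a str.format template must be
# doubled so they are not treated as format fields.
template = "docker run -e QUEUE={queue} bash -c 'echo {{\"status\": \"ok\"}}'"
print(template.format(queue="default"))
# -> docker run -e QUEUE=default bash -c 'echo {"status": "ok"}'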
For me it is definitely reproducible 😄 But the codebase is quite large, so I cannot share it. The gist is the following:
import matplotlib.pyplot as plt
import numpy as np
from clearml import Task
from tqdm import tqdm
task = Task.init("Debug memory leak", "reproduce")
def plot_data():
    fig, ax = plt.subplots(1, 1)
    t = np.arange(0., 5., 0.2)
    ax.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    return fig

for i in tqdm(range(1000), total=1000):
    fig = plot_data()
    ...
ok, now I actually remember why I used _update_requirements instead of add_requirements: the former overwrites all the others, the latter only adds to the already detected packages. Since my deps are listed in the dependencies of my setup.py, I don't want clearml to list the dependencies of the current environment.
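To make the difference concrete, a minimal sketch of how I use them (add_requirements is the public API; _update_requirements is private, so its exact signature may differ between clearml versions, and the package names here are just placeholders):

from clearml import Task

# Public API: appends to the auto-detected requirements.
# Must be called before Task.init().
Task.add_requirements("some-extra-package")  # placeholder package name

task = Task.init(project_name="example", task_name="requirements-demo")

# Private API: replaces the whole requirements list, e.g. to point the agent
# at the package's own setup.py instead of the detected environment.
# (Hypothetical usage; the private method's signature may change.)
task._update_requirements(["-e ."])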
Since it fails on the first machine (the clearml-server), I am trying to run it on another, on-prem machine (also used as an agent)
and in the logs:
agent.worker_name = worker1
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.torch_nightly = false
agent.venvs_dir = /...
interestingly, it works on one machine, but not on another one
I think clearml-agent tries to execute /usr/bin/python3.6 to start the task, instead of using the python that was used to start clearml-agent
CostlyOstrich36, actually this only happens for a single agent. The weird thing is that I have a machine with two GPUs, and I spawn two agents, one per GPU. Both have the same version. For one, I can see all the logs, but not for the other
That's how I would do it, maybe the guys from allegro-ai can come up with a better approach 👍
Can I simply set agent.python_binary = path/to/conda/python3.6?
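For reference, this is the kind of setting I have in mind in clearml.conf (the conda path below is just an example, not my real one):

agent.python_binary = /opt/conda/envs/py36/bin/python3.6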
CostlyOstrich36 yes, when I scroll up, a new events.get_task_log call is fired and the response doesn't contain any logs (but it should)
Ok, deleting the installed packages list worked for the first task
CostlyOstrich36, this also happens with clearml-agent 1.1.1 on an AWS instance…
SuccessfulKoala55 I tried to set up the clearml-agent on a different machine and now I get a different error message in the logs:
Warning: could not locate requested Python version 3.6, reverting to version 3.6
clearml_agent: ERROR: Python executable with version '3.6' defined in configuration file, key 'agent.default_python', not found in path, tried: ('python3.6', 'python3', 'python')
in clearml.conf:
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
I mean, when sending data from the clearml-agents, does it block the training while sending metrics, or is that done in parallel with the main thread?
I actually need to be able to overwrite files, so in my case it makes sense to grant the DeleteObject permission in S3. But for other cases, why not simply catch this error, display a warning to the user, and store internally that delete is not possible?
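Something like this minimal sketch is what I have in mind (hypothetical code just to illustrate the suggestion, not ClearML's actual implementation; bucket and key names are placeholders):

import logging

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
can_delete = True
try:
    # Probe delete on a placeholder object to detect a missing permission.
    s3.delete_object(Bucket="my-bucket", Key="clearml/.delete-probe")
except ClientError as err:
    if err.response["Error"]["Code"] in ("AccessDenied", "403"):
        logging.warning("No s3:DeleteObject permission; overwrite/delete disabled")
        can_delete = False
    else:
        raise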