If I remove security_group_ids and just leave subnet_id in the configuration, it is not taken into account (the instances are created in the default subnet)
and in the logs:
`
agent.worker_name = worker1
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.torch_nightly = false
agent.venvs_dir = /...
With a large enough number of iterations in the for loop, you should see the memory grow over time
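One way to check this kind of growth is to sample traced memory after every iteration. The snippet below is only a minimal sketch using the standard-library tracemalloc module; `measure_growth` and `leaky_step` are hypothetical names made up for illustration, and tracemalloc only sees allocations made through Python's allocator, so growth inside native libraries would not show up here.

```python
import tracemalloc


def measure_growth(step, iterations):
    """Call step() repeatedly and return the traced memory (bytes) after each call."""
    tracemalloc.start()
    samples = []
    for _ in range(iterations):
        step()
        current, _peak = tracemalloc.get_traced_memory()
        samples.append(current)
    tracemalloc.stop()
    return samples


# Hypothetical leaky step: keeps appending to a list that is never cleared.
leaked = []


def leaky_step():
    leaked.append(bytearray(10_000))


samples = measure_growth(leaky_step, 200)
print(f"after first iteration: {samples[0]} bytes, after last: {samples[-1]} bytes")
```

Replacing `leaky_step` with the body of the real loop and comparing the first and last samples gives a rough idea of whether memory keeps growing.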
oh seems like it is not synced, thank you for noticing (it will be taken care of immediately)
Thank you!
does not contain a specific wheel for cuda117 to x86, they use the pip default one
Yes, so indeed they don't provide support for earlier cuda versions on the latest torch versions. But I should still be able to install torch==1.11.0+cu115 even if I have cu117. That is what the clearml-agent was doing before
Hoo I found:
user@trains-agent-1: ps -ax
5199 ? Sl 29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ? Sl 30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
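To spot duplicate agent daemons like these without reading the process list by eye, the ps output can be filtered by command line. This is just a rough sketch: `find_agents` is a hypothetical helper, the sample text is the two lines above, and it assumes the usual `ps -ax` column order (PID, TTY, STAT, TIME, COMMAND).

```python
def find_agents(ps_output, needle="trains_agent"):
    """Return (pid, command) pairs for ps lines whose command contains needle."""
    agents = []
    for line in ps_output.splitlines():
        # Split into PID, TTY, STAT, TIME and the rest of the line (COMMAND).
        fields = line.split(None, 4)
        if len(fields) == 5 and fields[0].isdigit() and needle in fields[4]:
            agents.append((int(fields[0]), fields[4]))
    return agents


sample = """\
5199 ? Sl 29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ? Sl 30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
"""

agents = find_agents(sample)
for pid, command in agents:
    print(pid, command)
print(f"{len(agents)} agent process(es) found")
```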
I see 3 agents in the "Workers" tab
Adding back clearml logging with matplotlib.use('agg') uses more RAM, but nothing that suspicious
TimelyPenguin76, no, I've only set the sdk.aws.s3.region = eu-central-1 param
Yea I really need that feature, I need to move away from keys/secrets to IAM roles
Ping CostlyOstrich36 AgitatedDove14 SuccessfulKoala55 Just making sure this wasn't missed
It could be, yes, but the difference between now and last_report_time doesn't match the difference I observe
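For reference, this is roughly how such a difference can be computed when comparing it against what is observed. It is only a sketch: `seconds_since` is a hypothetical helper, the timestamps are made-up illustration values, and it assumes both values are timezone-aware datetimes (Python refuses to subtract a naive datetime from an aware one).

```python
from datetime import datetime, timezone


def seconds_since(last_report_time, now=None):
    """Seconds elapsed between last_report_time and now (both timezone-aware)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_report_time).total_seconds()


# Made-up timestamps, five seconds apart, purely for illustration.
last_report_time = datetime(2021, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
now = datetime(2021, 1, 1, 12, 0, 5, tzinfo=timezone.utc)

elapsed = seconds_since(last_report_time, now)
print(f"{elapsed} seconds since last report")
```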
SuccessfulKoala55 Am I doing/saying something wrong regarding the problem of flushing every 5 secs? (See my previous message)
I tried removing type=str but I got the same problem
in the UI the value is the correct one (not empty, a string)
Ok, so after updating to trains==0.16.2rc0, my problem is different: when I clone a task, update its script and enqueue it, it does not have any Hyper-parameters/argv section in the UI
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
AgitatedDove14 So what you are saying is that since I have trains-server 0.16.1, I should use trains>=0.16.1? And what about trains-agent? Only version 0.16 is released atm, this is the one I use
Hi TimelyPenguin76 ,
trains-server: 0.16.1-320
trains: 0.15.1
trains-agent: 0.16
Thanks for the explanations,
Yes, that was the case. This is also what I would think, although I double checked yesterday:
I create a task on my local machine with trains 0.16.2rc0
This task calls task.execute_remotely()
The task is sent to an agent running with 0.16
The agent installs trains 0.16.2rc0
The agent runs the task, clones it and enqueues the cloned task
The cloned task fails because it has no hyper-parameters/args section (I can see that in the UI)
When I clone the task manually usin...
I mean that I have a taskA (controller) that is in charge of creating a taskB with the same argv parameters (I just change the entry point of taskB)
This is how I start the agent that is running the two experiments in parallel:
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached
ok, what is the 3.8 release? a server release? how does this number relate to the numbers above?
when can we expect the next self hosted release btw?
I hit F12 to check projects.get_all_ex but nothing is fired, I guess the web ui is just frozen in some weird state
btw CostlyOstrich36, I can see in Profile > Version: 1.1.1-135 • 1.1.1 • 2.14. What do these numbers correspond to?