So the wheel that was working for me was this one: [torch-1.11.0+cu115-cp38-cp38-linux_x86_64.whl](https://download.pytorch.org/whl/cu115/torch-1.11.0%2Bcu115-cp38-cp38-linux_x86_64.whl)
So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117. It just happens that this wheel doesn't work on EC2 g5 instances, surprisingly. Either I'll hardcode the correct wheel or I'll upgrade torch to 1.13.0
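For reference, hardcoding the wheel could look roughly like this (a sketch: Task.add_requirements is a real clearml call, but passing a direct wheel URL as the version string is my assumption; otherwise the same pin can go into the task's requirements.txt):

```python
from clearml import Task

# Sketch: pin the exact cu115 build that worked on the g5 instance, before
# Task.init(), so the agent installs this wheel instead of resolving one itself.
# Passing a PEP 508 direct reference as the "version" here is an assumption.
Task.add_requirements(
    "torch",
    "@ https://download.pytorch.org/whl/cu115/torch-1.11.0%2Bcu115-cp38-cp38-linux_x86_64.whl",
)

task = Task.init(project_name="my-project", task_name="pinned-torch")  # placeholder names
```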
This is not the case, I downloaded it and I got a cuda error at runtime
Ha I just saw in the logs:
WARNING:py.warnings:/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A10G with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A10G GPU with PyTorch, please check the instructions at
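For anyone hitting the same thing, a quick check like this (plain torch calls, nothing ClearML-specific) confirms the mismatch:

```python
import torch

# Compare the GPU's compute capability with the architectures the installed
# torch wheel was built for; on an A10G the capability is (8, 6), i.e. sm_86.
print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
print(torch.cuda.get_arch_list())
```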
I wouldn't do it: it's less code to maintain on your side, and honestly too much auto-magic makes it difficult for the user to control the environment (i.e. to understand what happens behind the scenes). I am not sure what switching back would solve; here the wheel should have been correct, it's just that the card's architecture is incompatible.
@<1537605940121964544:profile|EnthusiasticShrimp49> I'll try setting the cuda version in clearml.conf, thanks for the tip!
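In case it helps someone else, the setting I mean would look something like this in clearml.conf (the exact key name agent.cuda_version is my assumption, check the clearml-agent docs for your version):

```
# clearml.conf (sketch): force the CUDA version the agent uses when it resolves
# the torch wheel, instead of auto-detecting it on the machine.
agent {
    cuda_version: "11.5"
}
```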
@<1523701205467926528:profile|AgitatedDove14> Could you please push the code for that version on GitHub?
Hoo I found:
user@trains-agent-1: ps -ax
5199 ?  Sl  29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ?  Sl  30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
I see 3 agents in the "Workers" tab
Adding back clearml logging with matplotlib.use('agg')
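Roughly like this (a sketch assuming Task.init() was already called earlier in the script, so the figure is picked up by the automatic matplotlib logging):

```python
# Select the non-interactive "agg" backend before pyplot is imported, so the
# headless agent can render figures that ClearML then captures automatically.
import matplotlib
matplotlib.use("agg")

import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4])
plt.title("logged via agg backend")
plt.show()  # no window with agg; the figure is reported to the task instead
```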
, uses more RAM but nothing that suspicious
TimelyPenguin76, no, I've only set the sdk.aws.s3.region = eu-central-1 param
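That is, just this in clearml.conf:

```
# clearml.conf: the only S3-related setting I have in place
sdk {
    aws {
        s3 {
            region: "eu-central-1"
        }
    }
}
```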
Yeah, I really need that feature, I need to move away from keys/secrets to IAM roles
SuccessfulKoala55, this is not the exact corresponding request (I refreshed the tab since then), but the request is an events.get_task_logs, with the following content:
Ping CostlyOstrich36 AgitatedDove14 SuccessfulKoala55 Just making sure this wasn't missed
It could be, yes, but the difference between now and last_report_time doesn't match the difference I observe
SuccessfulKoala55 Am I doing/saying something wrong regarding the problem of flushing every 5 secs? (See my previous message)
I tried removing type=str but I got the same problem
in the UI the value is the correct one (not empty, a string)
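For context, the setup is just the standard argparse auto-binding, something like this (a sketch, the argument name is made up; I'm on trains here, same idea with clearml):

```python
import argparse

from trains import Task

# trains auto-connects argparse at Task.init(), so the parsed value should show
# up under the task's Hyper-parameters in the UI. "--config-path" is a
# hypothetical argument name used only for illustration.
task = Task.init(project_name="debug", task_name="argparse-capture")

parser = argparse.ArgumentParser()
parser.add_argument("--config-path", default="config.yaml")
args = parser.parse_args()
print(args.config_path)
```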
Ok, so after updating to trains==0.16.2rc0, my problem is different: when I clone a task, update its script and enqueue it, it does not have any Hyper-parameters/argv section in the UI
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
AgitatedDove14 So what you are saying is that since I have trains-server 0.16.1, I should use trains>=0.16.1? And what about trains-agent? Only version 0.16 is released atm, this is the one I use
Hi TimelyPenguin76 ,
trains-server: 0.16.1-320
trains: 0.15.1
trains-agent: 0.16
Thanks for the explanations,
Yes that was the case. This is also what I would think, although I double-checked yesterday:
1. I create a task on my local machine with trains 0.16.2rc0
2. This task calls task.execute_remotely()
3. The task is sent to an agent running with 0.16
4. The agent installs trains 0.16.2rc0
5. The agent runs the task, clones it and enqueues the cloned task
6. The cloned task fails because it has no hyper-parameters/args section (I can see that in the UI)
7. When I clone the task manually usin...
I mean that I have a taskA (controller) that is in charge of creating a taskB with the same argv parameters (I just change the entry point of taskB)
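Concretely, the controller does something along these lines (a sketch assuming the Task.clone / Task.enqueue helpers from trains 0.16+; names and queue are placeholders, and the entry-point edit is exactly the part I'm unsure about):

```python
from trains import Task

# taskA (controller): clone itself so taskB inherits the same argv/Args section,
# then push the clone to the default queue. Project/queue names are placeholders.
controller = Task.init(project_name="my-project", task_name="taskA-controller")

task_b = Task.clone(source_task=controller, name="taskB")

# Changing taskB's entry point: the exact call depends on the trains version;
# the line below is an assumed example, not a confirmed API.
# task_b.update_task({"script": {"entry_point": "task_b.py"}})

Task.enqueue(task_b, queue_name="default")
```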
This is how I start the agent that is running the two experiments in parallel:
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached
ok, what is the 3.8 release? a server release? how does this number relate to the numbers above?
when can we expect the next self hosted release btw?