
MelancholyChicken65 which clearml-serving version are you using ? (I believe this issue was fixed in 1.2)
GiganticTurtle0 is it in the same repository ?
If it is, it should have detected that it needs to analyze the entire repository (not just the standalone script) and then discover tensorflow
@<1699955693882183680:profile|UpsetSeaturtle37> good progress; regarding the error, 0.15.0 is supposed to be out tomorrow, it includes a fix for that one.
BTW: can you run with --debug ?
Hi @<1576381444509405184:profile|ManiacalLizard2>
Yeah that should work, assuming credentials are set in your clearml.conf
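For reference, the relevant clearml.conf section looks roughly like this (the server URLs below are the hosted defaults, swap in your own, and the keys are placeholders):
```
api {
    web_server: https://app.clear.ml
    api_server: https://api.clear.ml
    files_server: https://files.clear.ml
    credentials {
        "access_key" = "YOUR_ACCESS_KEY"
        "secret_key" = "YOUR_SECRET_KEY"
    }
}
```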
BoredHedgehog47 you need to make sure "<path here>/train.py" also calls Task.init (again no need to worry about calling it twice with different project/name)
The Task.init call will make sure the auto-connect works.
BTW: if you do os.fork there is no need for the Task.init call; the main difference is that Popen starts a whole new process, and we need to make sure the newly created process is auto-connected as well (i.e. calling Task.init)
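To illustrate, a minimal sketch of the Popen case (the file and project names here are just examples):
```
# parent.py
from clearml import Task
import subprocess

task = Task.init(project_name="examples", task_name="example")
# Popen starts a whole new process, so the spawned script must call Task.init itself
subprocess.Popen(["python", "train.py"]).wait()

# train.py (the spawned script)
from clearml import Task
# safe to call again, it attaches to the same run
task = Task.init(project_name="examples", task_name="example")
```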
Have to get glue setup, which I couldn't understand fully, so that's a different topic
I suggest using the apply-template setup (basically you provide a Job/Service template, and it uses that to set up k8s jobs based on the Tasks coming in from the specific queue)
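A rough sketch of what the glue looks like in code, based on clearml-agent's k8s glue example (the class name is real, but the argument names here are from memory, so double-check against k8s_glue_example.py in the clearml-agent repo):
```
from clearml_agent.glue.k8s import K8sIntegration

# template_yaml_file is assumed to point at your Job/Pod template
k8s = K8sIntegration(template_yaml_file="job_template.yaml")
# poll the queue and spawn a k8s job for every incoming Task
k8s.k8s_daemon("k8s_queue")
```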
Done!
Thanks
fatal: unable to find a suitable socket path; use --socket
I think that's the root cause, we should probably also add https://github.com/allegroai/trains-agent/issues/16
The only thing that's missing is some plots on the ClearML server (app): when I go to the details of the training I cannot see the confusion matrix, for example (but it exists on the bucket)
How do you report the "confusion matrix" ? (I might have an idea about what the difference is)
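For reference, the explicit way to report one is through the Logger (a sketch, the title/series/matrix here are made up):
```
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="confusion matrix")
matrix = np.random.randint(0, 100, size=(4, 4))  # stand-in for your real matrix
task.get_logger().report_confusion_matrix(
    title="confusion", series="val", matrix=matrix, iteration=0)
```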
SweetGiraffe8 no need to import it, any report to TB is automatically logged by ClearML 🙂
And if I create myself a Pro account
Then you have the UI and implementation of both AWS & GCP autoscalers, am I missing something?
I think what you need is to create an OutputModel, then call update_weights when you have a better model; this will also allow you to tag the model object. Would that help? Or would it make sense to use Task.models and count on the auto-logging?
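Something along these lines (a sketch; the file name and tag are placeholders, and the attribute names are from memory):
```
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="best model")
output_model = OutputModel(task=task, name="best")

# whenever you get a better checkpoint:
output_model.update_weights(weights_filename="best_model.pt")  # uploads & registers the file
output_model.tags = ["best"]  # tag the model object
```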
Hi QuaintPelican38
Assuming you have opened the default SSH port 10022 on the EC2 instance (and assuming the AWS permissions are set so that you can access it), you need to use the --public-ip
flag when running clearml-session. Otherwise it "thinks" it is running on a local network and registers itself with the local IP. With the flag on, it gets the public IP of the machine, and then the clearml-session running on your machine can connect to it.
Make sense ?
SubstantialElk6 could you add a GitHub issue to set the direct url for vscode as a parameter to clearml-session?
We already have --vscode-version
we could either extend it to include a direct url, or add a new argument.
wdyt ?
Hmm, any suggestion on making it more visible in the interface ? (I mean deleting the cache file is always a solution, but it sounded quite painful to debug, hence the question)
@<1538330703932952576:profile|ThickSeaurchin47> can you try the artifacts example:
None
and in this line do:
task = Task.init(project_name='examples', task_name='Artifacts example', output_uri="<your files server or storage URI>")
PanickyMoth78 'tensorboard_logger' is an old deprecated package that was meant to create TB events without TB; it was created before TB became a separate package. Long story short, it is not supported. That said, if you just run the same code and replace tensorboard_logger with tensorboard, you should see all scalars in the UI
Background:
ClearML logs TB events as they are created, in real time. tensorboard_logger is not TB; it creates events and dumps them directly into a TB-equivalent event file
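For example, a sketch of the swap (assuming the scalars were logged with tensorboard_logger's log_value):
```
from clearml import Task
from torch.utils.tensorboard import SummaryWriter  # or tensorboardX / tf.summary

task = Task.init(project_name="examples", task_name="tb scalars")
writer = SummaryWriter(log_dir="./runs")
for step in range(100):
    # was: tensorboard_logger.log_value("loss", loss, step)
    writer.add_scalar("loss", 1.0 / (step + 1), step)  # picked up automatically by ClearML
writer.close()
```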
Hi WackyRabbit7 ,
Running in Docker mode gives you greater flexibility in terms of environment control, from switching CUDA versions to pre-compiled packages that are needed (think apt-get), etc. Specifically for DL, if you are using multiple tensorflow versions, they are notorious for being compiled against a specific CUDA version, and the only easy way to switch between them is different dockers. If you are a PyTorch user, then you are in luck, they have all the pytorch ver...
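For example, spinning up the agent in docker mode with a default image (the image here is just an illustration, any docker image works):
```
clearml-agent daemon --queue default --docker nvidia/cuda:11.8.0-runtime-ubuntu22.04
```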
I did change the
instead of 8080?
So this is the issue
Hi BoredGoat1
from this warning: "TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring" it seems trains failed to load the NVIDIA .so library that does the GPU monitoring:
This is based on pynvml, and I think it is trying to access "libnvidia-ml.so.1"
Basically saying, if you can run nvidia-smi from inside the container, it should work.
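A quick way to verify what pynvml sees from inside the container (a sketch):
```
import pynvml

pynvml.nvmlInit()  # raises NVMLError if libnvidia-ml.so.1 cannot be loaded
print(pynvml.nvmlDeviceGetCount(), "GPU(s) visible")
pynvml.nvmlShutdown()
```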
I get what you're saying. Only problem is that in the case of auto-logging, I don't have the model id for the model being saved.
Task.models['output'] should return all the model objects the autologging created
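i.e. something like (a sketch):
```
from clearml import Task

task = Task.current_task()
# auto-logging registers every saved checkpoint as an output model
for model in task.models["output"]:
    print(model.id, model.name)
```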
SillyPuppy19 yes you are correct, actually I can promise you the callback will be called from a different thread (basically the monitoring thread), so it's on the user to make sure the callback can handle it.
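For example, a minimal thread-safe pattern (the names here are illustrative):
```
import threading

_lock = threading.Lock()
_best = {}

def my_callback(*args, **kwargs):
    # may be invoked from the monitoring thread, so guard any shared state
    with _lock:
        _best["latest"] = args
```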
How about we move this discussion to GitHub?
You might only see it when the upload is done
2023-02-15 12:49:22,813 - clearml - WARNING - Could not retrieve remote configuration named 'SSH'
This is fine, it means it uses the default identity keys
The thing is - when I try to connect with normal SSH there are no issues
Now I'm lost, so when exactly do you see the issue ?