Hmm, could you try to upload to your files server (not the S3)?
Maybe some credentials error?
BTW: the cloning error is actually the wrong branch; if you take a look at your initial screenshot, you can see the line before last: branch='default'
which I assume should be branch='master'
(The error itself is still weird, but I assume that this is what git is returning)
sdk.conf will add it to the default loaded values (as I think you deduced).
Can you copy paste the sdk.conf here? (maybe something is missing there?)
or by trains
We just upload the image as is ... I think this is a SummaryWriter issue
I want to use services queue for running services, and I want to do it on k8s
So yes, as a standalone pod with the agent in venv mode (as opposed to docker mode)
Does that make sense to you?
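For illustration, a minimal sketch of running such a standalone agent pod in venv mode (venv is the default when no --docker flag is passed; the queue name is from this thread):
```
# inside the pod, pull and execute tasks from the services queue
clearml-agent daemon --queue services
```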
SmoothArcticwolf58 could you copy paste the entire query, and what are the expected results vs. reality?
You mean for running a worker? (I think plain vanilla python / ubuntu works)
The only change would be pip install clearml / clearml-agent ...
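Something like this sketch, assuming any plain base image (the image tag is a placeholder):
```
# starting from e.g. a vanilla python:3.9 or ubuntu image, the only setup is:
pip install clearml clearml-agent
```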
Hmm so I guess the actual code adds it into the reporting itself ...
How about we call: task.set_initial_iteration(0)
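A minimal sketch of where that call would go (project/task names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="continued-run")  # placeholder names
# reset the iteration offset so newly reported scalars start from 0
task.set_initial_iteration(0)
```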
Hi DilapidatedDucks58
apologies, this thread slipped away.
I double checked, the server will not allow you to overwrite it (meaning a fix would require releasing a server version, which usually takes longer).
That said, maybe we can pass an argument to Task.init so it ignores it? wdyt?
Okay, I was able to reproduce. This will only happen if you are running from a daemon process (like in the case of a process pool); Python is sometimes very picky when it comes to multi-threading/processes. I'll check what we can do 🙂
Hi @<1687643893996195840:profile|RoundCat60>
Are you running on AWS?
Hi JollyChimpanzee19
What are the versions (clearml, TF, PT)? Also, could you add one more line from the stack trace (i.e. which call triggered the exception)?
That's the question I want to raise too.
No file size limit
Let me try to run it myself
Hi @<1533620191232004096:profile|NuttyLobster9>
First, nice workaround!
Second, could you send the full log? When the venv is skipped, pytorch resolving should be skipped as well, and no error should be raised...
And lastly, could you also send the log of the task that executed correctly (the one you cloned)? Because you are correct, it should have been the same
(Also, I'm a bit new to this world, what's wrong with OpenShift?)
It's the most difficult Kubernetes flavor to work with 🙂
We've already tried that but it didn't really change ...
Can you provide the full log, as well as how you created the pods?
ouch, I think you are correct, can you test a fix?
However, if I want multiple machines syncing with the optimizer, pulling the sampled hyperparameters and reporting results, I can't see how it would work
I have to admit, this is where I'm losing you.
I thought you wanted to avoid the agent, since you wanted to run everything locally, wasn't that the issue?
Maybe there is some background missing here, let me see if I can explain how the optimizer works.
In your actual training code you have something like: params = {'lr': 0.3, ...
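Filling this out, a rough sketch of the pattern being described (names and values beyond 'lr' are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="base-training")  # placeholder names
params = {'lr': 0.3}
# connect() registers the dict as hyperparameters; when the optimizer clones this
# base task it overrides these values, and the agent injects them back at runtime
params = task.connect(params)
print(params['lr'])  # the sampled value under an agent, 0.3 when run locally
```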
Hi DeterminedToad86
I just verified on a clean sagemaker instance, everything should just work, see here: https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution
Yes, if you have more than one file (either notebook or python script) then you must have a git repo, in order to run the task using the Agent.
FriendlySquid61 could you help?
Simple git clone on that repo works well
On the machine running the trains-agent?
Hmm, I really like this one:
https://chart-studio.plotly.com/~empet/14632/plotly-joyplotridgelines/#plot
What I'm thinking is a global setting basically telling the TB binding layer to always do ridgeline instead of 3d surface.
wdyt?
Will using Model.remove completely delete from storage as well?
Correct, see the argument delete_weights_file=True
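A minimal sketch of that call (the model id is a placeholder; the argument is the one named above):
```python
from clearml import Model

model = Model(model_id="<your-model-id>")  # placeholder id
# remove the model entry and also delete the weights file from storage
Model.remove(model, delete_weights_file=True)
```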
Hi JitteryCoyote63
Somehow I thought it was solved 🙂
1) Yes, please add a GitHub issue so we can keep track
2) Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
Is this the main issue?
Ok, no, it only helps as long as I don't log the figure.
You mean if you create the matplotlib figure with no automagic connect, you still see the mem leak?
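For example, a minimal repro sketch without any ClearML auto-logging (the loop count is arbitrary):
```python
import matplotlib
matplotlib.use('Agg')  # headless backend, no GUI involved
import matplotlib.pyplot as plt

# create and close many figures; if memory still grows here, the leak is
# in matplotlib itself rather than in the automagic binding
for i in range(1000):
    fig, ax = plt.subplots()
    ax.plot(range(10))
    plt.close(fig)
```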
You might be able to write a script to override the links ... wdyt?
The easiest is to pass an entire trains.conf file
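For reference, a minimal sketch of what such a trains.conf could contain (all URLs and keys are placeholders):
```
api {
    web_server: http://localhost:8080
    api_server: http://localhost:8008
    files_server: http://localhost:8081
    credentials {
        access_key: "YOUR_ACCESS_KEY"
        secret_key: "YOUR_SECRET_KEY"
    }
}
```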
Hi LackadaisicalOtter14
Is it possible to remove this line to stop it from being executed?
Everything is possible 🙂 I think the main question is why it is there (which, to the best of my understanding, is to solve for any cuda drivers and installed packages, meaning anything that is installed at runtime)
I think we can suppress the error, wdyt?
'echo "ldconfig" 2>/dev/null >> /etc/profile && '
When I passed specific arguments (for example --steps) it ignored them...
script.py test blah1 blah2 blah3 42
Is this how it is intended to be used?
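To make the question concrete, a hypothetical sketch of a parser invoked that way (everything except --steps is my assumption):
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--steps', type=int, default=10)
parser.add_argument('args', nargs='*')  # would capture: test blah1 blah2 blah3 42
parsed = parser.parse_args()
print(parsed.steps, parsed.args)
```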