Thanks GrievingTurkey78!
It seems that under the hood they use argparse.
See here:
https://github.com/google/python-fire/blob/c507c093fa6622ab5efee21709ffbf25974e4cf7/fire/parser.py
Which means it might just work?!
What do you think?
Hmm there was this one:
https://github.com/allegroai/clearml/commit/f3d42d0a531db13b1bacbf0977de6480fedce7f6
Basically it was always caching steps (hence the skip). You can install from the main branch to verify this is the issue; an RC is due in a few days (it was already supposed to be out but got a bit delayed).
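For example, installing from the main branch would look something like this (assuming the standard GitHub repo URL):
pip install -U git+https://github.com/allegroai/clearml.git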
I think you are correct, None values should be listed as empty values, not the string "None".
What's the clearml version you are using? And could you retest with the latest RC?
We actually plan to create different queues for different types of workloads; we are still watching the actual usage to define which workload types make sense for us.
That sounds like a great path to take, it will make it very clear for users what they will be getting and why they should use a specific queue.
As for the memory, yes the reasoning is clear. The main thing we'll have to figure out is how to define the limits, because we have nodes with quite different resources available...
ChubbyLouse32 and this works when running the Python code directly, but not when the agent is running it?
On the same machine ?
Hi CurvedDolphin95
I would first check the free space on the instance (it might be that git is reporting an inaccurate error and it's free space not permission that causing it to fail the clone).
I would also check your GitHub account; notice that they now only support user/api-key (and not user/pass), which means you need to create an api-key and use it as your password in clearml.conf.
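In clearml.conf it would look something like this (a sketch; the token is a placeholder for the api-key you create on GitHub):
agent {
    git_user: "your-github-username"
    git_pass: "your-personal-access-token"
}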
Any chance that for some reason some of the Tasks are running as a different user? Or not using docker?
WickedGoat98 this is awesome! Let me know how I could help 🙂
BTW: I checked regarding the plot comparison, this is a backend issue due to the size of the plot, I was told a fix will be deployed in a day or two.
Do you have two agents pulling from the same queue ?
Maybe one of them is configured differently ?
That depends on the HPO algorithm. Basically they will be pushed based on the limit of "concurrent jobs", so you do not end up exploding the queue. It also might be a Bayesian process, i.e. based on the previous sets of parameters and runs, like how hyper-band works (optuna/hpbandster).
Make sense ?
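For reference, that "concurrent jobs" limit is what caps how many tasks are pushed to the queue at once. A minimal sketch (the task ID, parameter range, and metric names are placeholders):
from clearml.automation import HyperParameterOptimizer, UniformParameterRange
from clearml.automation.optuna import OptimizerOptuna

optimizer = HyperParameterOptimizer(
    base_task_id='<template_task_id>',        # placeholder: the task cloned per trial
    hyper_parameters=[
        UniformParameterRange('General/lr', min_value=1e-4, max_value=1e-1),
    ],
    objective_metric_title='validation',      # placeholder metric title
    objective_metric_series='loss',           # placeholder metric series
    objective_metric_sign='min',
    optimizer_class=OptimizerOptuna,          # Bayesian/TPE-style optimizer
    max_number_of_concurrent_tasks=4,         # the "concurrent jobs" limit
    execution_queue='default',
)
optimizer.start()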
Internally we use blob.upload_from_file; it has a default 60-second timeout on the connection (I'm assuming the upload itself could take longer).
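For context, this is roughly what the underlying call looks like with an explicit timeout (a sketch using the google-cloud-storage client; bucket and file names are placeholders):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')           # placeholder bucket
blob = bucket.blob('models/weights.bin')      # placeholder object path
with open('weights.bin', 'rb') as f:
    # the default connection timeout is 60 seconds; raise it for large uploads
    blob.upload_from_file(f, timeout=300)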
DeliciousSeal67
are we talking about the agent failing to install the package ?
MiniatureHawk42 could you re-test with the latest from github?
pip install -U git+
Hi RipeGoose2
Could you expand on "inconsistency in the iteration reporting"? Also, by "calling trainer.fit multiple" times, would you expect it to show as a single experiment, or is it a kind of param search?
If you have CUDA 10.2, then torch 1.3.1 from the cu101 build should work.
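For example (assuming the standard PyTorch wheel index; I have not verified this exact wheel is still hosted):
pip install torch==1.3.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html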
Or did you mean I can couple a short "mini config" with the package and redirect clearml to use this local one (instead of the one at ~/clearml.conf)?
Actually yes, you can set a "fixed" config file and point to it with an environment variable, then set up just the access/secret per user.
wdyt?
(I was also pointing to the fact that you do not have to use clearml-init; you can create a simple partial config template and let the user just fill in the missing "key"/"secret")
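For example, something along these lines (a sketch; CLEARML_CONFIG_FILE points to the shared partial config, and each user supplies only their own credentials):
export CLEARML_CONFIG_FILE=/shared/clearml.conf    # the "fixed" config template
export CLEARML_API_ACCESS_KEY=<user-access-key>    # per-user, placeholder
export CLEARML_API_SECRET_KEY=<user-secret-key>    # per-user, placeholder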
While I'll look into it, you can do:
from clearml import OutputModel
output_model = OutputModel()
output_model.update_weights("best_model.onnx")
Hey GiganticTurtle0 ,
So basically the issue is that the pipeline function ( prediction_service ) is getting a dict as input, and it is expecting to get basic types... If you were to do the following, it would have worked as expected:
prediction_service(**default_config)
I will make sure we flatten any dictionary, so that we end up with config/start instead of a serialized version of the dict.
wdyt?
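To illustrate the difference in plain Python (prediction_service and default_config here are hypothetical stand-ins for the ones in your pipeline):
# a stand-in pipeline function that expects basic types, not a dict
def prediction_service(config=None, start=None):
    print(config, start)

default_config = {'config': 'conf.yaml', 'start': 0}

prediction_service(default_config)    # the whole dict lands in the first argument
prediction_service(**default_config)  # unpacked into keyword arguments, as expected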
Right, you need to pass "repo" and direct it to the repository path
(BTW, what's the clearml version?)
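If this is a pipeline component, a sketch of what I mean (assuming the decorator-based pipeline API; the repo URL and branch here are placeholders):
from clearml import PipelineDecorator

# "repo" can be a git URL or a local repository path
@PipelineDecorator.component(repo='https://github.com/user/repo.git', repo_branch='main')
def my_step():
    return 42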
EnviousPanda91 please feel free to PR if it works 🙂
https://github.com/allegroai/clearml/blob/86586fbf35d6bdfbf96b6ee3e0068eac3e6c0979/clearml/binding/frameworks/catboost_bind.py#L114
Yes! Thanks so much for the quick turnaround
My pleasure 🙂
BTW: did you see this (it seems like the same bug?!)
https://github.com/allegroai/clearml-helm-charts/blob/0871e7383130411694482468c228c987b0f47753/charts/clearml-agent/templates/agentk8sglue-configmap.yaml#L14
I gather there's a distinction between the two, with app.clear.ml being the public cloud-based SaaS version
My apologies SmallDeer34, this is all some legacy domain stuff.
Actually http://app.pro.clear.ml is no longer used (although still up), and will be removed in the future.
SaaS free/pro is the same domain ( http://app.clear.ml ) and the same accounts; the only difference is whether you added a credit card. Other than that it is the same domain and access.
does that make sense ?
MysteriousBee56 yes, please change the trains code! And if you think someone else can benefit, feel free to PR :)
Regarding the double entry, that seems like an odd bug, how can I reproduce it?
Hi TeenyFly97
Can I super-impose the graphs while comparing experiments?
Hmm not at the moment, I think someone asked for the option to control it, in both comparison mode and "standalone" mode.
There is a long discussion on this feature here:
https://github.com/allegroai/trains/issues/81#issuecomment-645425450
Feel free to chime in 🙂
I think that the latest agreement is a switch in the UI, separating or collecting (super-imposing) those graphs.
Thanks FlutteringWorm14 , checking 🙂
Hmm I see, add this for example
extra_docker_shell_script: ["rm ~/.bashrc", "echo removed bashrc"]
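(for reference, that line goes inside the agent section of your clearml.conf, i.e.:)
agent {
    extra_docker_shell_script: ["rm ~/.bashrc", "echo removed bashrc"]
}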
You can always log it manually:
from clearml import InputModel
input_model = InputModel.import_model(weights_url='/tmp/keras_example/weight.6.hdf5')
Hi @<1541592204353474560:profile|GhastlySeaurchin98>
During our first large hyperpameter run, we have noticed that there are some tasks that get aborted with the following console log:
This looks like the HPO algorithm doing early stopping, which algo are you using ?
Back to the error:
clearml_agent: ERROR: Failed getting token (error 401 from ): Unauthorized (invalid credentials) (failed to locate provided credentials)
See here:
https://github.com/allegroai/clearml-server/blob/3f2b96266bc51bfce680bd759c7fa9d635ae36d3/docker/docker-compose.yml#L131
You need to provide an access key so it can actually "talk" to the server next to it.
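In that compose file it is the environment section of the agent-services container; roughly (the key values are placeholders you generate in the web UI):
agent-services:
  environment:
    CLEARML_API_ACCESS_KEY: <your-generated-access-key>
    CLEARML_API_SECRET_KEY: <your-generated-secret-key>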
Do you need to control the CUDA drivers?