BTW: you can still get race/starvation cases... but at least no crash
CheerfulGorilla72
upd: I see NaN in TensorBoard, and 0 in ClearML.
I have to admit, since NaNs are actually skipped in the graph, should we log them at all?
Hi CooperativeFox72, trains 0.16 is out, did it solve this issue? (BTW: you can upgrade trains to 0.16 without upgrading the trains-server)
Yes, I do have a GOOGLE_APPLICATION_CREDENTIALS environment variable set, but nowhere do we save anything to GCS. The only usage is in code that reads from BigQuery.
Are you certain you have no artifacts on GS?
Are you saying that if GOOGLE_APPLICATION_CREDENTIALS is set and clearml.conf contains no "project" section, it crashes when starting?
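For reference, a minimal sketch of the GCS section in clearml.conf (field names here follow the sample config shipped with ClearML; treat them as an assumption and check your own clearml.conf template, the values are placeholders):

```
sdk {
    google.storage {
        # project: "my-gcp-project"                       # placeholder project id
        # credentials_json: "/path/to/credentials.json"   # placeholder path
    }
}
```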
An exporter would be nice, I agree; not sure it is on the roadmap at the moment
Should not be very complicated to implement if you want to take a stab at it.
The class documentation itself is also there under "References" -> "Trains Python Package"
Notice that due to a bug in the documentation (we are working on a fix) the reference part is not searchable in the main search bar
It will also allow you to pass them to Hydra (either as overrides, or by directly editing the entire Hydra config)
Hmm what do you mean? Isn't it under installed packages?
if the first task failed, then the remaining tasks are not scheduled for execution, which is what I expect.
agreed
I'm just surprised that if the first task is aborted by the user instead,
How is that different from failed? The assumption is that if a component depends on another one it needs its output; if it does not, then they can run in parallel. What am I missing?
Ohh then YES!
the Task will be closed by the process, and since the process is inside the Jupyter and the notebook kernel is running, it is still running
ClearML seems to store stuff that's relevant to script execution outside of clearml.Task
Outside of the clearml.Task?
data is going to S3 as well as EBS. Why so? It should only go to S3
This sounds odd; if this is mounted then it goes to S3 (the link will point to the files server, but it will be stored on the mounted drive, i.e. S3)
wdyt?
:param list(str) xlabels: Labels per entry in each bucket in the histogram (vector), creating a set of labels for each histogram bar on the x-axis. (Optional)
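A small sketch of the data shapes this implies (the variable names and values are illustrative; only the shape relationship between the histogram values and xlabels is the point):

```python
# Assumed layout: values is a 2D array, one row per series,
# one column per bucket; xlabels gives one label per bucket/bar.
values = [
    [10, 20, 30],   # series "train" (hypothetical)
    [15, 25, 35],   # series "val" (hypothetical)
]
xlabels = ["bucket A", "bucket B", "bucket C"]  # one label per x-axis bar

# Each series row must have exactly one value per x-axis label.
assert all(len(row) == len(xlabels) for row in values)

# The actual reporting call would look roughly like (requires a live Task,
# so it is commented out here):
# Logger.current_logger().report_histogram(
#     "title", "series", iteration=0, values=values, xlabels=xlabels)
```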
Hi @<1798525199860109312:profile|IntriguedGoldfish14>
Yes the way to do that is just use the custom engine example as you did, also correct on the env var to add catboost to the container
You can of course create your own custom container from the base one and pre-install any required packages, to speed up the container spin-up time
One of the design decisions was to support multiple models from a single container, which means there needs to be one environment for all of them; the main is...
How so? Installing a local package should work, what am I missing?
Hi CynicalBee90
Always great to have people joining the conversation, especially if they are the decision makers, a.k.a. can amend mistakes
If I can summarize a few points here (and feel free to fill in / edit any mistake or leftovers)
Open-source license: This is basically the MongoDB license, which is as open as possible while still offering some protection against the Amazon-scale giants taking the APIs (as they did for both MongoDB and Elasticsearch). Platform & language agno...
Hi @<1523711619815706624:profile|StrangePelican34>
if I am trying to deploy 100 models on a GPU that can handle 5 concurrently,
The main limitation is Triton's ability to dynamically load / unload models. We know Nvidia is adding this capability, but I think it is not out yet; once they support it, it should be transparent
that using a "local" package is not supported
I see, I think the issue is actually pulling the git repo of the second local package, is that correct ?
(assuming you add the requirement manually, with Task.add_requirements) , is that correct ?
Can you do it manually, i.e. check out the same commit id, then take the uncommitted changes (you can copy-paste them into diff.txt) and then call git apply diff.txt ?
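Those manual steps can be sketched as a small script (the repo path, commit id, and diff file name are placeholders; this only illustrates the checkout-then-apply sequence, not what the agent actually runs):

```python
# Sketch: reproduce the "restore environment" steps by hand --
# check out the recorded commit, then re-apply the uncommitted changes
# that were saved into a diff file.
import subprocess

def apply_uncommitted(repo_dir: str, commit_id: str, diff_file: str) -> None:
    """Check out commit_id in repo_dir, then apply the patch in diff_file."""
    # 1. check out the exact commit the task recorded
    subprocess.run(["git", "checkout", commit_id], cwd=repo_dir, check=True)
    # 2. apply the uncommitted changes captured in the diff (e.g. diff.txt)
    subprocess.run(["git", "apply", diff_file], cwd=repo_dir, check=True)
```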
Hi @<1523702786867335168:profile|AdventurousButterfly15>
I am running cross_validation, training a bunch of models in a loop like this:
Use a wildcard, or disable it altogether:
task = Task.init(..., auto_connect_frameworks={"joblib": False})
You can also do
task = Task.init(..., auto_connect_frameworks={"joblib": ["realmodelonly.pkl", ]})
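For illustration, the list form behaves like filename wildcard matching against saved model files; this sketch uses fnmatch only to demonstrate the pattern semantics (the filenames are assumptions based on the example above, not ClearML's actual implementation):

```python
# Sketch of wildcard filtering: only model files whose names match one of
# the listed patterns would be auto-logged; everything else is skipped.
from fnmatch import fnmatch

patterns = ["realmodelonly.pkl"]  # as passed to auto_connect_frameworks

def should_log(filename: str) -> bool:
    """Return True if the saved file matches any filter pattern."""
    return any(fnmatch(filename, p) for p in patterns)

assert should_log("realmodelonly.pkl")       # the final model is logged
assert not should_log("cv_fold_3.pkl")       # intermediate CV models are not
```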
BroadMole98 Awesome, can't wait for your findings
What's the jupyter / notebook version you have?
Also, from within the jupyter notebook, could you send me "sys.argv"?
WackyRabbit7 my apologies for the lack of background in my answer
Let me start from the top: one of the goals of the trains-agent is to reproduce the "original" execution environment. Once that is done, it will launch the code and monitor it. In order to reproduce the original execution environment, trains-agent will install all the needed Python packages, pull the code, and apply the uncommitted changes.
If your entire environment is python based, then virtual-environment mode is proba...
but then the error occurs, after the training and the validation were successfully completed
It seems it is failing on the last eval? Could it be the test set is missing? Is it the same dataset? Can you verify the file is there? (Notice I see a mix of / and \ in the file name, which is odd: Windows uses \ and Linux/macOS use /; you should never have a mix.)
but I still have the problem if I try to run locally for debugging purposes
clearml-agent execute --id ...
Is this still an issue? This is basically the same as the remote execution; maybe you should add --docker (if the agent is running in docker mode)?