Hi EmbarrassedSpider34
Long story (see below) short, yes you can ignore this warning :)
Specifically, torch is spinning up processes and killing them; every process will have a reference to the parent semaphore (for internal clearml bookkeeping). Python is not very good with this kind of thing (and it is getting better on newer python versions), so bottom line, python "thinks" someone lost a semaphore, but in reality the subprocess never created it in the first place. Does that make sen...
It should preserve the order, as the order of the update back (i.e. when executed by the agent) is the same as the order of the keys (obviously py3.7+, because it creates a regular dict, not an OrderedDict, and regular dicts preserve insertion order there)
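For illustration, a rough sketch (project name and parameter values are placeholders) of connecting a parameter dict; on py3.7+ the keys come back in the same order they were inserted:
from clearml import Task

# placeholder project/task names and values, for illustration only
task = Task.init(project_name="examples", task_name="ordered params")
params = {"lr": 0.001, "batch_size": 32, "epochs": 10}
params = task.connect(params)  # updated values come back in the same key order
print(list(params.keys()))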
TroubledHedgehog16 if you have a preinstalled conda env, then why would you need to reinstall it from a yml file? Also, if this is the default python env, clearml-agent will inherit from it and use it (no real overhead there)
Notice the reason for "inheriting" the system python environment is so that the agent can cache the individual Task requirements, meaning next time it will not need to reinstall anything
wdyt?
Hi IrritableJellyfish76
https://clear.ml/docs/latest/docs/references/sdk/task#taskget_tasks
task_name (str) – The full name or partial name of the Tasks to match within the specified project_name (or all projects if project_name is None). This method supports regular expressions for name matching. (Optional)
You are right, this is a bit confusing, I will make sure that we add in the docstring an examp...
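For reference, a rough sketch of what such an example could look like (project name and name pattern are placeholders):
from clearml import Task

# task_name is treated as a regular expression within the given project
tasks = Task.get_tasks(project_name="examples", task_name="training.*v2")
for t in tasks:
    print(t.id, t.name)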
do I need to have the repo that I am running on my account
If it is a public repo, then no need, credentials are only needed for private repos 🙂
Am I missing something ?
Where exactly are the model files stored on the pod?
clearml cache folder, usually under ~/.clearml
Currently I encounter the problem that I always get a 404 HTTP error when I try to access the model via the...
How are you deploying it? I would start by debugging and running everything in the docker-compose (single machine), make sure you have everything running, and then deploy to the cluster
(because on a cluster level, it could be a general routing issue, way before getting t...
What I try to do is that DSes have some lightweight base class, independent of clearml, that they use, and a framework that has all the clearml-specific code. This will allow them to experiment outside of clearml and only switch to it when they are in an OK state. This will also help not to pollute clearml spaces with half-baked ideas
So you want the DS to manually tell the base class what to store?
then the base class will store it for them, for example with joblib
, is this the...
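Something along these lines? A very rough sketch (all class and method names are made up, only upload_artifact is an actual ClearML call):
import joblib

class ExperimentBase:
    # ClearML-agnostic: the DS explicitly tells the base class what to store
    def store(self, name, obj):
        joblib.dump(obj, f"{name}.pkl")

class ClearMLExperiment(ExperimentBase):
    def __init__(self, task):
        self.task = task  # a clearml.Task injected by the framework layer

    def store(self, name, obj):
        path = f"{name}.pkl"
        joblib.dump(obj, path)
        self.task.upload_artifact(name=name, artifact_object=path)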
DefiantHippopotamus88 you can create a custom endpoint and do that, but it will be running in the same instance, is this what you are after? Notice that Triton actually supports it already, you can check the pytorch example
No need, it should auto close it if you started it with Task.init (or the agent executed it)
Usually in the /tmp folder under a temp filename (it is generated automatically when spun up)
In case of the services, this will be inside the docker itself
If a Task is in the 'Completed' state, I think the only option is to 'Reset' it (see image).
In the UI yes, in code you can do task.mark_aborted(force=True)
You do clear the previous run execution but I think for a repetitive task this is fine.
I would avoid that, no?
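i.e. something like this (the task ID is a placeholder):
from clearml import Task

task = Task.get_task(task_id="aabbcc")  # placeholder task ID
task.mark_aborted(force=True)  # as mentioned above, clears the 'Completed' state from code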
@<1532532498972545024:profile|LittleReindeer37> nice!!! 🙂
Do you want to PR? it will be relatively easy to merge and test, and I think that they might even push it to the next version (or worst case quick RC)
BTW updating the values in grafana is basically configuring the heatmap graph, so it is fairly easy to do, just not automatic
I reached over 1M API calls in about one week using clearml-serving
Oh that makes sense now 🙂
If I remember correctly, adding an additional model to a single clearml-serving instance should not actually change the number of API calls; they are mostly affected by the number of clearml-serving instances / containers, not by the number of models.
What about the epochs though? Is there a recommended number of epochs when you train on that new batch?
I'm assuming you are also using the "old" images ?
The main factor here is the ratio between the previously used data and the newly added data; you might also want to resample (i.e. train more on) new data vs old data. Make sense?
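For example, a minimal non-ClearML sketch of oversampling the new data relative to the old (the 3:1 weighting and the list contents are arbitrary assumptions):
import random

old_data = [f"old_{i}" for i in range(1000)]
new_data = [f"new_{i}" for i in range(100)]

pool = old_data + new_data
weights = [1] * len(old_data) + [3] * len(new_data)  # favor the newly added samples
epoch_samples = random.choices(pool, weights=weights, k=len(pool))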
Was going crazy for a short amount of time yelling to myself: I just installed clear-agent init!
oh noooooooooooooooooo
I can relate so much, happens to me too often that copy pasting into bash just uses the unicode character instead of the regular ascii one
I'll let the front-end guys know, so we do not make ppl go crazy 🙂
Thanks GiganticTurtle0 !
I will try to reproduce with the example you provided. Regardless, I already took a look at the code, and I'm pretty sure I know what the issue is. We will be pushing a few fixes after the weekend, I'm hoping this one will be included as well 🙂
Hi PanickyLion56
Yep, savefig also works. You can also do:
from clearml import Logger
Logger.current_logger().report_matplotlib_figure(title="My Plot Title", series="My Plot Series", iteration=10, figure=plt)
https://github.com/allegroai/clearml/blob/0c5d12b830987aa9bb8d44d81e92ff9198008f29/examples/frameworks/matplotlib/matplotlib_example.py#L25
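Putting it together, a minimal self-contained sketch (project/title values are arbitrary):
import matplotlib.pyplot as plt
from clearml import Logger, Task

# placeholder project/task names
task = Task.init(project_name="examples", task_name="matplotlib report")

plt.plot([1, 2, 3], [10, 20, 15])
Logger.current_logger().report_matplotlib_figure(
    title="My Plot Title", series="My Plot Series", iteration=10, figure=plt
)
plt.savefig("my_plot.png")  # savefig is also picked up automatically, as noted above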
An example for something like spacy would be useful for the community.
That's awesome, any chance you can PR something? (no need for it to be perfect, we can take it from there)
Hi ZippyAlligator65
You mean like env vars?
We could use our 8xA100 as 8 workers, for 8 single-gpu jobs running faster than on a single 1xV100 each.
@<1546665634195050496:profile|SolidGoose91> I think that in order to have the flexibility there you need the "dynamic" GPU allocation that is only part of the "enterprise" offering 🙂
That said, why not allocate a single a100 machine? no?
requirements specified with git repo
you mean the requirements.txt is inside the git repo? or do you mean a link to the git repo as part of the requirements?
Can you also provide an example of the content, I think I have an idea
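For reference, if you meant a git repo link inside requirements.txt, pip's VCS syntax usually looks something like this (URL, tag and package name are placeholders):
git+https://github.com/some-org/some-package.git@v1.2.3#egg=some-package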
I think this was the issue: None
And that caused TF binding to skip logging the scalars and from that point it broke the iteration numbering and so on.
SpotlessFish46
yes, you can access the entire code in the uncommitted changes, you can test it with:
task = Task.get_task(task_id='aabb')
task_dict = task.export_task()
2. correct, but then if you need the entire code base you need to clone the repo and apply the uncommitted changes. Basically trains-agent does that when executed with build:
trains-agent build --id aabb --target ~/my_task_env
3. See (2)
(If you are running the trains-agent with the exact same command, I (think) you will get the same worker_id, in which case you will end up with something similar to what you describe)
To solve it add:
TRAINS_WORKER_NAME="new_unique_name" trains-agent ...
I think we resolve it automatically, but based on your description it looks like we use the same worker name/id multiple times ...
Hi @<1544853721739956224:profile|QuizzicalFox36>
http:/34.67.35.46:8081/...
notice there is a / missing in the link, how is that possible? it should be http://
When I do the port forward on my own using ssh -L, it also seems to fail for jupyterlab and vscode, too, which I find odd
The only external port exposed is the SSH one 10022, then the client forwards it locally (so you, the user, can always connect the same way, i.e. "ssh root@localhost -p 8022")
If you need to expose an additional port, while the clearml-session is running, open another terminal and do:
ssh root@localhost -p 8022 -L 10123:localhost:6666
This should po...