But that should not mean you cannot write to them, no?!
Hi @RoundCat60
anyone with access to the server
Is that a thing? If you have access to the server, I'm not sure how "protected" you are even if using a key ring...
(unfortunately I do not think we support anything else, but what did you have in mind?)
oh, if this is the case, why not use the "main" server?
if I encounter the need for that, I will adapt and open a PR
Great!
Hi @ShallowLion60
How did you create those credentials ?
JitteryCoyote63 not yet 🙂
I actually wonder how popular https://github.com/pallets/click is?
ShaggyHare67
Now the trains-agent is running my code but it is unable to import trains ...
What you are saying is that you spin up the trains-agent inside a docker, but in venv mode?
On the server I have both python (2.7) and python3,
Hmm, make sure that you run the agent itself with python3 (i.e. launch trains-agent using python3);
this way it will use python3 for the experiments
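For example (a sketch, assuming trains-agent is installed under python3 and is on the PATH):
python3 $(which trains-agent) daemon --queue default
That forces the agent process itself to start under python3, so the experiment environments it creates will default to python3 as well.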
@JumpyRaven4 fyi clearml-serving was synced
Hi @NuttyLobster9
First, nice workaround!
Second, could you send the full log? When the venv is skipped, pytorch resolving should be skipped as well, and no error should be raised...
And lastly, could you also send the log of the task that executed correctly (the one you cloned)? Because you are correct, it should have been the same.
How can I find the queue name?
You can generate as many as you like; the default one is called "default", but you can add new queues in the UI (go to the Workers & Queues page, then Queues, and click "+ New Queue").
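For example, a minimal sketch of pushing work into a named queue from code (project/task names here are hypothetical):
from clearml import Task

# clone an existing task and enqueue the clone into the "default" queue
template = Task.get_task(project_name="examples", task_name="my experiment")
cloned = Task.clone(source_task=template)
Task.enqueue(cloned, queue_name="default")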
Although it's still really weird how it was failing silently
totally agree, I think the main issue was that the agent had the correct configuration, but the container / env the agent was spinning up was missing it,
I'll double check how come it did not print anything
a task of queue B; if the next task is of type A it will have to wait,
It seems you are implying there are two types of Tasks and that they need to be executed one after the other?
but instead, they cannot be run if the files they produce were not committed.
The thing with git is that if you have new files and you did not add them, they will not appear in the git diff, hence they will be missing when running from the agent. Does that sound like your case?
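For example, a quick way to check (assuming a standard git checkout):
git status --short    # untracked files show up with a "??" prefix
git add my_new_file.py && git commit -m "add generated file"    # file name is hypothetical
Once the files are committed, they become part of what the agent can reproduce.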
Hi ShallowArcticwolf27
First of all:
If the answer to number 2 is no, I'd loveee to write a plugin.
Always appreciated ❤
Now actually answering the Q:
Any torch.save (or any other framework save) will either register or automatically upload the file (or folder) in the system. If it is a folder, it will be zipped and uploaded; if a file, it is just uploaded to the assigned storage output (the clearml-server, any object storage service, or a shared folder). I'm not actually sure I...
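A minimal sketch of that flow (bucket and names here are hypothetical; output_uri=True would use the clearml-server storage instead):
from clearml import Task
import torch

task = Task.init(
    project_name="examples",
    task_name="model upload",
    output_uri="s3://my-bucket/models",
)
model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), "model.pt")  # picked up automatically and uploaded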
Hi @SuccessfulRaven86
Every clearml-serving session (you can have multiple different "sessions") is assumed to be homogeneous; this means it will serve the same models on as many nodes as possible, supporting multiple models per pod.
In your example I think the easiest is to create two serving sessions one with a node selector for the 24GB node and another for the 16GB node, wdyt?
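For example (session names are hypothetical; the node selector itself would be set per deployment, e.g. via the helm chart values):
clearml-serving create --name "serving-24gb"
clearml-serving create --name "serving-16gb"
Then register each model with the session that matches the hardware it needs.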
I see the problem now: conda is failing to install the package from git, then it reverts to pip install, and pip just fails... "//github.com/ajliu/pytorch_baselines"
and this link on its own works?
if it does, open your browser dev tools (Ctrl+Shift+I on Chrome, I think); I'm assuming you will see a few errors on CORS or the like, paste them here
Okay I have an idea, it could be a lock that another agent/user is holding on the cache folder or similar
Let me check something
I didn't realise that pickling is what triggers clearml to pick it up.
No, pickling is the only thing that will Not trigger clearml (it is just too generic to automagically log)
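To make that concrete, a small sketch (names are hypothetical): the plain pickle call below is ignored by the automagic, while upload_artifact registers the object explicitly:
from clearml import Task
import pickle

task = Task.init(project_name="examples", task_name="manual artifact")
data = {"weights": [1, 2, 3]}

with open("data.pkl", "wb") as f:
    pickle.dump(data, f)  # plain pickle: too generic, not auto-logged

task.upload_artifact(name="data", artifact_object=data)  # logged explicitly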
Seems like something is not working with the server, i.e. it cannot connect with one of the dockers.
May I suggest carefully going through all the steps here, making sure nothing was missed:
https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md
Especially number (4)
If we have the time maybe we could PR a fix?!
Hi @FlutteringMouse14
In the latest project I created, Hydra conf is not logged automatically.
Any chance the Task.init call is not in the main script (where the Hydra is)?
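For reference, a minimal sketch of the expected layout (config path/name are hypothetical), with Task.init in the same main script as the Hydra entry point:
import hydra
from omegaconf import DictConfig
from clearml import Task

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    task = Task.init(project_name="examples", task_name="hydra run")
    print(cfg)

if __name__ == "__main__":
    main()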
Seems like Task.create is the correct use-case then, since again this is about testing flows using e.g. pytest,
Makes sense
This seems to be fine for now, ...
Sounds good! Thanks UnevenDolphin73
This will mount the trains-agent machine's hosts file into the docker
Notice the order here:
Task.add_requirements("tensorflow")
task = Task.init(...)
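A fuller sketch of the same pattern (project/task names are hypothetical):
from clearml import Task

Task.add_requirements("tensorflow")  # must be called before Task.init
task = Task.init(project_name="examples", task_name="requirements order")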
Then the only other option is that /tmp is out of space (pip uses it to uncompress the .whl files, then it deletes them)
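If that is the case, one possible workaround (assuming a POSIX shell) is pointing pip's temp directory at a disk with more space:
TMPDIR=/mnt/big_disk/tmp pip install torch    # path is hypothetical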
wdyt?
Hi @GloriousKoala29
Is there a way to aggregate the results, such as defining an iteration as the accuracy of 100 samples?
Hmm, I'm assuming what you actually want is to store it with the actual input/output and a score, is that correct?
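If so, a minimal sketch of reporting one scalar per 100-sample block (the accuracy values here are stand-ins):
import random
from clearml import Task

task = Task.init(project_name="examples", task_name="aggregated accuracy")
logger = task.get_logger()

samples = [random.random() for _ in range(1000)]  # per-sample accuracies (stand-in data)
for i in range(0, len(samples), 100):
    block = samples[i:i + 100]
    logger.report_scalar(
        title="accuracy", series="avg@100",
        value=sum(block) / len(block),
        iteration=i // 100,  # one "iteration" per 100 samples
    )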
Hi @BitingSpider17
Notice that you need __ (double underscore) for converting the "." in the clearml.conf path;
this means agent.docker_internal_mounts.sdk_cache
becomes CLEARML_AGENT__AGENT__DOCKER_INTERNAL_MOUNTS__SDK_CACHE
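For example (the mount path value is hypothetical):
export CLEARML_AGENT__AGENT__DOCKER_INTERNAL_MOUNTS__SDK_CACHE=/clearml_agent_cache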