CheerfulGorilla72
yes, IP-based access,
hmm so this is the main downside of using an IP-based server: the links (debug images, models, artifacts) store the full URL (e.g. http://IP:8081/...). This means that if you switch the IP they will no longer work. Any chance of assigning the new server the old IP?
(the other option is somehow edit the DB with the links, I guess doable but quite risky)
Hi FrothyShark37
is the task scheduler only accessible through the SDK?
yes, in the open source version this is strictly code based. I know the enterprise tier has a UI for it, but in terms of features I believe this is equivalent
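For reference, a rough sketch of what the code-based scheduler looks like (using clearml.automation's TaskScheduler; the task id and queue names below are just placeholders):

from clearml.automation import TaskScheduler

# create the scheduler controller
scheduler = TaskScheduler()

# re-launch an existing (template) task every day at 07:00 into the "default" queue
# "my_template_task_id" and "default" are illustrative values
scheduler.add_task(
    schedule_task_id="my_template_task_id",
    queue="default",
    hour=7,
    minute=0,
)

# run the scheduling loop locally (or push it to the services queue with start_remotely)
scheduler.start()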
Hi @<1687653458951278592:profile|StrangeStork48>
secrets manager per se,
Quick question, are you running the trains-server over http or https ?
We’d be using https in production
Nice 🙂
@<1687653458951278592:profile|StrangeStork48> , I was reading this thread trying to understand what exactly is the security concern/fear here, and I'm not sure I fully understand. Any chance you can elaborate ?
Assuming it was hashed, the seed would be stored on the same server, so knowing both would allow me the same access, no?
Hi @<1687653458951278592:profile|StrangeStork48>
- Agreed,
- Notice this user/pass is only used for the initial authentication, after that all authentication is done via a signed JWT token
How about a GitHub issue with the feature request? If there is enough interest (or someone jumps in offering an implementation) we can push it forward. What do you think?
GiddyTurkey39
I would guess your VM cannot access the trains-server
, meaning actual network configuration issue.
What are the VM IP and the trains-server IP? (the first two numbers are enough, e.g. 10.1.X.Y, 174.4.X.Y)
GiddyTurkey39 Hmm I'm assuming that by default it cannot access that IP range.
Are you using virtual-box for the VM?
EDIT:
Can I assume the machine running the VM (a.k.a. the host) can access the trains-server?
Hi GreasyRaven35
You should set the output_uri in Task.init, it will auto upload the model and register the remote location URL: task = Task.init(..., output_uri=True)
You can also specify a target bucket, if you configured credentials (e.g. output_uri="s3://bucket")
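Something along these lines (project/task names are just placeholders):

from clearml import Task

# auto-upload output models to the clearml files server
task = Task.init(project_name="examples", task_name="training", output_uri=True)

# or, with cloud credentials configured in clearml.conf, point it at your bucket
# task = Task.init(project_name="examples", task_name="training", output_uri="s3://my-bucket/models")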
Yes, experiments are standalone as they do not have to have any connecting thread.
When would you say it's a new "run" vs a new "experiment"? When you change a parameter? Change data? Change code?
If you want to "bucket them" use projects 🙂 it is probably the easiest now that we have support for nested projects.
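Nested projects are just a "/" in the project name, something like this (names are only for illustration):

from clearml import Task

# "image-classification" is the parent project, "resnet50-baseline" the nested one
task = Task.init(project_name="image-classification/resnet50-baseline", task_name="run-001")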
We're lucky that they let the developers see their code...
LOL 😄
and it is also set in /clearml-agent/.ssh/config and it still can't clone it. So it must be some security issue internally.
Wait, are you using docker mode or venv mode? In both cases your SSH credentials should be at the default ~/.ssh
Seems like everything is in order. Can you curl to the API/web/files server?
Maybe you should make naming_function a public variable in the SearchStrategy class, or allow changing it in the HyperParameterOptimizer class?
I like this idea, let's do that
Just making sure, you hit the 1024 character limit on S3 path?
If this is the case we should also fix the "artifact naming" to take that into account (it already does and has a limit, see here:
https://github.com/allegroai/clearml/blob/24464b7c1019f7a7b3149ecb80a379...
GrumpyPenguin23 could you help and point us to an overview/getting-started video?
It's dead simple to install:
pip install trains-agent
then you can simply do:
trains-agent execute --id myexperimentid
Yes, it's fully supported and should work.
Could you share the full execution log?
Okay, we got to the bottom of this. It was actually because of the load balancer timeout setting we had, which was also 30 seconds and was confusing us.
Nice!
btw:
in the clearml.conf we put this:
for future reference, you are missing the sdk section:
sdk.http.timeout: 300
The "." notation works as well as the "{}" section notation.
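i.e. both of these forms should be equivalent in clearml.conf (it is HOCON syntax):

# dot notation
sdk.http.timeout: 300

# nested section notation
sdk {
  http {
    timeout: 300
  }
}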
So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117.
The thing is, the agent used to do all the heavy parsing because pytorch never actually had a pip compatible artifactory
But now they do, so the agent basically passed the parsing to pip and just added the correct additional pytorch pip repo.
It seems we need to switch back... wdyt?
I am not sure what switching back will solve, here the wheel should have been correct, it's just the architecture of the card that is incompatible
So I tested the "old" code that did the parsing and matching, and it did resolve to the correct wheel (i.e. it found that there is no cu117, only cu115, and installed that one)
I think we should switch back, and have a configuration to control which mechanism the agent uses , wdyt?
Any other port that could be open? (if SSH is already open we cannot launch another daemon on the same port)
Hi SarcasticSparrow10
which database services are used to...
Mongo & Elastic
You can query everything using the ClearML interface, or talk directly with the databases.
The full REST API is here:
https://clear.ml/docs/latest/docs/references/api/endpoints
You can use the APIClient for an easier pythonic interface:
See example here
https://github.com/allegroai/clearml/blob/master/examples/services/cleanup/cleanup_service.py
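A minimal pythonic sketch with the APIClient (the project name and filters below are just examples):

from clearml.backend_api.session.client import APIClient

client = APIClient()

# find a project by name and list its completed tasks, newest first
projects = client.projects.get_all(name="examples")
tasks = client.tasks.get_all(
    project=[p.id for p in projects],
    status=["completed"],
    order_by=["-last_update"],
)
for t in tasks:
    print(t.id, t.name, t.status)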
What is the exact use case you have in mind?
Okay, so I can't figure out why it would "kill" the new experiments, I mean it should run them, but is there any "smart stopping" that causes it to kill the process before it ends?
BTW: can this be reproduced with the clearml hydra example ?
Hi BroadMole98
What I think I am understanding about trains so far is that it's great at tracking one-off script runs and storing artifacts and metadata about training jobs, but doesn't replace kubeflow or snakemake's DAG as a first-class citizen. How does Allegro handle DAGgy workflows?
Long story short, yes, you are correct. Kubeflow, and snakemake for that matter, are all about DAGs where each node is running a docker (bash) for you. The missing portions (for both) are:
How do I cr...
BroadMole98 Awesome, can't wait for your findings 🙂