so I think I'm in the right direction
adding verify= and pointing to my CA.pem looks like the right approach
now, how do I use it with ClearML API?
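for context, this is roughly what I'm trying (a sketch; the CA path and task names are placeholders, and I'm assuming the SDK goes through requests and honors the standard CA bundle env vars):
```
import os

# point HTTPS verification at my org CA before the SDK opens any connection
# (assumption: clearml uses requests under the hood and respects these env vars)
os.environ["REQUESTS_CA_BUNDLE"] = "/path/to/CA.pem"   # placeholder path
os.environ["CURL_CA_BUNDLE"] = "/path/to/CA.pem"

from clearml import Task

task = Task.init(project_name="ssl_test", task_name="ca_check")  # placeholder names
```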
cleanup_service
```
for task in tasks:
    try:
        deleted_task = Task.get_task(task_id=task.id)
        print(deleted_task.name)
        deleted_task.delete(
            delete_artifacts_and_models=True,
            skip_models_used_by_other_tasks=True,
            raise_on_error=False,
        )
    except Exception as ex:
        print("failed to delete task {}: {}".format(task.id, ex))
```
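for completeness, this is roughly how I collect the tasks list that feeds the loop above (just a sketch; the project name and the status filter are my own placeholders for what the cleanup should target):
```
from clearml import Task

# grab candidate tasks for cleanup (project name and statuses are placeholders)
tasks = Task.get_tasks(
    project_name="old_experiments",
    task_filter={"status": ["completed", "failed"]},
)
```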
it throws the SSL error, ...
@<1523701435869433856:profile|SmugDolphin23> thanks for the good pointers
it did not work on the first attempt - requests did not validate the certs right
I have added this:
```
token_req = requests.get(api_server + "/auth.login", verify="<my_org_CA>", auth=(access_key, secret_key))
print(token_req)
```
I got back
<Response [200]>
which I believe is good, right?
when adding `token = token_req.json()["data"]["token"]`
I got errors from the json decoder, which I believe is exp...
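for reference, this is what I was aiming for once the JSON actually parses (a sketch continuing from the snippet above; projects.get_all is just an example endpoint, and I'm assuming the usual Bearer-token Authorization header works against the api server):
```
import requests

# exchange the login response for a token and use it on a follow-up API call
token = token_req.json()["data"]["token"]
resp = requests.get(
    api_server + "/projects.get_all",
    headers={"Authorization": "Bearer " + token},
    verify="<my_org_CA>",
)
print(resp.status_code)
```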
@<1523701435869433856:profile|SmugDolphin23> working! here is what I have on Fedora/RHEL
- copy certs to /etc/pki/ca-trust/source/anchors/
- run update-ca-trust
- open the ClearML ports:
```
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
iptables -A INPUT -p tcp --dport 8008 -j ACCEPT
iptables -A INPUT -p tcp --dport 8081 -j ACCEPT
```
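and a quick sanity check I'd run from Python afterwards (a sketch; the URL is a placeholder, and I pass the system bundle explicitly because requests ships its own certifi store and won't automatically pick up update-ca-trust):
```
import requests

# on Fedora/RHEL the consolidated trust bundle is refreshed by update-ca-trust
resp = requests.get(
    "https://my-clearml-server",                   # placeholder URL
    verify="/etc/pki/tls/certs/ca-bundle.crt",     # system bundle path on Fedora/RHEL
)
print(resp.status_code)
```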
hey @<1523701827080556544:profile|JuicyFox94>
standard standalone Linux using compose
oh boy, how much I hate reverse engineering a setup I did not do 😞
I'll dig in more
to be honest, the use case is mostly convenience
when people train ~5000+ experiments, all saved in a few sub folders with long strings as experiment names
before publishing a paper, for example, we want to copy a small number of successful trainings to a separate location and share them with colleagues/management
I'd guess the alternatives can be:
- changing the name of the successful training under the existing sub folder
- using move instead of clone
anything else?
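to make the convenience case concrete, something like this is what I have in mind (a sketch; the project names are made up, and I'm assuming an SDK version that has Task.move_to_project):
```
from clearml import Task

# pick the successful runs and move them into a project we can share
# (Task.clone(source_task=t, ...) would keep the original instead of moving it)
good_tasks = Task.get_tasks(
    project_name="LanguageFactory",                 # made-up source project
    task_filter={"status": ["completed"]},
)
for t in good_tasks:
    t.move_to_project(new_project_name="paper_2023_selected")  # made-up target project
```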
I'm looking at the iptables configuration that was done by other teams
trying to find which rule blocks clearml
(everything worked when iptables was disabled)
I had a slightly similar scenario ~1 year and a few versions back
there was some task that wrote a lot of tasks and mongo didn't take it nicely
I was able to identify it only by questioning users, and eventually one of them stopped sending, mongo started to come back, and everything returned to normal
we did not come to any wise conclusion about the root cause or how to identify this
not sure it's the same use case but I will begin to ask around
if you have any other hint/way to query mongo and look for the potential culprit - I will be glad to hear
api calls behave much better
no problem querying tasks in other projects
I'd guess mongo is choking, not sure why
I think there are some experiments that are messing up mongodb
this log looks unusual in the clearml-mongo logs:
```
{"t":{"$date":"2023-09-19T12:15:50.685+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn73","msg":"Slow query","attr":{"type":"command","ns":"backend.model","command":{"distinct":"model","key":"project","query":{"$and":[{"$or":[{"company":{"$in":["d1bd92a3b039400cbafc60a7a5b1e52b",null,""]}},{"company":{"$exists":false}}]},{"user":{"$in":["197aea8467d3f471fc0db98b57ed80fa"]...
```
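in case it helps anyone else, this is the kind of query I'd run directly against mongo to look for a culprit (a sketch; the db name comes from the log above, while the collection name and connection string are assumptions):
```
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")    # assumed default clearml-mongo port
db = client["backend"]                               # db name taken from the slow-query log

# count documents per project/user to spot whoever is flooding mongo
pipeline = [
    {"$group": {"_id": {"project": "$project", "user": "$user"}, "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]
for row in db["task"].aggregate(pipeline):           # "task" collection name is an assumption
    print(row)
```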
unfortunately I couldn't fix this
the ES state is hectic, can't delete anything
clearml is still live, read-only mode, all existing indices are readable
new jobs can't write to this clearml server
I didn't see anything useful in elastic/mongo/api
I do see significant slowness also when querying my experiments
no filtering for sure
if I send a link to a task, sometimes it loads and sometimes it's stuck
```
app.component.ts:138 ERROR TypeError: Cannot read properties of null (reading 'id')
    at projects.effects.ts:60:56
    at o.subscribe.a (switchMap.js:14:23)
    at p._next (OperatorSubscriber.js:13:21)
    at p.next (Subscriber.js:31:18)
    at withLatestFrom.js:26:28
    at p._next (OperatorSubscriber.js:13:21)
    at p.next (Subscriber.js:31:18)
    at filter.js:6:128
    at p._next (OperatorSubscriber.js:13:21)
    at p.next (Subscriber.js:31:18)
```
and I also see this when trying to...
VivaciousPenguin66 your docs were helpful, I got SSL running but my question remains
have you kept the needed http services accessible and only run the authentication via https?
```
api_server: "http://<my-clearml-server>:8008"
web_server: ""
files_server: "http://<my-clearml-server>:8081"
```
my current state is that the webserver is accessible via http and https, on 8080 & 443
I have another instance with clearml-server 1.7 and I get the same behavior
am I missing anything? I was under the assumption that jobs with the same project/task names should be overwritten and not duplicated
tried it from a single workstation, but I get the same unexpected behavior
same project/task names, same workstation running the job, anything else I should check to confirm those are identical jobs?
every new identical job starts with `ClearML Task: created new task id=...`
```
from clearml import Task

task = Task.init(project_name="Inbar2022/LanguageFactoryDanish/lions_test", task_name="lions3")
```
`python main.py --cuda --epoch 1`
compare between both of the tasks
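besides the UI compare, this is roughly how I'd diff the two runs from the SDK (a sketch; the task ids are placeholders and I'm only checking the script and hyperparams sections):
```
from clearml import Task

a = Task.get_task(task_id="<first_task_id>")    # placeholder ids
b = Task.get_task(task_id="<second_task_id>")

# export_task() returns the full task definition as a dict
for key in ("script", "hyperparams"):
    print(key, "identical:", a.export_task().get(key) == b.export_task().get(key))
```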
SuccessfulKoala55 any clue?
got it, I don't really understand why it happens, quite certain I didn't see this in the past
didn't do that test
I usually wait for the first job to finish before I start a new one
okie so this works only if jobs run in parallel
first job creates a new task id
second job (initiated immediately after the first job) does the reuse properly
if I wait for the first job to finish - then run a new second job with the same name, it will not reuse
is this expected?
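my rough understanding (please correct me) is that the default reuse only applies while the previous run is still reusable, so once it finished a new task id is created; to force picking up the finished run I'd try being explicit in Task.init (a sketch using the same names as above, and the continue_last_task behavior is my assumption):
```
from clearml import Task

# make the reuse behavior explicit instead of relying on the default
task = Task.init(
    project_name="Inbar2022/LanguageFactoryDanish/lions_test",
    task_name="lions3",
    reuse_last_task_id=True,   # default behavior
    continue_last_task=True,   # assumption: continues the previous run even after it finished
)
```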
the application is functional on localhost for sure
yep, again most jobs work .. the issue is when a job tries to upload artifacts to the fileserver
I have tried a small task that only uploads a single file
```
from PIL import Image

# task is the clearml Task.init() handle
logger = task.get_logger()
img = Image.open("./1_model.png").convert("RGB")
logger.report_image(title="cfg_0", series="Model", iteration=1, image=img)
```
ended with:
```
Retrying (Retry(total=0, connect=5, read=5, redirect=5, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)'))': /
202...
```
looks like I can't interact with the fileserver
?
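next thing I'll try is checking what certificate chain the fileserver actually presents, since the error is about a missing local issuer (a sketch; host, port and CA path are placeholders):
```
import socket
import ssl

ctx = ssl.create_default_context(cafile="/path/to/CA.pem")             # placeholder CA path
with socket.create_connection(("my-clearml-server", 8081)) as sock:    # placeholder host/port
    with ctx.wrap_socket(sock, server_hostname="my-clearml-server") as tls:
        # if the handshake succeeds, the served chain verifies against my CA
        print(tls.version(), tls.getpeercert().get("issuer"))
```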