from clearml import Task
task = Task.init(project_name="Inbar2022/LanguageFactoryDanish/lions_test", task_name="lions3")
python main.py --cuda --epoch 1
when running in debug and watching the values, I get
data = response.json()
projects = data['data']['projects']
all_data.extend(projects)
in each loop iteration:
projects are the same 500 values
all_data gets appended with the same 500 values in an endless loop
I have a bug in my code and can't find where just yet
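Getting the same 500 values on every iteration is what you would see if the page number never advances in the request. Here is a minimal sketch of a paging loop, assuming the endpoint accepts page and page-size style parameters. `fetch_page` is a hypothetical stand-in for the real HTTP call, and `fake_fetch_page` only simulates a server holding 1200 projects:

```python
from typing import Callable, Dict, List


def fetch_all_projects(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 500,
) -> List[Dict]:
    """Collect every page until the server returns a short or empty page."""
    all_data: List[Dict] = []
    page = 0
    previous = None
    while True:
        projects = fetch_page(page, page_size)
        if not projects:
            break  # empty page: nothing left to read
        if projects == previous:
            # Guard against the endless loop: identical consecutive pages
            # suggest the page parameter is not reaching the server.
            raise RuntimeError("server returned the same page twice")
        all_data.extend(projects)
        if len(projects) < page_size:
            break  # short page: this was the last one
        previous = projects
        page += 1  # advance the page, otherwise the same data comes back
    return all_data


# Hypothetical stand-in for the real request (assumption, not the ClearML API).
def fake_fetch_page(page: int, page_size: int) -> List[Dict]:
    items = [{"id": str(i)} for i in range(1200)]
    start = page * page_size
    return items[start:start + page_size]


result = fetch_all_projects(fake_fetch_page, page_size=500)
print(len(result))  # 1200
```

The repeated-page check is just a safety net; the actual fix is making sure the page counter changes between requests.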
app.component.ts:138 ERROR TypeError: Cannot read properties of null (reading 'id')
at projects.effects.ts:60:56
at o.subscribe.a (switchMap.js:14:23)
at p._next (OperatorSubscriber.js:13:21)
at p.next (Subscriber.js:31:18)
at withLatestFrom.js:26:28
at p._next (OperatorSubscriber.js:13:21)
at p.next (Subscriber.js:31:18)
at filter.js:6:128
at p._next (OperatorSubscriber.js:13:21)
at p.next (Subscriber.js:31:18)
and I see also when trying to...
you are correct and thank you for the reply @<1523701070390366208:profile|CostlyOstrich36>
going forward, I assume the clearml-server open-source releases will continue to be released on Docker Hub
oh boy, how much I hate reverse engineering a setup I didn't build 😞
I'll dig in more
the application is functional on localhost for sure
to my understanding: failed means that the python job exited non-gracefully, with errors originating from python
what I'm missing is how to interpret aborted vs. stopped
did the user initiate the stop?
or did it come from the system running the job?
I did note STATUS MESSAGE: and STATUS REASON:
it's N/A in many cases, some get a Signal None value, or Forced stop (non-responsive), but I'm not sure how to interpret these fields and what I can learn from them
SDK version: 1.14.4
clearml-server version: Server: 1.14.0-431 • API: 2.28
tried it from a single workstation, but I get the same unexpected behavior
same project/task names, same workstation running the job, anything else I should check to confirm those are identical jobs?
every new identical job starts with ClearML Task: created new task id=...
let me dig in more and hopefully can share successful results
thanks!
okie so this works only if the jobs run in parallel
the first job creates a new task id
the second job (initiated immediately after the first job) does the reuse properly
if I wait for the first job to finish, then run a new second job with the same name, it will not reuse
is this expected?
unfortunately I couldn't fix this
the ES state is hectic, can't delete anything
clearml is still live, read-only mode, all existing indices are readable
new jobs can't write to this clearml server
I didn't see anything useful in elastic/mongo/api
I do see significant slowness when querying my experiments too
no filtering for sure
if I send link to task, sometimes it loads and sometimes it's stuck
in the UI I also see the display name, so I pulled all the users info and matched name to id
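A rough sketch of that name-to-id matching. The response shape (`data.users`, each entry with `id` and `name`) is an assumption here, and the sample values are made up for illustration. Names map to a list of ids because two user records can share the same display name:

```python
from typing import Dict, List

# Made-up sample in the assumed shape of a users listing response.
sample_response = {
    "data": {
        "users": [
            {"id": "a1b2", "name": "Alice Example"},
            {"id": "c3d4", "name": "Bob Example"},
            {"id": "e5f6", "name": "Alice Example"},  # same display name, different id
        ]
    }
}


def build_name_to_ids(response_json: dict) -> Dict[str, List[str]]:
    """Map each display name to every user id that carries it."""
    users = response_json.get("data", {}).get("users", [])
    name_to_ids: Dict[str, List[str]] = {}
    for user in users:
        name_to_ids.setdefault(user["name"], []).append(user["id"])
    return name_to_ids


name_to_ids = build_name_to_ids(sample_response)
print(name_to_ids["Alice Example"])  # ['a1b2', 'e5f6']
```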
VivaciousPenguin66 your docs were helpful, I got SSL running but my question remains
have you kept the needed http services accessible and only run the authentication via https?
api_server: "http://<my-clearml-server>:8008"
web_server: " "
files_server: "http://<my-clearml-server>:8081"
my current state is that the webserver is accessible via http and https, on 8080 & 443
tried with my user and edited existing user record in apiserver.conf
it looks like ClearML treated this as a new user - I did not see any of the jobs that belonged to my user before the change
I'd guess mongo is choking, not sure why
I had a slightly similar scenario ~1 year and a few versions back
there was some task that wrote a lot of tasks and mongo didn't take it nicely
I was able to identify it only by questioning users, and eventually one of them stopped sending and mongo started to come back and all returned to normal
we did not come to any wise conclusion on what the root cause is or how to identify this
yep, again most jobs work .. the issue is when a job tries to upload artifacts to the fileserver
console showed 401 unauthorized when I tried it
I tried again now and it magically popped up 🤔
I'm using an rpm based machine, but I get your direction
put the cert in the right place for python to look for it automatically
can I assume if it works smoothly with requests or urllib3 it will work for the ClearML API?
hey @<1523701827080556544:profile|JuicyFox94>
standard standalone Linux using compose
hey there @<1523701070390366208:profile|CostlyOstrich36>
any chance I get more input on this? anywhere to look in the docs?
I hope you understood what I am looking for
yep that was my approach with no luck so far
hopefully someone from the ClearML dev team can give their input on this
just confirming this with the user and will share it over here
I do recall that in the past the latest version caused this, and downgrading to some prior version fixed the issue
let me get the info and will post back here
10x @<1523701087100473344:profile|SuccessfulKoala55>
looking into ES index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b
docs.count docs.deleted store.size pri.store.size
2118131043 29352476 265.1gb 265.1gb
sounds like we're hitting some ES limitation?
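A quick arithmetic check on those numbers. Lucene caps a single index (one Elasticsearch shard) at 2,147,483,519 documents. Assuming this index has a single primary shard (the output above does not show the shard count) and that deleted docs still count against the cap until they are merged away, the two columns add up exactly to that limit:

```python
# Lucene's hard per-shard document cap: Integer.MAX_VALUE - 128
LUCENE_MAX_DOCS = 2**31 - 129

# Values copied from the index stats above
docs_count = 2_118_131_043
docs_deleted = 29_352_476

total = docs_count + docs_deleted
headroom = LUCENE_MAX_DOCS - total

print(total)            # 2147483519
print(LUCENE_MAX_DOCS)  # 2147483519
print(headroom)         # 0
```

If the single-shard assumption holds, zero headroom would explain why nothing new can be written to this index while existing data stays readable.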
got it, I don't really understand why it happens, quite certain I didn't see this in the past
for some reason it's not in the REST API docs, but I used users.get_all
@<1523701435869433856:profile|SmugDolphin23> thanks for the good pointers
it did not work on the first attempt - requests did not validate the certs right
I have added this:
token_req = requests.get(api_server + "/auth.login", verify="<my_org_CA>", auth=(access_key, secret_key))
print(token_req)
I got back
<Response [200]>
which I believe is good right?
when adding
token = token_req.json()["data"]["token"]
I got errors from json decoder, which I believe is exp...
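A 200 status only says that something answered, not that the body is the JSON you expect. One possibility (an assumption, not confirmed here) is that the request reached a different service, such as a web page, instead of the API server. Below is a small sketch that surfaces what actually came back before indexing into it. `extract_token` is a hypothetical helper, and the `data.token` path is taken from the line above:

```python
import json


def extract_token(status_code: int, body_text: str) -> str:
    """Return data.token from a login response body, or raise with context."""
    if status_code != 200:
        raise ValueError(f"unexpected status {status_code}")
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError as exc:
        # Show the start of the body so an HTML page or proxy error is obvious.
        raise ValueError(f"response is not JSON, starts with: {body_text[:80]!r}") from exc
    try:
        return payload["data"]["token"]
    except (KeyError, TypeError) as exc:
        raise ValueError("JSON response has no data.token field") from exc


# Made-up bodies for illustration only
print(extract_token(200, '{"data": {"token": "abc123"}}'))  # abc123

try:
    extract_token(200, "<html><body>login page</body></html>")
except ValueError as err:
    print(err)
```

With a real response this would be called as `extract_token(token_req.status_code, token_req.text)`.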
