
@<1523701087100473344:profile|SuccessfulKoala55> seen this with 1.13.2
is it worth trying an upgrade to the latest?
yep, again most jobs work .. the issue appears when a job tries to upload artifacts to the fileserver
OK I got everything to work
I think this script can be useful to other people and I will be happy to share it
@<1523701070390366208:profile|CostlyOstrich36> is there some repo I can fork and contribute to?
I think this is the right approach, let me have a deeper look
thanks @<1724235687256920064:profile|LonelyFly9>
trying to use projects.get_all to pull all my projects into a single file
and there are more than 500 ...
ohhh severe error here 🙂
I got mixed up with another API I worked on .. and did not read carefully which flag was the right one
simply adding page to the body did the trick
thanks again @<1724235687256920064:profile|LonelyFly9>
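For anyone hitting the same wall, here is a minimal sketch of the paging loop, assuming the request body takes page and page_size, that pages start at 0, and that the results sit under data['data']['projects'] as in my response. The post callable and fake_post are hypothetical stand-ins so the loop can run without a server.

```python
from typing import Any, Callable, Dict, List

Body = Dict[str, Any]


def fetch_all_projects(post: Callable[[Body], Body], page_size: int = 500) -> List[Body]:
    """Collect every project by requesting one page after another."""
    all_data: List[Body] = []
    page = 0  # assumption: pages are numbered from 0
    while True:
        data = post({"page": page, "page_size": page_size})
        projects = data["data"]["projects"]
        all_data.extend(projects)
        if len(projects) < page_size:
            break  # a short (or empty) page means there is nothing left
        page += 1  # without this increment the same 500 come back forever
    return all_data


def fake_post(body: Body) -> Body:
    """Hypothetical stand-in for the real HTTP call; serves 1234 fake projects."""
    items = [{"id": str(i)} for i in range(1234)]
    start = body["page"] * body["page_size"]
    return {"data": {"projects": items[start:start + body["page_size"]]}}


projects = fetch_all_projects(fake_post)
print(len(projects))
```

In a real script, post would wrap the authenticated HTTP request to projects.get_all and return response.json().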
I'm looking at the iptables configuration that was done by other teams
trying to find which rule blocks clearml
(everything worked with iptables disabled)
looking into ES index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b
docs.count  docs.deleted  store.size  pri.store.size
2118131043  29352476      265.1gb     265.1gb
sounds like we're hitting some ES limitation?
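One limit that fits these numbers is Lucene's hard cap of 2,147,483,519 documents per index, which in Elasticsearch terms means per shard. A quick check, assuming this index has a single primary shard (the output above does not show the shard count) and that deleted documents still occupy slots until they are merged away:

```python
# Lucene's hard cap on documents in one index (one Elasticsearch shard):
# Integer.MAX_VALUE - 128
LUCENE_MAX_DOCS = 2**31 - 1 - 128


def shard_headroom(docs_count: int, docs_deleted: int) -> int:
    """Document slots left before the shard reaches the Lucene cap."""
    return LUCENE_MAX_DOCS - (docs_count + docs_deleted)


# values taken from the index stats above
headroom = shard_headroom(docs_count=2118131043, docs_deleted=29352476)
print(headroom)  # prints 0
```

If the single-shard assumption holds, the shard has no room left for new documents.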
I had a slightly similar scenario ~1 year and a few versions back
there was some task that wrote a lot of tasks and mongo didn't take it nicely
I was able to identify it only by questioning users, and eventually one of them stopped sending and mongo started to come back and all returned to normal
we did not come to any wise conclusion about what the root cause is or how to identify it
I didn't see anything useful in the elastic/mongo/api logs
I do see significant slowness when querying my experiments as well
no filtering for sure
if I send a link to a task, sometimes it loads and sometimes it's stuck
AgitatedDove14 indeed there are a few sub projects
do you suggest to delete those first?
I'd guess mongo is choking, not sure why
thanks @<1523701070390366208:profile|CostlyOstrich36>
I've done this successfully using the API already
as for the sdk option - in which format should I provide the list of tasks/projects to the sdk?
for only_fields=["id", "name", "created", "status_changed", "status", "user"], output example:
{'id': '02a3f5929cf246138994c9243a692219', 'name': 'docfm_v7_safe_32gpu80g_11Jan24_4w', 'created': datetime.datetime(2024, 1, 11, 9, 54, 33, 406000, tzinfo=tzutc()), 'status_changed': dateti...
tried with my user and edited the existing user record in apiserver.conf
it looks like ClearML treated this as a new user - I did not see any of the jobs that belonged to my user before the change
in the UI I also see the display name, so I pulled all the users info, and matched name to id
just confirming this with the user and will share it over here
I do recall in the past that latest version caused this, and downgrading to some prior version fixed the issue
let me get the info and will post back here
10x @<1523701087100473344:profile|SuccessfulKoala55>
we will probably end up pulling the images from docker.io and pushing those to our container registry
@<1523701070390366208:profile|CostlyOstrich36> sorry for not being clear enough
when will the next version of clearml-server be released? I can see the last version is from August, is there any ETA for a new release in the upcoming 1-2 months?
correct, but!
I wrote a script that pulls tasks and limits them per user
so I'm looking for a way for users to know their own id in advance
oh boy, how much I hate reverse engineering a setup I did not build 😞
I'll dig in more
{
  "id": "c2e33466fd9de1a0d2be8d803ad4fbed",
  "company": "d1bd92a3b039400cbafc60a7a5b1e52b",
  "name": "John Doe",
  "family_name": "John",
  "given_name": "Doe",
  "created": "2024-09-24T06:13:59.956000+00:00"
}
this is what the object looks like
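Matching name to id over such records could be sketched like this, assuming the user records were already pulled into a list of dicts with the fields shown above. The helper name ids_by_name and the second record are made up for illustration. Each name maps to a list of ids because a display name is not guaranteed to be unique.

```python
from typing import Dict, Iterable, List


def ids_by_name(users: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each display name to every user id that carries it."""
    mapping: Dict[str, List[str]] = {}
    for user in users:
        mapping.setdefault(user["name"], []).append(user["id"])
    return mapping


users = [
    {"id": "c2e33466fd9de1a0d2be8d803ad4fbed", "name": "John Doe"},
    {"id": "00000000000000000000000000000000", "name": "Jane Roe"},  # made-up record
]
lookup = ids_by_name(users)
print(lookup["John Doe"])
```

With that lookup a user only needs to know their display name, and the script can resolve the id before filtering tasks.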
@<1523701435869433856:profile|SmugDolphin23> working! here is what I have on Fedora/RHEL
- copy certs to /etc/pki/ca-trust/source/anchors/
- run update-ca-trust
@<1523701087100473344:profile|SuccessfulKoala55> looks OK (?)
>>> StorageHelper.get(Task._get_default_session().get_files_server_host())._container.session.verify
InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
True
so I think I'm in the right direction
adding verify= and pointing it to my CA.pem looks like the right approach
now, how do I use it with ClearML API?
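For plain requests calls, the verify= approach could look like the sketch below. It covers requests only and does not answer the ClearML question; the path is a placeholder and make_session is a hypothetical helper name.

```python
import requests


def make_session(ca_bundle_path: str) -> requests.Session:
    """Return a session that validates server certificates against a custom CA file."""
    session = requests.Session()
    # A string here is treated as the path to a CA bundle;
    # True would mean "use the default trust store".
    session.verify = ca_bundle_path
    return session


session = make_session("/path/to/CA.pem")  # placeholder path
# Every request made through this session now checks the server
# certificate against that CA file, e.g.:
# session.get("https://files.example.internal/")  # hypothetical host
```

Setting verify on the session avoids repeating verify= on every call.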
cleanup_service
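from clearml import Task

# `tasks` is assumed to be the list returned by the script's earlier query
for task in tasks:
    try:
        deleted_task = Task.get_task(task_id=task.id)
        print(deleted_task.name)
        deleted_task.delete(
            delete_artifacts_and_models=True,
            skip_models_used_by_other_tasks=True,
            raise_on_error=False,
        )
    except Exception as ex:
        # the pasted snippet was cut off before the except clause; report and move on
        print(f"Failed to delete task {task.id}: {ex}")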
it throws the SSL error,...
let me dig in more and hopefully I can share successful results
thanks!
yep that was my approach, with no luck so far
hopefully someone from the ClearML dev team can give their input on this
when running in debug and watching the values, I get
data = response.json()
projects = data['data']['projects']
all_data.extend(projects)
in each loop iteration projects holds the same 500 values
and all_data gets the same 500 values appended in an endless loop
I have a bug in my code and can't find where just yet
I know about the 500 limit and I'm using it
but my while loop keeps pulling the same 500 ... and runs endlessly
hey there @<1523701070390366208:profile|CostlyOstrich36>
any chance I can get more input on this? anywhere to look in the docs?
I hope you understood what I am looking for