@<1523701087100473344:profile|SuccessfulKoala55> looks OK (?)
>>> StorageHelper.get(Task._get_default_session().get_files_server_host())._container.session.verify
InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
True
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
iptables -A INPUT -p tcp --dport 8008 -j ACCEPT
iptables -A INPUT -p tcp --dport 8081 -j ACCEPT
I'm looking at the iptables configuration that was done by other teams
trying to find which rule blocks clearml
(everything worked when iptables was disabled)
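One way to narrow down which rule is blocking traffic is to probe the three ports from a client machine after each iptables change. This is only a minimal sketch: `port_open` is a hypothetical helper, and the host name in the commented usage is a placeholder, not something from this thread.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example usage (placeholder host name, replace with the real server):
# for name, port in (("webserver", 8080), ("apiserver", 8008), ("fileserver", 8081)):
#     print(name, port, "open" if port_open("clearml.example.internal", port) else "blocked")
```

If a port reports blocked only while iptables is enabled, the rule affecting it is likely evaluated before the ACCEPT lines above, since `-A` appends to the end of the chain.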
I have another instance with clearml-server 1.7 and I get the same behavior
am I missing anything? I was under the assumption that jobs with the same project/task names should be overwritten and not duplicated
SuccessfulKoala55 where does the SDK store the cache?
we have a cluster with shared storage, so all compute nodes that run the jobs have the same storage
should I assume it will use the cache and overwrite identical jobs?
trying to reproduce this, but every new identical job still gets a new task ID
let me dig in more and hopefully can share successful results
thanks!
yep, again most jobs work .. the issue is when a job tries to upload artifacts to the fileserver
I had a slightly similar scenario ~1 year and a few versions back
there was some job that wrote a lot of tasks and mongo didn't take it nicely
I was able to identify it only by questioning users; eventually one of them stopped sending, mongo started to come back, and everything returned to normal
we did not come to any wise conclusion about what the root cause is or how to identify it
unfortunately I couldn't fix this
the ES state is hectic, can't delete anything
clearml is still live, read-only mode, all existing indices are readable
new jobs can't write to this clearml server
I didn't see anything useful in the elastic/mongo/api logs
I do see significant slowness when querying my experiments too
no filtering for sure
if I send link to task, sometimes it loads and sometimes it's stuck
AgitatedDove14 indeed there are a few sub projects
do you suggest deleting those first?
I'd guess mongo is choking, not sure why
app.component.ts:138 ERROR TypeError: Cannot read properties of null (reading 'id')
at projects.effects.ts:60:56
at o.subscribe.a (switchMap.js:14:23)
at p._next (OperatorSubscriber.js:13:21)
at p.next (Subscriber.js:31:18)
at withLatestFrom.js:26:28
at p._next (OperatorSubscriber.js:13:21)
at p.next (Subscriber.js:31:18)
at filter.js:6:128
at p._next (OperatorSubscriber.js:13:21)
at p.next (Subscriber.js:31:18)
and I see also when trying to...
okie so this works only if jobs run in parallel
the first job creates a new task id
the second job (initiated immediately after the first job) does the reuse properly
if I wait for the first job to finish and then run a second job with the same name, it does not reuse the task
is this expected?
tried it from a single workstation, but I get the same unexpected behavior
same project/task names, same workstation running the job, anything else I should check to confirm those are identical jobs?
every new identical job starts with ClearML Task: created new task id=...
in case this helps someone else: I did not have root access to the training machine to add the cert to the store
you can point your python to your own CA using:
export CURL_CA_BUNDLE=/path/to/CA.pem
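To double-check which bundle a Python process will actually pick up, a small helper can mirror the lookup order. As far as I know, `requests` checks `REQUESTS_CA_BUNDLE` first and then `CURL_CA_BUNDLE` when `verify` is left at its default; `resolve_ca_bundle` below is a hypothetical helper written under that assumption, not part of any library.

```python
import os


def resolve_ca_bundle(env=None):
    """Return the CA bundle path taken from the environment, or None."""
    env = os.environ if env is None else env
    # Assumed precedence: REQUESTS_CA_BUNDLE wins over CURL_CA_BUNDLE.
    return env.get("REQUESTS_CA_BUNDLE") or env.get("CURL_CA_BUNDLE") or None


# Example usage with the standard library only:
# import ssl
# ctx = ssl.create_default_context(cafile=resolve_ca_bundle())
```

Printing `resolve_ca_bundle()` inside the job is a quick way to confirm the export reached the process, for example when the job is launched through a scheduler that does not forward the shell environment.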
the same basic job does not get overwritten; a new one is created every time
I have tried a small task that only uploads a single file
from PIL import Image
logger = task.get_logger()
img = Image.open("./1_model.png").convert("RGB")
logger.report_image(title="cfg_0", series="Model", iteration=1, image=img)
ended with:
Retrying (Retry(total=0, connect=5, read=5, redirect=5, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)'))': /
202...
didn't do that test
I usually wait for first job to finish before I start new one
got it, I don't really understand why it happens, quite certain I didn't see this in the past
so I think I'm in the right direction
adding verify=
and pointing to my CA.pem looks like the right approach
now, how do I use it with ClearML API?
cleanup_service
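for task in tasks:
    try:
        deleted_task = Task.get_task(task_id=task.id)
        print(deleted_task.name)
        deleted_task.delete(
            delete_artifacts_and_models=True,
            skip_models_used_by_other_tasks=True,
            raise_on_error=False,
        )
    except Exception as ex:
        # report the failure and continue with the next task
        print(f"Failed to delete task {task.id}: {ex}")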
it throws the SSL error,...
I built a basic nginx container
` FROM nginx
COPY ./default.conf /etc/nginx/conf.d/default.conf
COPY ./includes/ /etc/nginx/includes/
COPY ./ssl/ /etc/ssl/certs/nginx/ `
I copied the signed certificates and the modified nginx default.conf
the important part is to modify the compose file to redirect all traffic to the nginx container
` reverse:
container_name: reverse
image: reverse_nginx
restart: unless-stopped
depends_on:
- apiserver
- webserver
- fil...
I'm using an rpm-based machine, but I get your direction
put the cert in the right place for python to look for it automatically
can I assume that if it works smoothly with requests
or urllib3
it will work for the ClearML API?
looking into ES index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b
docs.count docs.deleted store.size pri.store.size
2118131043 29352476 265.1gb 265.1gb
sounds like we're hitting some ES limitation?
api calls behave much better
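A quick sanity check on those numbers: Lucene caps a single index (one Elasticsearch shard) at `Integer.MAX_VALUE - 128` documents, and to my understanding deleted documents still count against that cap until segment merges drop them. The sketch below assumes this index has a single primary shard, which the output above does not confirm; `shard_headroom` is a hypothetical helper.

```python
# Lucene's hard cap on documents in one index (one Elasticsearch shard):
# Integer.MAX_VALUE - 128
LUCENE_MAX_DOCS = 2**31 - 1 - 128

docs_count = 2_118_131_043   # docs.count from the index stats above
docs_deleted = 29_352_476    # docs.deleted from the same row


def shard_headroom(count: int, deleted: int, limit: int = LUCENE_MAX_DOCS) -> int:
    """Documents that can still be added before the shard reaches the cap."""
    return limit - (count + deleted)


print(shard_headroom(docs_count, docs_deleted))  # prints 0
```

With live plus deleted documents adding up to exactly 2,147,483,519, the shard would have no room left under that assumption, which fits the symptom that existing data stays readable while new writes fail.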
no problem to query tasks in other projects
from clearml import Task
task = Task.init(project_name="Inbar2022/LanguageFactoryDanish/lions_test", task_name="lions3")
python main.py --cuda --epoch 1
to be honest, the use case is mostly convenience
when people train ~5000+ experiments, all saved in a few sub folders with a long string as the experiment name
before publishing a paper, for example, we want to move/copy a small number of successful trainings to a separate location and share them with other colleagues/management
I'd guess the alternatives can be
changing the name of the successful training under the existing sub folder
using move instead of clone
anything else?
@<1523701435869433856:profile|SmugDolphin23> thanks for good pointers
it did not work on the first attempt - requests
did not validate the certs correctly
I have added this:
token_req = requests.get(api_server + "/auth.login", verify="<my_org_CA>", auth=(access_key, secret_key))
print(token_req)
I got back
<Response [200]>
which I believe is good, right?
when adding token = token_req.json()["data"]["token"]
I got errors from json decoder, which I believe is exp...
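A 200 status with a body that is not JSON can happen when the URL reaches something other than the API server (for example a web UI or a proxy page); that is only a guess here. A defensive parse makes the cause visible instead of a bare decoder error. This is a minimal sketch: `extract_token` is a hypothetical helper, and the `data.token` layout is taken from the snippet above.

```python
import json


def extract_token(body: bytes, content_type: str = "") -> str:
    """Return data.token from an auth.login response body, or raise ValueError."""
    try:
        payload = json.loads(body)
    except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
        preview = body[:60].decode("utf-8", errors="replace")
        raise ValueError(
            f"response is not JSON (Content-Type: {content_type!r}, "
            f"starts with {preview!r})"
        ) from err
    try:
        return payload["data"]["token"]
    except (KeyError, TypeError) as err:
        raise ValueError("JSON response has no data.token field") from err


# Example usage with the response object from the snippet above:
# token = extract_token(token_req.content, token_req.headers.get("Content-Type", ""))
```

If the error message shows an HTML preview, checking that `api_server` points at the API port rather than the web port would be the first thing to try.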
not really - I can try to run these in parallel