
to be honest, the use case is mostly convenience
when people train ~5000+ experiments, all saved in a few sub folders with long strings as experiment names
before publishing a paper, for example, we want to move/copy a small number of successful trainings to a separate location and share them with colleagues/management
I'd guess the alternatives can be:
changing the name of the successful training under the existing sub folder
using move instead of clone
anything else? (a rough sketch of the clone flow is below)
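for illustration, a minimal sketch of collecting the successful runs and cloning them into a shared project, assuming a recent clearml SDK where Task.get_tasks accepts a tags filter and Task.clone exists; the "publish" tag and the target project name are hypothetical:
` from clearml import Task

# hypothetical tag and target project, for illustration only
winners = Task.get_tasks(
    project_name="Inbar2022/LanguageFactoryDanish",
    tags=["publish"],
)
for t in winners:
    # clone leaves the original run untouched; note: depending on the
    # SDK version, project may expect a project id rather than a name
    Task.clone(source_task=t, name=t.name, project="SharedResults") `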
hey @<1523701827080556544:profile|JuicyFox94>
standard standalone Linux using compose
I'm looking at iptables configuration that was done by other teams
trying to find which rule blocks clearml
(all worked when iptables disabled)
oh boy, how much I hate reverse engineering a setup I didn't do 😞
I'll dig in more
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
iptables -A INPUT -p tcp --dport 8008 -j ACCEPT
iptables -A INPUT -p tcp --dport 8081 -j ACCEPT
the application is functional on localhost for sure
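to narrow down which rule is the problem, a quick reachability check of the three ClearML ports from a remote machine might help (a minimal stdlib sketch; the hostname is a placeholder):
` import socket

# placeholder hostname - replace with your clearml server address
HOST = "my-clearml-server"

# 8080 = webserver, 8008 = apiserver, 8081 = fileserver
for port in (8080, 8008, 8081):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(3)
    try:
        sock.connect((HOST, port))
        print(f"port {port}: reachable")
    except OSError as exc:
        print(f"port {port}: blocked/unreachable ({exc})")
    finally:
        sock.close() `
if all three connect from the remote side, the iptables rules above should be sufficient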
got it, I don't really understand why it happens, quite certain I didn't see this in the past
didn't do that test
I usually wait for first job to finish before I start new one
okie, so this works only if jobs run in parallel
the first job creates a new task id
the second job (initiated immediately after the first) does the reuse properly
if I wait for the first job to finish and then run a new second job with the same name, it will not reuse
is this expected?
not really - I can try to run these in parallel
tried it from a single workstation, but I get the same unexpected behavior
same project/task names, same workstation running the job, anything else I should check to confirm those are identical jobs?
every new identical job starts with ClearML Task: created new task id=...
I have another instance with clearml-server 1.7 and I get the same behavior
am I missing anything? I was under the assumption that jobs with the same project/task names should be overwritten and not duplicated
from clearml import Task
task = Task.init(project_name="Inbar2022/LanguageFactoryDanish/lions_test", task_name="lions3")
python main.py --cuda --epoch 1
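for what it's worth, Task.init exposes a reuse_last_task_id argument that governs this: it defaults to True, but the SDK still creates a new task when the previous one was published/archived or already logged artifacts/models, which may explain the behavior. you can also pass an explicit task id to force reuse; a sketch:
` from clearml import Task

task = Task.init(
    project_name="Inbar2022/LanguageFactoryDanish/lions_test",
    task_name="lions3",
    # pass a previous task id (string) to force reusing that task;
    # the id below is a placeholder
    reuse_last_task_id="<previous-task-id>",
) `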
compare between the two tasks
SuccessfulKoala55 any clue?
VivaciousPenguin66 your docs were helpful, I got SSL running but my question remains
have you kept the needed http services accessible and only run the authentication via https?
api_server: "http://<my-clearml-server>:8008"
web_server: ""
files_server: "http://<my-clearml-server>:8081"
my current state is that the webserver is accessible via http and https, on both 8080 & 443
Hi VivaciousPenguin66
thanks for sharing, giving it a try now
after you set up the webserver to point to 443 with HTTPS, what have you done with the rest of the http services clearml is using?
did the webserver on 8080 remain accessible, and are you directing to it in your ~/clearml.conf?
what about the apiserver and fileserver? (8008 & 8081)
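for context, this is the split I mean in ~/clearml.conf (a sketch only - the hostname is a placeholder, and whether api/files should also go through nginx is exactly my question):
` api {
    web_server: "https://my-clearml-server:443"
    api_server: "http://my-clearml-server:8008"
    files_server: "http://my-clearml-server:8081"
} `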
AgitatedDove14 indeed there are a few sub projects
do you suggest deleting those first?
the same basic job doesn't get overwritten, but a new one is created every time
I didn't see anything useful in elastic/mongo/api
I also see significant slowness when querying my experiments
no filtering for sure
if I send a link to a task, sometimes it loads and sometimes it's stuck
I'd guess mongo is choking, not sure why
I think there are some experiments that are messing up mongodb
this log line looks unusual in the clearml-mongo logs:
{"t":{"$date":"2023-09-19T12:15:50.685+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn73","msg":"Slow query","attr":{"type":"command","ns":"backend.model","command":{"distinct":"model","key":"project","query":{"$and":[{"$or":[{"company":{"$in":["d1bd92a3b039400cbafc60a7a5b1e52b",null,""]}},{"company":{"$exists":false}}]},{"user":{"$in":["197aea8467d3f471fc0db98b57ed80fa"]...
app.component.ts:138 ERROR TypeError: Cannot read properties of null (reading 'id')
at projects.effects.ts:60:56
at o.subscribe.a (switchMap.js:14:23)
at p._next (OperatorSubscriber.js:13:21)
at p.next (Subscriber.js:31:18)
at withLatestFrom.js:26:28
at p._next (OperatorSubscriber.js:13:21)
at p.next (Subscriber.js:31:18)
at filter.js:6:128
at p._next (OperatorSubscriber.js:13:21)
at p.next (Subscriber.js:31:18)
and I also see it when trying to...
I had a slightly similar scenario ~1 year and a few versions back
there was some task that wrote a lot of tasks and mongo didn't take it nicely
I was able to identify it only by questioning users; eventually one of them stopped sending, mongo started to come back, and everything returned to normal
we never came to any wise conclusion about the root cause or how to identify it
not sure it's the same use case but I will begin to ask around
if you have any other hint on how to query mongo and look for the potential culprit - I'll be glad to hear it
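in case it helps, a sketch of the kind of mongo query I have in mind, via pymongo; the "task" collection name is an assumption (mirroring the backend.model namespace in the slow-query log above), and the connection string depends on your deployment:
` from pymongo import MongoClient

# adjust the connection string to how mongo is exposed in your compose setup
client = MongoClient("mongodb://localhost:27017")
db = client["backend"]  # db name taken from the slow-query log above

# count tasks per project to spot a runaway experiment
pipeline = [
    {"$group": {"_id": "$project", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]
for row in db["task"].aggregate(pipeline):
    print(row["_id"], row["count"]) `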
api calls behave much better
no problem querying tasks in other projects
SuccessfulKoala55 where does the SDK store the cache?
we have a cluster with shared storage, so all compute nodes running the jobs have the same storage
should I assume it will use the cache and overwrite identical jobs?
trying to reproduce this, but still every new identical job gets a new task ID
I built a basic nginx container
` FROM nginx
COPY ./default.conf /etc/nginx/conf.d/default.conf
COPY ./includes/ /etc/nginx/includes/
COPY ./ssl/ /etc/ssl/certs/nginx/ `
copied the signed certificates and the modified nginx default.conf
the important part is to modify the compose file to redirect all traffic to the nginx container
` reverse:
    container_name: reverse
    image: reverse_nginx
    restart: unless-stopped
    depends_on:
      - apiserver
      - webserver
      - fil...