SuccessfulKoala55 where does the SDK store the cache?
we have a cluster with shared storage, so all compute nodes running the jobs have the same storage
should I assume it will use the cache and overwrite identical jobs?
trying to reproduce this, but every new run of the same job still gets a new task ID
OK, so this works only if jobs run in parallel
the first job creates a new task ID
the second job (started immediately after the first) does the reuse properly
if I wait for the first job to finish and then run a new job with the same name, it does not reuse the task
is this expected?
I have another instance with clearml-server 1.7 and I get the same behavior
am I missing anything? I was under the assumption that jobs with the same project/task names would be overwritten and not duplicated
from clearml import Task
task = Task.init(project_name="Inbar2022/LanguageFactoryDanish/lions_test", task_name="lions3")
python main.py --cuda --epoch 1
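(for reference, my understanding of the knob that controls this - a sketch, not something I verified; reuse_last_task_id is the argument I mean)
from clearml import Task

# my understanding: with reuse_last_task_id=True (the default), the previous task with the
# same project/name is only reset and reused while it is still "empty enough" (e.g. not
# published and without logged outputs); otherwise a new task ID is created.
# Passing a specific task ID string instead of True should force reuse of that task.
task = Task.init(
    project_name="Inbar2022/LanguageFactoryDanish/lions_test",
    task_name="lions3",
    reuse_last_task_id=True,
)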
got it. I don't really understand why it happens; I'm quite certain I didn't see this in the past
looking into ES index events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b
docs.count docs.deleted store.size pri.store.size
2118131043 29352476 265.1gb 265.1gb
sounds like we're hitting some ES limitation?
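(the numbers above are _cat/indices output; a rough sketch of how to pull them, assuming ES is reachable on localhost:9200 - it usually only listens inside the clearml-server docker network)
import requests

# same stats via Elasticsearch's _cat/indices API (host/port are an assumption)
resp = requests.get(
    "http://localhost:9200/_cat/indices/events-training_stats_scalar-*",
    params={"v": "true", "h": "index,docs.count,docs.deleted,store.size,pri.store.size"},
)
print(resp.text)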
not really - I can try to run these in parallel
compare the two tasks
to be honest, the use case is mostly convenience
when people train ~5000+ experiments, all saved in a few sub-folders with long strings as experiment names
before publishing a paper, for example, we want to move/copy a small number of successful trainings to a separate location and share them with colleagues/management
I'd guess the alternatives could be:
changing the name of the successful training under the existing sub-folder
using move instead of clone
anything else?
thanks @<1523701070390366208:profile|CostlyOstrich36>
I've done this successfully using the API already
as for the SDK option - in which format should I provide the list of tasks/projects to the SDK?
for only_fields=["id", "name", "created", "status_changed", "status", "user"]:
output example
{'id': '02a3f5929cf246138994c9243a692219', 'name': 'docfm_v7_safe_32gpu80g_11Jan24_4w', 'created': datetime.datetime(2024, 1, 11, 9, 54, 33, 406000, tzinfo=tzutc()), 'status_changed': dateti...
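(roughly what I used to get that output, in case it helps - a sketch with the APIClient; the paging values are arbitrary)
from clearml.backend_api.session.client import APIClient

client = APIClient()
# request only the fields we care about; page/page_size values are arbitrary here
tasks = client.tasks.get_all(
    only_fields=["id", "name", "created", "status_changed", "status", "user"],
    page=0,
    page_size=500,
)
for t in tasks:
    print(t.id, t.name, t.status)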
so I have a large JSON with a list of task IDs
which I want to delete in bulk
the API is doable
how about the SDK? how do I provide a list of task IDs for deletion?
from the cleanup example:
for task in tasks:
    try:
        # fetch the full task object and delete it
        deleted_task = Task.get_task(task_id=task.id)
        deleted_task.delete()
    except Exception as ex:
        print("failed deleting task {}: {}".format(task.id, ex))
how do I set tasks when coming from a known list of task IDs?
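(what I ended up with is roughly this - a sketch; task_ids.json is just a placeholder name for my exported list)
import json
from clearml import Task

# task_ids.json is a placeholder name - a plain JSON list of task ID strings
with open("task_ids.json") as f:
    task_ids = json.load(f)

for task_id in task_ids:
    try:
        Task.get_task(task_id=task_id).delete()
    except Exception as ex:
        print(f"failed deleting {task_id}: {ex}")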
@<1523701070390366208:profile|CostlyOstrich36> unfortunately, this is not the behavior we are seeing
the same exact issue happened tonight
at epoch 53 ClearML shut down; the job did not continue to epoch 54 and eventually got killed by the watchdog timer
I didn't see anything useful in elastic/mongo/api
I also see significant slowness when querying my experiments
no filtering for sure
if I send a link to a task, sometimes it loads and sometimes it's stuck
OK I got everything to work
I think this script can be useful to other people and I'll be happy to share it
@<1523701070390366208:profile|CostlyOstrich36> is there some repo I can fork and contribute to?
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT   # web UI
iptables -A INPUT -p tcp --dport 8008 -j ACCEPT   # API server
iptables -A INPUT -p tcp --dport 8081 -j ACCEPT   # fileserver
we will probably end up pulling the images from docker.io and pushing those to our container registry
didn't do that test
I usually wait for the first job to finish before I start a new one
the application is functional on localhost for sure
I'm looking at the iptables configuration that was done by other teams
trying to find which rule blocks ClearML
(all worked when iptables disabled)
oh boy, how much I hate reverse engineering a setup I didn't do 😞
I'll dig in more
hey @<1523701827080556544:profile|JuicyFox94>
standard standalone Linux using compose
in the UI I also see the display name, so I pulled all the users' info and matched names to IDs
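(roughly how the name-to-id mapping can be built - a sketch with the APIClient, assuming each user entry exposes name and id)
from clearml.backend_api.session.client import APIClient

client = APIClient()
# build a display-name -> user-id lookup (assumes each user entry exposes .name and .id)
name_to_id = {u.name: u.id for u in client.users.get_all()}
print(name_to_id)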
AgitatedDove14 indeed there are a few sub-projects
do you suggest deleting those first?
ok, hopefully someone will share some thoughts and how it went for them 🙂
I have tried a small task that only uploads a single file
from clearml import Task
from PIL import Image
task = Task.init(project_name="examples", task_name="upload_test")  # placeholder project/task names
logger = task.get_logger()
img = Image.open("./1_model.png").convert("RGB")
logger.report_image(title="cfg_0", series="Model", iteration=1, image=img)
ended with:
Retrying (Retry(total=0, connect=5, read=5, redirect=5, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)'))': /
202...
@<1523701087100473344:profile|SuccessfulKoala55> looks OK (?)
>>> StorageHelper.get(Task._get_default_session().get_files_server_host())._container.session.verify
InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
True
SDK version: 1.14.4
clearml-server version: Server: 1.14.0-431 • API: 2.28
looks like I can't interact with the fileserver?
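(in case it helps anyone hitting the same certificate error: my understanding is that verification can be disabled in clearml.conf as a workaround, though adding the internal CA to the trust store is the proper fix - a sketch, assuming a self-signed/internal certificate is the cause)
# in clearml.conf - workaround only; installing the internal CA certificate properly is better
api {
    verify_certificate: false
}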