@<1523701070390366208:profile|CostlyOstrich36> Updated the webserver and the problem still persists.
This is the new stack:
WebApp: 1.15.1-478 • Server: 1.14.1-451 • API: 2.28
Note that we didn't update the API (we had running experiments).
Is it even known whether the bug is fixed in that version?
It looks like I'm moving forward.
Setting the URL in clearml.conf without "s3", as suggested, works (but I don't add a port there; not sure if that breaks something, since we don't have a port).
host: " "
Then in
task: clearml.Task = clearml.Task.init(
output_uri=" None ",
I think the connection is created.
What I'm getting now is a bucket error; I suppose I have to specify it, so...
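A minimal sketch of how the bucket could be spelled out in `output_uri`, assuming the ClearML convention for non-AWS S3 endpoints of `s3://host:port/bucket/path`. The helper name `build_output_uri`, the host `storage.example.com` and the bucket name are hypothetical placeholders, not values from this thread.

```python
from urllib.parse import urlsplit


def build_output_uri(endpoint: str, bucket: str, prefix: str = "") -> str:
    """Combine an S3-compatible endpoint and a bucket into an s3:// URI.

    Hypothetical helper for illustration; it only formats a string.
    """
    if not bucket.strip("/"):
        raise ValueError("a bucket name is required")
    # Accept endpoints written with or without a scheme (http://, https://).
    parts = urlsplit(endpoint if "//" in endpoint else f"//{endpoint}")
    if not parts.hostname:
        raise ValueError(f"cannot parse a host from {endpoint!r}")
    host = parts.hostname
    # Keep the port only when the endpoint actually specifies one.
    if parts.port:
        host = f"{host}:{parts.port}"
    # Join bucket and optional prefix without doubled or trailing slashes.
    path = "/".join(p.strip("/") for p in (bucket, prefix) if p.strip("/"))
    return f"s3://{host}/{path}"
```

The resulting string would then be passed as `output_uri` to `clearml.Task.init`; whether the port is required depends on the server setup.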
@<1523701435869433856:profile|SmugDolphin23> Setting it without http is not possible, as it auto-fills it back in.
We don't need a port.
"s3" is part of the URL that is configured on our routers; without it we cannot connect.
Getting errors in Elasticsearch when deleting tasks; "cant delete experiment" gets returned.
No, I specify where to upload.
I see the data is being uploaded to the S3 bucket. Just the log messages are really confusing.
@<1523701070390366208:profile|CostlyOstrich36> Any news on this? We are currently stuck without this fix and can't finish the ClearML setup.
Is fileserver folder needed for successful backup?
@<1523701070390366208:profile|CostlyOstrich36> Is it still needed, since Eugene thinks there is a bug?
Hi, thanks for reaching out. Getting desperate here.
Yes, it's self-hosted.
No, only currently running experiments are deleted (the task itself is gone, but debug images and models are still present in the fileserver folder).
What I do see is some random Elasticsearch errors popping up from time to time:
[2024-01-05 09:16:47,707] [9] [WARNING] [elasticsearch] POST
None ` [status:N/A requ...
Yes, but does add_external_files make chunked zips like add_files does?
WebApp: 1.14.1-451 • Server: 1.14.1-451 • API: 2.28
Bump, still waiting. It's closing in on a month that we have been unable to deploy. We have a team of 10+ people.
I solved the problem.
I had to add a TensorBoard logger and pass it to the pytorch_lightning trainer with logger=logger.
Is that normal?
Do I need clearml.conf on my ClearML server (in the config folder that is mounted in docker-compose) or on the user's PC? Or both?
It's self-hosted S3, that's all I know. I don't think it's MinIO.
I'm also batch uploading, maybe that's the problem?
- The dataset is about 1TB containing 1 million files
- I don't have the SSD space locally to do the upload in one go
- So I download a part of the dataset, then call add_files() and upload() for that batch
- Upload the dataset
I noticed that each batch is slower and slower
@<1523701435869433856:profile|SmugDolphin23> Any ideas how to fix this?
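The batching flow described above could be sketched roughly like this. The grouping helper is plain Python; the ClearML part is left as comments because it depends on the actual setup. `iter_batches`, `download_batch`, `remote_listing` and the project/dataset names are hypothetical, and this does not address why later batches get slower.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_batches(
    files: Iterable[Tuple[str, int]], max_bytes: int
) -> Iterator[List[str]]:
    """Group (path, size_in_bytes) pairs into batches within a disk budget.

    A single file larger than max_bytes still gets a batch of its own.
    """
    batch: List[str] = []
    used = 0
    for path, size in files:
        # Start a new batch when this file would exceed the budget.
        if batch and used + size > max_bytes:
            yield batch
            batch, used = [], 0
        batch.append(path)
        used += size
    if batch:
        yield batch


# Assumed usage with ClearML, mirroring the steps in the list above
# (not executed here; download_batch is a hypothetical helper):
#
# from clearml import Dataset
# ds = Dataset.create(dataset_name="my-dataset", dataset_project="my-project")
# for batch in iter_batches(remote_listing, max_bytes=200 * 1024**3):
#     local_dir = download_batch(batch)  # fetch only this part locally
#     ds.add_files(path=local_dir)
#     ds.upload()
# ds.finalize()
```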
You can check out boto3 python client (This is what we use to download / upload all S3 stuff), but minio-client probably already uses it under the hood.
We also use aws cli to do some downloading, it is way faster than python.
Regarding PDFs, yes, you have no choice but to preprocess them.
@<1709740168430227456:profile|HomelyBluewhale47> We have the same problem: millions of files, stored on Ceph. I would not recommend doing it this way. Everything gets insanely slow (dataset.list_files, downloading the dataset, removing files).
The way I now use ClearML Datasets for a large number of samples is to save a JSON that stores all sample paths in the Dataset metadata:
clearml_dataset.set_metadata(metadata, metadata_name=metadata_key)
However this then means that you need wrappe...
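A rough sketch of what such a path manifest might look like, assuming the metadata is a plain JSON-serializable dict. The function name `build_manifest`, the `"samples"` key and the example URIs are hypothetical; only the `set_metadata` call comes from the message above.

```python
import json
from typing import Dict, Iterable, List


def build_manifest(sample_uris: Iterable[str]) -> Dict[str, List[str]]:
    """Collect sample locations into a small, JSON-serializable manifest.

    Duplicates are dropped and the order is made deterministic.
    """
    return {"samples": sorted(set(sample_uris))}


manifest = build_manifest(
    [
        "s3://bucket/images/0002.png",  # placeholder URIs
        "s3://bucket/images/0001.png",
        "s3://bucket/images/0001.png",
    ]
)

# The manifest survives a JSON round trip, so it can be stored as metadata.
restored = json.loads(json.dumps(manifest))

# Assumed usage with an existing ClearML dataset object (not executed here):
# clearml_dataset.set_metadata(manifest, metadata_name="sample_paths")
```

With this layout the dataset itself holds no files, so the code that consumes it has to resolve and fetch the listed paths on its own.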
@<1523703436166565888:profile|DeterminedCrab71> Thanks for responding
It was unclear to me that I also need to set 443 everywhere in clearml.conf.
Setting the S3 host URLs with 443 in clearml.conf and also in the web UI made it work.
I'm now almost at the finish line. The last thing that would be great is to fix archived task deletion.
For some reason I get an error about missing S3 keys in the ClearML docker compose logs, and the folders/files are not deleted in the S3 bucket.
You can see how
Can I do it while I have multiple training runs ongoing?
Elasticsearch also takes around 15GB of RAM.