Elasticsearch also takes like 15GB of RAM
hi, thanks for reaching out. Getting desperate here.
Yes, it's self-hosted
No, only currently running experiments are deleted (the task itself is gone, but debug images and models are still present in the fileserver folder)
What I do see are some random Elasticsearch errors popping up from time to time:
[2024-01-05 09:16:47,707] [9] [WARNING] [elasticsearch] POST None [status:N/A requ...
From docker inspect I can see that allegroai/clearml uses:
"CLEARML_SERVER_VERSION=1.11.0",
"CLEARML_SERVER_BUILD=373"
Image hash: ed05631045c4237f59ad48f477e06dd72274ab67e70d2f9adc489431d1ce75d7
I see the debug images in the fileserver folder
- Is a 50GB Elasticsearch store normal? Have you seen it elsewhere, or are we doing something wrong? One thing I suspect is that we are probably logging too frequently (see the quick index check sketched below)
- Is it possible to somehow clean this up?
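To see which indices are actually eating the space, something like this against the Elasticsearch container should work (a minimal sketch; it assumes ES is reachable on localhost:9200 inside the docker network with no auth):

```python
import requests

# List Elasticsearch indices sorted by on-disk size, largest first.
# localhost:9200 is an assumption -- adjust to wherever the ES container listens.
resp = requests.get("http://localhost:9200/_cat/indices?v&s=store.size:desc")
print(resp.text)
```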
I have also noticed that this incident usually happens in the morning, around 6-7 AM
Are there maybe some cleanup tasks or backups running on the ClearML server at those times?
The incident happened last Friday (5 January)
I'm attaching logs from around that time
I see in clearml-agent that it is created here
@<1523703436166565888:profile|DeterminedCrab71> Thanks for responding
It was unclear to me that I also needed to set port 443 everywhere in clearml.conf
Setting the S3 host URLs with port 443 in clearml.conf and also in the web UI made it work
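For reference, this is roughly what the working section in clearml.conf looks like now (a sketch; host, key, and secret are placeholders):

```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    # the port has to be spelled out explicitly in the host
                    host: "s3.mycompany.com:443"
                    key: "<access-key>"
                    secret: "<secret-key>"
                    secure: true
                }
            ]
        }
    }
}
```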
I'm now almost at the finish line. The last thing that would be great to fix is archived task deletion.
For some reason I get an error about missing S3 keys in the ClearML docker compose logs, and the folders/files are not deleted from the S3 bucket.
You can see how storage_credentials.co...
I'm basically trying to force the agent to use the uv-defined Python
ok, I found it.
Are S3 links supported?
I have tried:
Airflow - pain to set up, old UI, and other problems
Prefect - literally just tried to set up a simple distributed system, took me a week; I do not recommend this tool at all, horrible documentation, no one helps on Slack
Dagster - absolute beauty, nice UI, easy to set up (as a pip package or just a docker + postgres), I highly recommend this tool. Takes a bit to get used to it. In the coming week I will try this combo of dagster + clearml, where I periodically check some things and if...
I'm also batch uploading, maybe that's the problem?
- The dataset is about 1TB, containing 1 million files
- I don't have the SSD space locally to do the upload in one go
- So I download a part of the dataset, use add_files(), and then upload() for that batch (roughly the loop sketched after this list)
- Upload the dataset
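A minimal sketch of that loop; download_batch() and num_batches are hypothetical placeholders for our own chunk-fetching logic:

```python
import shutil
from clearml import Dataset

ds = Dataset.create(dataset_name="my-1tb-dataset", dataset_project="datasets")

for batch_idx in range(num_batches):        # num_batches: hypothetical
    local_dir = download_batch(batch_idx)   # hypothetical helper, fetches one chunk
    ds.add_files(path=local_dir)
    ds.upload(show_progress=True)           # push this batch's content now
    shutil.rmtree(local_dir)                # free the SSD before the next chunk

ds.finalize()
```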
I noticed that each batch gets slower and slower
I'm doing all of this because there isn't (or I'm not aware of) any good way to understand which datasets are on the workers
- Here is what the client-side clearml.conf looks like, together with the script I'm using to create the tasks. Uploads seem to work and are fixed thanks to you guys 🙌
I can add "source /workspace/.venv/bin/activate" to docker_init_bash_script in clearml.conf
However, it then tries to access pip, but I don't need pip at all; how do I disable it? I already have my packages, and uv doesn't even require pip
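For context, this is the direction I'm trying (a sketch; whether the skip variables fully bypass pip depends on the agent version, so that part is an assumption):

```
agent {
    docker_init_bash_script: [
        "source /workspace/.venv/bin/activate",
    ]
}
# Possibly also, in the agent's environment (assumption: honored by recent agents):
#   CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
#   CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/workspace/.venv/bin/python
```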
So from our IT guys I now know that:
The "s3" part of the URL is a subdomain; we use it in all other libs like boto3 and cloudpathlib and never had any problems
This is where the crash happens inside the clearml Task
What do you mean by reusing the task for a ClearML Dataset? Got a code example?
We have multiple different projects with multiple people working on each project.
This is our most used code for dataset uploading
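The snippet itself isn't in this excerpt, but from the surrounding messages it is essentially the standard one-shot flow (a reconstruction; the names and path are assumed):

```python
from clearml import Dataset

# One-shot upload: requires the full dataset on local disk first,
# which is exactly the 1TB+ storage problem described below.
ds = Dataset.create(dataset_name="my-dataset", dataset_project="datasets")
ds.add_files(path="/data/full_dataset")   # assumed local path
ds.upload(show_progress=True)
ds.finalize()
```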
Our datasets are more than 1TB in size and will keep growing (probably to 4TB and up). This means we also need 4TB of local storage just to upload the dataset back in zipped format. That is not a good solution.
What we could do, I guess, is do the downloading locally in chunks of files?
Download 100 files locally, add them to the ClearML dataset, repeat
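If the files already live in S3, it might be possible to skip the local round-trip entirely with add_external_files, which registers links instead of copying the content (a sketch; the bucket and prefix are placeholders, and I'm assuming the external-links workflow fits our setup):

```python
from clearml import Dataset

ds = Dataset.create(dataset_name="my-1tb-dataset", dataset_project="datasets")

# Register the S3 objects as external links -- no local download/upload of the data
ds.add_external_files(source_url="s3://my-bucket/datasets/raw/", recursive=True)

ds.upload()     # uploads only the dataset state/metadata, not the linked files
ds.finalize()
```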
@<1523701435869433856:profile|SmugDolphin23> Setting it without http is not possible, as it auto-fills it back in
In which UI? Because there are two ways to do it. When clicking on an artifact URL there is a popup (but it has no way to change the host URL)
Our S3 host doesn't have a port (I didn't specify a port anywhere in clearml.conf and upload works)
or the user's PC? Or both?
It's self-hosted S3, that's all I know; I don't think it's MinIO
Adding the bucket in clearml.conf causes the same error: clearml.storage - ERROR - Failed uploading: Could not connect to the endpoint URL: " None "
.get_size()