ERROR: Could not install packages due to an EnvironmentError:
[Errno 28] No space left on device
BTW: @<1523703080200179712:profile|NastySeahorse61> this sounds like Docker running out of space on the main disk ( /var/ ) where it stores all the images and temp file systems
This will cause your code to fail, as any runtime change to the container file system will raise this out-of-disk-space error
I think it fails because it tries to install trains twice. Could you remove the trains package and test? I'm also curious how you ended up with both installed?!
Hi DefeatedCrab47
You should be able to change the web server port, but the API port (8008) cannot be changed. If you can log in to the web app and create a project, it means everything is okay. Just make sure that when you configure trains ( trains-init ) the port numbers are correct 🙂
Oh, and good job starting your reference with an author that goes early in the alphabetical ordering, lol:
LOL, worst case it would have been C ... 🙂
for a TPU with more than 16GB GRAM and less than 40GB, so sometimes we need to provision an A100 to get the training speed we want, but we don't use all the GRAM
Oh that makes sense...
Just saw this one, this might help?
https://www.globenewswire.com/news-release/2022/10/24/2539924/0/en/ClearML-and-Genesis-Cloud-Announce-New-MLOps-Partnership-Delivering-100-Green-Energy-Compute-Solution-for-Machine-Learning.html
Hi WickedGoat98
Will I need to wrap their execution in python by system calls?
That would probably be the easiest solution 🙂
Then you can plug it into your pipeline as a preprocessing Task:
You can check this example:
https://github.com/allegroai/trains/tree/master/examples/pipeline
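Something along these lines, as a rough sketch (the tool name, paths, and project/task names below are placeholders, and the pipeline wiring itself follows the linked example):

```python
# Rough sketch: a preprocessing Task that just shells out to an external tool.
# The binary name, arguments and project/task names are placeholders.
import subprocess
from clearml import Task

task = Task.init(project_name="examples", task_name="preprocessing step")

# run the external (non-python) tool via a system call
subprocess.check_call(
    ["./my_external_tool", "--input", "data/raw", "--output", "data/processed"]
)
```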
BoredHedgehog47 were you able to locate the issue ?
Another point I see is that in the workers & queues view, the GPU usage is not being reported
It should be reported, if it is not, maybe you are running the trains-agent in cpu mode ? (try adding --gpus)
However, once I extract the zips (or download the dataset through Python API or CLI) not all the files are there.
And all the files are registered in the metadata? Could you add --verbose to the sync command to see what it is doing?
"clearml-data add --folder ./*" seems to fix this issue though it doesn't preserve my directory structure
This is also odd, it should Not flatten the folder structure. What is your OS / Python / clearml version?
Is this reproducible ? if so, how ...
Hmm check if this one works:
optimizer._get_child_tasks_ids(
    parent_task_id=optimizer._job_parent_id or optimizer._base_task_id,
    order_by=optimizer._objective_metric._get_last_metrics_encode_field(),
    additional_filters={'page_size': int(top_k), 'page': 0}
)
If it does, let's PR it as a dedicated function
FYI all the git pulls are cached even in docker mode so there is no "tax" to pay for pulling the sub-modules (only the first time of course)
So I can set output_uri = "s3://<bucket_name>/prefix" and the local models will be loaded into the s3 bucket by ClearML ?
Yes, magic 🙂
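Roughly like this (bucket name and prefix are placeholders):

```python
# Minimal sketch: any model saved locally during this task is uploaded to the
# configured S3 location by ClearML.
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="training",
    output_uri="s3://<bucket_name>/prefix",
)
```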
hey, that worked! what library is being used that reads that configuration?
It's passed to boto3, but the python interface and the aws cli use different configurations, I guess, because otherwise it should have worked...
Our datasets are more than 1TB in size and will grow in size (probably 4TB and up), this means we also need 4TB local storage
Yes, because somewhere you will have to store your unzipped files.
Or you point to the S3 bucket and fetch the data when you need to access it (or prefetch it) with the S3 links the Dataset stores, i.e. only when accessed
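Something like this, as a sketch (the dataset project/name are placeholders):

```python
# Sketch: inspect the registered files without downloading anything, then fetch
# (or reuse a cached copy of) the data only when it is actually needed.
from clearml import Dataset

ds = Dataset.get(dataset_project="examples", dataset_name="my_dataset")
print(ds.list_files())            # just the file listing, no download
local_path = ds.get_local_copy()  # pulls from the stored S3 links on access
```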
Dynamic GPU option only available with Enterprise version right?
Correct 🙂
Hmm, I think "it" misses the fact callbacks are not a package.
Any chance you can post the code here? (or DM me)
but the debug samples and monitored performance metric show a different count
Hmm, could you expand on what you are getting, and what you are expecting to get?
Hi @<1547028074090991616:profile|ShaggySwan64>
I'm guessing just copying the data folder with rsync is not the most robust way to do that since there can be writes into mongodb etc.
Yep
Does anyone have experience with something like that?
Basically you should just back up the 3 DBs (mongo, redis, elastic), each based on its own backup workflow. Then just rsync the files server & configuration.
WickedGoat98 give me a minute, I'm not sure it is not ClearML related
You can query the system and get all the experiments based on date, then grab the machine GPU metrics.
DefeatedCrab47 check the cleanup service, it queries the system with the APIClient.
https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/examples/services/cleanup/cleanup_service.py#L72
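As a rough sketch of the query part (the filter/field names here are assumptions, see the cleanup service for a working reference; the GPU metrics would then be pulled per machine/worker):

```python
# Rough sketch: list completed experiments ordered by date with the APIClient.
from clearml.backend_api.session.client import APIClient

client = APIClient()
tasks = client.tasks.get_all(
    status=["completed"],
    order_by=["-last_update"],
    page=0,
    page_size=100,
)
for t in tasks:
    print(t.id, t.name, t.last_update)
```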
Hi RipeGoose2
Could you expand on "inconsistency in the iteration reporting" ? Also "calling trainer.fit multiple" would you expect it to show as a single experiment or is it kind of param search ?
Hi CleanPigeon16
You need to pass the private repository docker credentials to the aws instance, I would use the custom bash script option of the aws autoscaler to create the docker credentials file.
ReassuredTiger98 no, but I might be missing something.
How do you mean project-specific?
Yes, that makes sense. Then you would need to use either the AWS vault features or the ClearML vault features ...
Do you want to open an issue in pip?
Funny enough, this works:
pip3 install "torch >=2.1.0.*, <2.1.1.*" --extra-index-url
ReassuredTiger98
How can I make clearml-agent use the pre-installed version from the nvidia/pytorch container?
If the same version is required, the agent will not try to reinstall it (the new venv the agent creates inside the container inherits from the preinstalled system packages)
Comes with PyTorch version 1.12 based on a commit. I tried torch >= 1.11 , torch == 1.12
If in your installed packages you have torch==1.12, the agent should not try to reinstall it.
ThickDove42 looking at the code, I suspect it fails interacting with the actual jupyter server (that is running on the same machine, but still).
Any chance you have a firewall on the Windows machine ?
My question is what should be the path to the requirements.txt file?
Is it relative to the repo base?
This is actually resolved at runtime (i.e. when running the code), so it is relative to the working directory. Makes sense? (you can specify an absolute path, but that's probably something I would avoid in the code base...)
Could it be you have two entries of "console_cr_flush_period" ?
'config.pbtxt' could not be inferred. please provide specific config.pbtxt definition.
This basically means there is no configuration on how to serve the model, i.e. the size/type of the lower (input) layer and the output layer.
You can either store the configuration on the creating Task, as is done here:
https://github.com/allegroai/clearml-serving/blob/b5f5d72046f878bd09505606ca1147d93a5df069/examples/keras/keras_mnist.py#L51
Or you can provide it as a standalone file when registering the model...
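As a rough sketch of the first option (the layer names, shapes and helper call are assumptions on my side; the linked keras example shows the exact way it is done there):

```python
# Rough sketch: attach a Triton-style config.pbtxt to the model-creating Task so
# the serving side can pick it up. Layer names and dims are illustrative only.
from clearml import Task

task = Task.init(project_name="examples", task_name="train keras mnist")

config_pbtxt = """
input [{ name: "dense_input", data_type: TYPE_FP32, dims: [-1, 784] }]
output [{ name: "activation_2", data_type: TYPE_FP32, dims: [-1, 10] }]
"""
task.set_configuration_object(name="config.pbtxt", config_text=config_pbtxt)
```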