I’m also not exactly an expert here, but it must be Ceph if it’s possible to be so
Probably so, but not sure :( I’ll have to figure it out with our DevOps engineer
Finally solved it. It turned out to be an authentication issue: I had to use ACCESS_KEY/SECRET values different from those I used with the boto3 client
So it’s Ceph (RADOS) Object Gateway in my case
Hi, Eugen!
Thanks for the reference, I'll check it out
More precisely, I'm using LLaMA Factory and I'm running its train scripts as-is, like python train.py ..., without editing them. Therefore I can't create a ClearML Task inside that process to record the experiment to. Of course I can manually add all the parameters, metrics and artifacts afterwards, but ideally I'd like to have real-time logs of my LLaMA-Factory experiment in ClearML. The package has integrations wit...
Has anyone done something similar? How did you manage to stream real-time data about the experiment to ClearML?
Thank you! I'll try it out and let you know the result
suggest overwriting them locally?
Yeah, that might be an option but it doesn't have enough flexibility for all my scenarios. E.g. I might need to have different N-numbers for the local and remote (ClearML) storage.
It’s a self-hosted one. Its address is s3.kontur.host, port 443
clearml 1.3.2
boto3==1.22.7
botocore==1.25.7
I didn’t deploy the server myself but I verified that it works with s3cmd
I assume you have actual values for key and secret in:
That’s right, I use the same values which work for that bucket with s3cmd
Do you mean like an example for minio?
Yeah, but with the output_uri in task initialisation as well. Am I right that in that case it would be like this? output_uri='s3://my-minio-host:9000/bucket_name'
Tried it, but the outcome is still the same: artifacts deleted using the task._delete_artifacts() function reappear on subsequent calls to task.upload_artifact() with new artifacts
the secure flag is false
I played with this setting as well - didn’t make it work
Are you saying we should expose raise_on_errors in the _delete_artifacts() function itself?
That'd be a great solution, thanks! I'll create a PR shortly
I was just wondering if there’s some valid example of a clearml.conf containing the correct on-premises s3 settings so that I could use them as a basis?
BTW, is it correct to set the files_server in the api section? files_server: "s3://s3.kontur.host:443/srs-clearml"
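For anyone who lands here later, this is the shape of clearml.conf I converged on for an on-premises S3-compatible store. The host, bucket and credential values are placeholders for my setup, and the key layout is what I understand the sdk.aws.s3 section to expect — verify it against the docs for your ClearML version:

```
api {
    # route artifacts / debug samples through the S3 bucket instead of the default fileserver
    files_server: "s3://s3.kontur.host:443/srs-clearml"
}
sdk {
    aws {
        s3 {
            credentials: [
                {
                    # for non-AWS endpoints the host must include the port
                    host: "s3.kontur.host:443"
                    key: "ACCESS_KEY"
                    secret: "SECRET_KEY"
                    multipart: false
                    secure: true  # HTTPS, since we are on port 443
                }
            ]
        }
    }
}
```

Note that these are the RGW-side credentials, not necessarily the same ones you use elsewhere (see my earlier message about the boto3 keys not working here).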
Yeah, it holds. I just sent an extract from the config for it to be concise. Here’s the full version
If I set it to False I get another error:
Failed creating storage object Reason: Missing key and secret for S3 storage access ( )
Did that and still have the same error:
Failed creating storage object Reason: Missing key and secret for S3 storage access ( )
Hi, Erez!
Thank you for the example, I checked it out. It really creates two models. But the thing is, these two models have different file names here. In my scenario, however, it's more convenient for me to have the same file name and different directories for the models. In this case, all my models get overwritten by the latest logged one (as in my screenshot above).
Fortunately, if I use upload_artifact() instead (which I eventually go with) I manage to achieve what I want (see the s...
SweetBadger76 Could you please verify that this is what you meant? I'm still not sure whether I'm doing something wrong or everything works as intended and ClearML distinguishes models only by file name.
Unfortunately, the other parameters like tags and comment didn't help to separate the models
Hi, Erez!
Thank you for your answer! I'll see if it solves the problem
Thank you, but although I'm already using the name parameter mentioned in your response in my code, I can see only one model on the task's page
filename = './models/v1/model.ckpt'
torch.save(state_dict, filename)
mv1 = OutputModel(name='model_v1', task=task)
mv1.update_weights(filename, upload_uri=my_uri)
update_model(mynn.multiplier)
state_dict = mynn.state_dict()
filename = './models/v2/model.ckpt'
torch.save(state_dict, filename)
mv2 = OutputModel(name='model_v2', task=task)
mv2.update_weights(filename, upload_uri=my_uri)
Well. what’s for sure is that I have the required permissions to write to the bucket, as I manage to upload files into it through s3cmd
and boto3
Hi, Jake!
Thanks for your response! I just managed to solve the problem by running my train CLI command in a subprocess and creating a thread that captures the stdout of this subprocess and sends it to a ClearML Task. The solution doesn't even seem as ugly as I was afraid it would be 😀
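In case it helps anyone, here is a minimal sketch of that pattern. The names are placeholders: in my real code the report_line callback is something like task.get_logger().report_text on an existing ClearML Task, which I've left out here so the sketch runs standalone.

```python
import subprocess
import sys
import threading


def run_and_capture(cmd, report_line):
    """Run `cmd` as a subprocess, forwarding each stdout line to `report_line`.

    `report_line` is any callable taking a string; with a real ClearML Task
    you would pass something like task.get_logger().report_text instead.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so nothing is lost
        text=True,
    )

    def pump():
        # iterating over the pipe yields lines as the training script prints them
        for line in proc.stdout:
            report_line(line.rstrip("\n"))

    t = threading.Thread(target=pump, daemon=True)
    t.start()
    proc.wait()
    t.join()
    return proc.returncode
```

Running the training command through run_and_capture([sys.executable, "train.py", ...], report_line) streams the script's output into the experiment console in near real time, without touching the train script itself.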
With this variant of clearml.conf I’m now getting a new error:
ERROR - Exception encountered while uploading Failed uploading object s3.kontur.host:443/srs-clearml/SpeechLab/ASR/data_logging/test1.1be56a53647646208ffd665908056d49/artifacts/data/valset_2021_02_01_sb_manifest_true_micro.json (405): <?xml version="1.0" encoding="UTF-8"?><Error><Code>MethodNotAllowed</Code><RequestId>tx00000000000000000fc69-0062781afb-eba8e9-default</RequestId><HostId>eba8e9-default-default</HostId></Error>