Reputation
Badges 1
25 × Eureka!I mean to reduce the API calls without reducing the scalars that are logged, e.g. by sending less frequent batched updates.
Understood,
In my current trials I am using up the API calls very quickly though.
Why would that happen?
The logging is already batched (meaning 1API for a bunch of stuff)
Could it be lots of console lines?
BTW you can set the flush period to 30 sec, which would automatically collectt and batch API calls
https://github.com/allegroai/clearml/blob/25df5efe7...
Maybe that's the issue :
https://github.com/googleapis/python-storage/issues/74#issuecomment-602487082
The file is never touched, nowhere in the process that file is deleted.
it should never have gotten there, this is not the git repo folder, it one level above...
Hi MortifiedCrow63
I finally got GS credentials, there is something weird going on. I can verify the issue, with model upload I get timeout error while upload_artifacts just works.
Just updating here that we are looking into it.
This looks exactly like the timeout you are getting.
I'm just not sure what's the diff between the Model autoupload and the manual upload.
Yes, I think we just found out it breaks clearml π
could you test with the latest stable, just in case ?
(I'll make sure we have an RC that supports the hydra dev version)
Hi MortifiedCrow63
Sorry getting GS credentials is taking longer than expected π
Nonetheless it should not be an issue (model upload is essentially using the same StorageManager internally)
Hi AdventurousWalrus90
Thank you for the kind words! π
/home/usr_338436_ulta_com/.clearml/venvs-builds/3.7/.gitignore
so this is the error on the agent ?
Just curious about the timeout, was it configured by clearML or the GCS? Can we customize the timeout?
I'm assuming this is GCS, at the end the actual upload is done GCS python package.
Maybe there is an env variable ... Let me google it
Internally we use blob.upload_from_file it has a default 60sec timeout on the connection (I'm assuming the upload could take longer).
Glad to hear!
(yeah @<1603198134261911552:profile|ColossalReindeer77> I'm with you the override is not intuitive, I'll pass the info to the technical writers, hopefully they can find a way to make it easier to understand)
Thatβs the question i want to raise too,
No file size limit
Let me try to run it myself
to enable access to the s3 bucket. In this case I wonder how clearml sdk gets access to the s3 bucket if it relies on secret access key and access key id.
Right, basically someone needs to configure the "regular" environment variables for boto to use the IAM role, clearml will basically uses boto, so it should be transparent. does that make sense ? How do you spin the job on the k8s cluster and how do you configure it?
ince these are temp credentials awe need to use the sessi...
In order to clone the Task it needs to complete sync, which implies closing. I guess the use case for execute remotely while still running was not considered. How / why is this your workflow? Specifically how does Jupyter get into the picture?
GiganticTurtle0 you mean the repo for the function itself ?
the default assumes the function is "standalone", you can specify a repo with:@PipelineDecorator.component(..., repo='.')
will take the current folder's repo (i.e. the local one)
you can also specify repo url/commit etc (repo=' https://github/user/repo/repo.git ' ....)
See here:
https://github.com/allegroai/clearml/blob/dd3d4cec948c9f6583a0b69b05043fd60d8c103a/clearml/automation/controller.py#L1931
Hi DepressedChimpanzee34
I think main issue here is slow response time from the API server, I "think" you can increase the number of API server processes, but considering the 16GB, I'm not sure you have the headroom.
At peak usage, how much free RAM so you have on the machine ?
Hi @<1523702307240284160:profile|TeenyBeetle18>
and url of the model refers to local file, no to the remote storage.
Do you mean that in the Model tab when you look into the model details the URL points to a local location (e.g. file:///mnt/something/model) ?
And your goal is to get a copy of that model (file) from your code, is that correct ?
Hi @<1523702786867335168:profile|AdventurousButterfly15>
I am running cross_validation, training a bunch of models in a loop like this:
Use the wildcard or disable all together:
task = Task.init(..., auto_connect_frameworks={"joblib": False})
You can also do
task = Task.init(..., auto_connect_frameworks={"joblib": ["realmodelonly.pkl", ]})
creating a dataset with parents worked very well and produced great visuals on the UI!
woot woot!
I tried the squash solution, however this somehow caused a download of all the datasets into my
so this actually works, kind or like git squash, bottom line it will repackage the data from all the different versions into one new version. This means downloading the data from all squashed versions, then repackaging it into a single new version. Make sense ?
GreasyPenguin14 thank you! that will make our life a lot easier π
ProudMosquito87 Just a few pointers on how we convert the TB histograms to awesome (but less accurate) 3D surfaces.
First I have to admit, I almost never use these histograms, maybe to detect a plateau of if something goes really wrong...
The 3D surface is basically grouping all the histograms and then bucketing them (I think the default is 50 buckets) so that you get a general feel of what's going on, not necessary a detailed view. Bottom line, you are correct, the TB is the source of truth...
- Yes the challenge is mostly around defining the interface. Regarding packaging, I'm thinking a similar approach to the pipeline decorator, wdyt?
- Clearml agents will be running on k8s, but the main caveat is that I cannot think of a way to help with the deployment, at the end it will be kubectl that users will have to call in order to spin the containers with the agents, maybe a simple CLI to do that for you?
DefeatedCrab47 no idea, but you are more then welcome to join the thread here, and point it out:
https://github.com/PyTorchLightning/pytorch-lightning-bolts/issues/249
Let say I donβt have the data on my local machine but only S3 bucket.
You can still register it, but make sure you do not delete it from the S3 bucket because it will keep a link to it
Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')': /
what did you put in output_uri ?
This only talks about bugs reporting and enhancement suggestions
I'll make sure this is fixed π
I don't see any requests
This points to configuration, specifically maybe it is directed to a different server?!