
Thanks MagnificentSeaurchin79 !
Let me check what the status is with this one; could it be the same as this issue?
https://github.com/allegroai/clearml/issues/322
Hi @<1699955693882183680:profile|UpsetSeaturtle37>
What's your clearml-session version? where is the remote machine ?
And yes, if the network connection is bad we have seen this behavior; you can try with --keepalive=true
Notice that these are SSH networking issues, not something to do with the clearml-session layer. The --keepalive option tries to automatically detect these disconnects and make sure the session reconnects for you.
I'm using the default operation mode which uses kubectl run. Should I use templates and specify a service in there to be able to connect to the pods?
Ohh, the default "kubectl run" does not support the "ports-mode"
There's a static number of pods which services are created for…
You got it! 🙂
does the clearml server act as a worker I can serve models on?
The serving is done by one of the clearml-agents.
Basically you spin up an agent, then this agent spins up the model serving engine container (fully managed).
(1) install and run clearml-agent (2) run the clearml-session CLI to configure and spin up the serving engine
Hi @<1566596960691949568:profile|UpsetWalrus59>
you should call it before initializing the Task
Task.ignore_requirements("pywin32")
task = Task.init(...)
Hi ShakyJellyfish91
It seems clearml is using a single connection, which takes a long time to download
Hmm, I found this one:
https://github.com/allegroai/clearml/blob/1cb5dbb276026644ae20fef63d58256cdc887818/clearml/storage/helper.py#L1763
Does max_connections=10
mean 10 concurrent connections ?
This depends on how you spun up the server; basically as long as you configure the clients (i.e. python clients) correctly, there is no issue.
But the auto-generated configuration might be off (in the UI, when you create credentials it tells clearml-init where the server is and which ports to use)
I would actually recommend subdomains if this is possible
https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_config#sub-domain-configuration
wdyt?
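If it helps, here is a rough sketch of pointing the python client at the server from code instead of clearml.conf (the hosts and keys below are placeholders; with the sub-domain setup you drop the explicit ports):
from clearml import Task

# placeholder hosts/credentials - use the values the UI shows when you create credentials
Task.set_credentials(
    api_host="https://api.clearml.example.com",
    web_host="https://app.clearml.example.com",
    files_host="https://files.clearml.example.com",
    key="<access_key>",
    secret="<secret_key>",
)
task = Task.init(project_name="examples", task_name="connectivity check")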
Actually, dumb question: how do I set the setup script for a task?
When you clone/edit the Task in the UI, under Execution / Container you should have it
After you edit it, just push it into the execution with the autoscaler and wait 🙂
Hi SpicyCrab51 ,
Hmm, how exactly is the Dataset opened?
If the Dataset object is alive for 30h it will keep the dataset alive, why isn't it being closed ?
SmarmySeaurchin8
Something like this one:
vector_series = np.random.randint(10, size=10).reshape(2, 5)
logger.report_vector(
    title='vector example',
    series='vector series',
    values=vector_series,
    iteration=0,
    labels=['A', 'B'],
    xaxis='X axis label',
    yaxis='Y axis label',
)
Hi WickedBee96
How can I do that?
clearml-task
https://clear.ml/docs/latest/docs/apps/clearml_task#what-is-clearml-task-for
I only know how to run it in the agent by enqueuing the draft after running it on my local machine, so is there another way?
Or maybe you are looking for task.execute_remotely
https://clear.ml/docs/latest/docs/references/sdk/task#execute_remotely
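Roughly it looks like this (the queue name is a placeholder):
from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")
# everything above runs locally; this call enqueues the task and,
# with exit_process=True, stops the local process
task.execute_remotely(queue_name="default", clone=False, exit_process=True)
# from here on the code runs on the agent that pulled the task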
Hit Ctrl-F5 (reload the page). Do you still get the same error? Is it limited to a specific experiment?
Maybe there is a setting in docker to move the space used to a different location?
Not that I know of...
I can simply increase the storage of the first disk, no problem with that
probably the easiest 🙂
But as you described, it looks like an edge case, so I don't mind 🙂
It will always set its own environment, either with static analysis or with "pip freeze" / "conda freeze"
It needs to log the exact setup that was actually installed.
When you later launch it on a remote machine, it can either use this to recreate the environment (using pip or conda), or you can clear the entire section, where it will fall back to "requirements.txt"
Any reason for specifically using the "environment.yaml" ?
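If you'd rather drive it from code than clear the section in the UI, something along these lines should do it (treat the exact call and signature as my assumption):
from clearml import Task

# use an explicit requirements file instead of the auto-detected packages
Task.force_requirements_env_freeze(force=True, requirements_file="requirements.txt")
task = Task.init(project_name="examples", task_name="explicit requirements")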
@<1562610699555835904:profile|VirtuousHedgehong97>
source_url="s3:...",
This means your data is already on the S3 bucket; it will not "upload" it, it will just register it.
If you want to upload files, they should be local; then when you call upload you can specify the target S3 bucket, and the data will be stored in a unique folder in the bucket
Does that make sense ?
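A minimal sketch of the two flows (bucket names and paths are placeholders):
from clearml import Dataset

# data already on S3: only the links are registered, nothing is uploaded
ds = Dataset.create(dataset_project="examples", dataset_name="registered data")
ds.add_external_files(source_url="s3://my-bucket/my-data/")
ds.finalize()

# local files: uploaded to the target bucket, stored under a unique folder
ds = Dataset.create(dataset_project="examples", dataset_name="uploaded data")
ds.add_files(path="./local_data")
ds.upload(output_url="s3://my-bucket/datasets/")
ds.finalize()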
VexedCat68
. So the checkpoints just added up. I've stopped the training for now. I need to delete all of those checkpoints before I start training again.
Are you uploading the checkpoints manually with artifacts? or is it autologged & uploaded ?
Also, why not reuse and overwrite older checkpoints?
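For example, if you are saving them manually, always writing to the same filename keeps a single rolling checkpoint instead of accumulating them (pytorch here is just my assumption about your setup):
import torch

def save_checkpoint(model, path="checkpoint_last.pt"):
    # overwrite the same file every time so only one checkpoint exists at any point
    torch.save(model.state_dict(), path)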
(If you are running the trains-agent with the exact same command, I think you will get the same worker_id, in which case you will end up with something similar to what you describe)
To solve it, add:
TRAINS_WORKER_NAME="new_unique_name" trains-agent ...
I think we resolve it automatically, but based on your description it looks like we use the same worker name/id multiple times ...
So what you are saying is the workers randomly report on one another's experiments ?
But the git apply failed; the error message is "xxx already exists in working directory" (xxx is the name of the untracked file)
DefeatedOstrich93 what's the clearml-agent version?
Hi ShortElephant92
This isn't an issue if the user is using a Service Account JSON Key,
Are you saying that when you are using GS python sdk directly it works?
For context, the google cloud storage SDK allows authorized user credentials.
ClearML actually uses the google python SDK; the JSON is just a way to pass the credentials to the google SDK. I'm not sure it points to "service account"? Where did that requirement come from?
is it from here ` Service account info was n...
Hmm I think everything is generated inside the c++ library code, and python is just an external interface. That means there is no way to collect the metrics as they are created (i.e. inside the c++ code), which means the only way to collect them is to actively analyze/read the tfrecord created by catboost
Is there a python code that does that (reads the tfrecords it creates) ?
Hi JitteryCoyote63
report_frequency_sec=30. controls how frequently monitoring events are sent to the server; the default is every 30 seconds (you can change the UI display to wall-time to review). You can change it to 180 so it will only send an event every 3 minutes (for example).
sample_frequency_per_sec is the sampling frequency it uses internally; it will then average the results over the course of the report_frequency_sec time window, and send the averaged result on the repo...
Hi SteadySeagull18
What does the intended workflow for making a "pipeline from tasks" look like?
The idea is that if you have existing Tasks in the system and you want to launch them one after the other, with control over their inputs (or outputs), you can do that without writing any custom code.
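Something along these lines (project/task names are placeholders):
from clearml.automation import PipelineController

pipe = PipelineController(name="my pipeline", project="examples", version="1.0.0")
pipe.add_step(name="stage_data", base_task_project="examples", base_task_name="data prep")
pipe.add_step(
    name="stage_train",
    parents=["stage_data"],
    base_task_project="examples",
    base_task_name="train model",
    # feed the first step's task id into the second step's hyperparameters
    parameter_override={"General/dataset_task_id": "${stage_data.id}"},
)
pipe.start(queue="services")  # the controller itself runs on the services queue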
Currently, I have a script which does some Task.create's,
Notice that your script should use Task.init, not Task.create, as Task.create is designed to create additional ...
SmallBluewhale13 in your code, what are you getting when you print the version:
from clearml import __version__
print(__version__)
I'm sorry my bad, this is use_current_task
https://github.com/allegroai/clearml/blob/6d09ff15187197e1f574902352115aa08dc1c28a/clearml/datasets/dataset.py#L663
task = Task.init(...)
dataset = Dataset.create(..., use_current_task=True)
dataset.add_files(...)
GrievingTurkey78 I'm not sure I follow, are you asking how to add additional scalars ?