
SmugOx94 Yes, we just introduced it 🙂 with 0.16.3
Discussion was here (I'll make sure to update the issue that the version is out)
https://github.com/allegroai/trains/issues/222
In your trains.conf
add the following line:
sdk.development.store_code_diff_from_remote = true
It will store the diff from the remote HEAD instead of the local one.
I mean, manually you can get the results and rescale, but not through the UI
DepressedChimpanzee34
so parsing back is done via a YAML reader:
https://github.com/allegroai/clearml/blob/49fcbd7bbf3236f4175cdff29fa951847b0923cc/clearml/backend_interface/task/args.py#L506
We could add an extra check here, testing for \ in the string; that should solve it and be backwards compatible (I think), see the sketch below the links
https://github.com/allegroai/clearml/blob/49fcbd7bbf3236f4175cdff29fa951847b0923cc/clearml/backend_interface/task/task.py#L935
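A minimal sketch of the suggested check, assuming the value arrives as a string; the helper name and signature are hypothetical, not the actual clearml code:
```python
import yaml


def cast_value(value: str):
    # keep strings containing a backslash as-is, so escape sequences are not
    # mangled by the YAML round-trip (backwards compatible for plain strings)
    if "\\" in value:
        return value
    try:
        return yaml.safe_load(value)
    except yaml.YAMLError:
        return value
```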
Hi DeliciousKoala34
Happened when cloning and running a task on an agent on a different machine.
Sounds like a torch internal issue, can you send the full log of the remote Task?
If this is the case, why not have the stream process call the REST API, then move forward with the result? This way it scales out of the box. The main "conceptual" difference is that the REST API is used internally, and the upside is that the event stream processing becomes part of the application layer, not tied to the compute cost of the model. wdyt?
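A minimal sketch of the idea; the serving endpoint URL and the payload shape are assumptions, not an actual API:
```python
import requests


def process_event(event: dict) -> dict:
    # the stream processor calls the serving REST API for a prediction
    response = requests.post(
        "http://clearml-serving:8080/serve/my_model",  # hypothetical endpoint
        json=event,
        timeout=5,
    )
    response.raise_for_status()
    prediction = response.json()
    # then continues the stream processing with the model's result
    return {"event": event, "prediction": prediction}
```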
Test it on your local setup (I would hate to push a broken fix)
Is that possible?
Like get the tasks that use the metrics API the most?
I think you have it on the workers and queues page; when you click on the worker you can see its details
I think the main issue is that for some reason the running container changed one of the files inside the temp folder. The host machine is then "stuck" with a file that the root user owned/changed, and now it cannot reuse / delete the temp folder.
I think the fix is to make sure the container deletes the temp folder when it is done
I think I found something, let me dig deeper 🙂
VexedCat68 actually a few users already suggested we auto log the dataset ID used as an additional configuration section, wdyt?
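Until that lands, something like this sketch would record the dataset ID manually as its own configuration section (project, task, and dataset names are placeholders):
```python
from clearml import Task, Dataset

task = Task.init(project_name="examples", task_name="train")
dataset = Dataset.get(dataset_name="my_dataset")  # placeholder dataset name

# record the dataset ID as a dedicated configuration section,
# roughly what the suggested auto-logging would do
task.connect_configuration({"dataset_id": dataset.id}, name="Datasets")
```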
Understood, then I would use Task.execute_remotely()
Basically:
task = Task.init(...)
# configure some stuff
task.execute_remotely(queue_name="queue_name_here")
# code from this point on is executed on the remote machine only
This will both automatically log your code / repo with Task.init, and the call to task.execute_remotely will stop the local process (on your machine that runs the hydra sweep) and continue on the remote machine.
This will allow you to both use a Hydra sweep & schedule / run on remote (see the sketch below) ...
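A sketch under these assumptions: a standard Hydra app, with the project name, queue name, and train function as placeholders rather than part of the original example:
```python
import hydra
from omegaconf import DictConfig
from clearml import Task


def train(cfg: DictConfig) -> None:
    # placeholder for your actual training entry point
    print("training with:", cfg)


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    task = Task.init(project_name="examples", task_name="hydra-sweep-step")
    # stops the local run and re-launches this exact code on an agent pulling from the queue
    task.execute_remotely(queue_name="default")
    # from here on, the code runs on the remote machine
    train(cfg)


if __name__ == "__main__":
    main()
```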
I tried to export them to json and they don't take more than 50KB each, but maybe they take more memory internally?
Ballpark should be the same.
I'm already at 300MB of usage with just 15 tasks
Maybe it was not updated yet? meaning you had more and deleted? (I think this is updated asynchronously, with max of 24h)
GreasyPenguin66 you can pass AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY
as the default Azure access/secret 🙂
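For example, a sketch setting them from Python before any storage access (values are placeholders; you can just as well export them in the shell):
```python
import os

# default Azure credentials picked up via environment variables
os.environ["AZURE_STORAGE_ACCOUNT"] = "my_storage_account"
os.environ["AZURE_STORAGE_KEY"] = "my_storage_key"
```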
Okay let's see if I can reproduce it:
new conda env, py==3.8
install clearml==0.17.5rc5 matplotlib==3.3.4 numpy==1.20.1 seaborn==0.11.1
clone the repo
run `python examples/frameworks/matplotlib/matplotlib_example.py`
Right?
(you can find it in the pipeline component page)
that does happen when you create a normal local task, that's why I was confused
The parts that are not passed in both cases are the configurations from the conf file; only the environment is passed (e.g. git, python packages etc.). For example, if you have storage credentials in your conf file, they are not passed to a remote agent; instead, the credentials from the remote agent are used when it runs the task.
make sense?
Verified.
BattyLizard6 can you open a github issue? I want to make sure this issue is addressed 🙂
Hi UpsetBlackbird87
This is an Optuna decision on how many trials to run concurrently.
You limited it to 100, but remember Optuna runs a Bayesian optimization process, where it decides on the best set of arguments based on the performance of the previous sets; this means it will first try X trials, then decide on the next batch.
That said, you can add a pruner to Optuna specifying how it should start (sketch below the link):
https://optuna.readthedocs.io/en/v1.4.0/reference/pruners.html#optuna.pruners.Median...
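A minimal plain-Optuna sketch of adding a MedianPruner; the objective here is a toy placeholder, and n_startup_trials controls how many trials run before pruning kicks in:
```python
import optuna


def objective(trial: optuna.Trial) -> float:
    # toy objective, stands in for your real training/evaluation
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2


study = optuna.create_study(
    direction="minimize",
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
)
study.optimize(objective, n_trials=100)
```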
but when I run the same task again it does not map the keys..
SparklingElephant70 what do you mean by "map the keys" ?
based on this one:
https://stackoverflow.com/questions/31436407/git-ls-remote-returns-fatal-no-remote-configured-to-list-refs-from
I think this is a specific issue of the local git repo configuration, can you verify?
(btw: I tested with git 2.17.1; git ls-remote --get-url will return the remote url, without an error)
Hi @<1536518770577641472:profile|HighElk97>
Is there a way to change the smoothing algorithm?
Just like with TB, this is front-end, not really something you can control ...
That said, you can report a smoothed value (i.e. via python) as an additional series, wdyt?
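A sketch of reporting an exponentially smoothed copy of a metric as a second series (the alpha and the metric values are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="smoothed-series")
logger = task.get_logger()

alpha = 0.6  # smoothing factor (assumption)
smoothed = None
for step, value in enumerate([0.9, 0.7, 0.75, 0.5, 0.55, 0.4]):
    smoothed = value if smoothed is None else alpha * smoothed + (1 - alpha) * value
    # raw and smoothed series end up side by side under the same scalar title
    logger.report_scalar(title="loss", series="raw", value=value, iteration=step)
    logger.report_scalar(title="loss", series="smoothed", value=smoothed, iteration=step)
```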
Correct, the serving Task ID is the clearml-serving session. It is the instance that holds all the information of this specific setup and its models
Hi @<1730396272990359552:profile|CluelessMouse37>
However, the caching doesn't seem to be working correctly. Despite not changing the configuration, the first step runs every time.
How are you creating the cached component?
is this a standalone script or a git repo link?
These parameters are dictionaries of specific configurations (dict of dict) that are the same but might not be taken into account properly by the caching mechanism.
hmm for the component to be cached (or reuse...
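For reference, a sketch of a cached component using the decorator-based pipeline API; the component body and return value are placeholders:
```python
from clearml.automation.controller import PipelineDecorator


@PipelineDecorator.component(cache=True, return_values=["result"])
def preprocess(config: dict):
    # the cache key is derived from the component code and its input arguments,
    # so a dict-of-dicts config must serialize to exactly the same value
    # on every run for the cached result to be reused
    result = {key: value for key, value in config.items()}
    return result
```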
WickedGoat98 nice!!
Can you also pass the login screen (i.e. can you access the api server)?
MelancholyElk85 notice there is the pipeline controller queue (i.e. which agent will run the logic of the pipeline), and the default queue for the pipeline steps (i.e. the actual steps of the pipeline).
The default queue for the pipeline logic itself is services
You can change it: pipeline.start(..., queue='another_q')
Make sense ?
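For context, a minimal sketch of overriding the controller queue (project, name, and queue values are placeholders; the steps would be added separately):
```python
from clearml import PipelineController

pipeline = PipelineController(name="my pipeline", project="examples", version="1.0.0")
# ... pipeline.add_step(...) calls go here ...

# the pipeline logic itself defaults to the "services" queue;
# override it here, while each step keeps its own execution queue
pipeline.start(queue="another_q")
```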