PompousParrot44 unfortunately not yet 😞
But the gist is :
MongoDB stores experiment data (i.e. execution parameters, git ref, etc.)
ElasticSearch stores results (i.e. metrics, console logs, debug image links, etc.)
Does that help?
JitteryCoyote63 you mean at runtime, while the agent is installing? I'm not sure I fully understand the use case?!
If I were to push the private package to, say, Artifactory, is it possible to use that to do the install?
Yes that's the recommended way 🙂
You add the private repo here, for the agent to use:
https://github.com/allegroai/clearml-agent/blob/e93384b99bdfd72a54cf2b68b3991b145b504b79/docs/clearml.conf#L65
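For example, a sketch of the relevant clearml.conf section (the Artifactory URL here is a placeholder, replace it with your own repo):
agent {
    package_manager {
        # extra PyPI-compatible index the agent passes to pip, e.g. your private Artifactory repo
        extra_index_url: ["https://your-artifactory.example.com/artifactory/api/pypi/my-repo/simple"]
    }
}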
I didn't realise that pickling is what triggers clearml to pick it up.
No, pickling is the only thing that will Not trigger clearml (it is just too generic to automagically log)
BTW: what's the use case? Why do you need to open two Tasks in the same code/script ?
Can't say I have noticed that. Is this a delay on the send, which for some reason is correlated with the epochs? What was the case with 0.17.5?
Your code should have worked, i.e. you should see the 'model.h5' in the artifacts tab. What do you have there?
It should look something like this one:
https://demoapp.trains.allegro.ai/projects/531785e122644ca5b85b2e19b0321def/experiments/e185cf31b2634e95abc7f9fbdef60e0f/artifacts/output-model
BTW:
To manually register any model:
from trains import Task, OutputModel

task = Task.init('examples', 'my model')
OutputModel().update_weights('my_best_model.h5')
I want that last python program to be executed with the environment that was created by the agent for this specific task
Well basically they all inherit the Python environment that points to the venv they started from, so at least in theory it should be transparent when the agent is spinning up the initial process.
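To illustrate, a minimal sketch ("final_step.py" is a hypothetical script name) of launching the extra program with the same interpreter the task runs under, so it inherits the venv the agent created:
import subprocess
import sys

# sys.executable is the python inside the venv the agent built for this task,
# so the child process sees the exact same environment and packages
subprocess.check_call([sys.executable, "final_step.py"])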
I eventually found a different way of achieving what I needed
Now I'm curious, what did you end up doing ?
So this is an additional config file with enterprise?
It's an extension of the "clearml.conf" capabilities
Is this new config file deployable via helm charts?
Yes, you can also set it company/user wide using the clearml Vault feature (again enterprise, sorry 😞 )
Hi UnsightlySeagull42
Basically you can get the agent to always add additional arguments for the docker run, such as -v for mounting:
https://github.com/allegroai/clearml-agent/blob/948fc4c6ce1ecf33a74619ad570d69b8188f6db9/docs/clearml.conf#L133
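For example, a sketch of the relevant clearml.conf section (the host/container paths are placeholders):
agent {
    # arguments appended to every "docker run" the agent launches, e.g. an extra volume mount
    extra_docker_arguments: ["-v", "/host/data:/data"]
}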
You can try just pulling the "metric" section of the Task, but I cannot imagine the network bandwidth is the issue?
Could it be load on the clearml-server (i.e. it needs to handle lots of requests)?
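A quick sketch of reading just the scalar summary from a single task (the task ID is a placeholder):
from clearml import Task

t = Task.get_task(task_id='1234abcd')  # placeholder ID
# roughly {'title': {'series': {'last': ..., 'min': ..., 'max': ...}}}
metrics = t.get_last_scalar_metrics()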
or creating a dedicated function I would suggest also including the actual sampled point in the HP space.
Could you expand?
This would be the most common use case, and essentially the reason for running the HPO: understanding the sensitivity of metrics with respect to the hyper-parameters.
Does this relate to:
https://github.com/allegroai/clearml/issues/430
"manually" filtering the keys I've put in for the HP space. I find it a bit strange that they are not saved as part of t...
Sounds good to me. DepressedChimpanzee34 any chance you can add a github feature request, so we do not forget to add it?
I pull all the parameters, and then manually filter on the HP keys (manually=I have to plug them in, they are not part of optimizer object)
So is this an improvement to the optimizer._get_child_tasks_ids(...) interface?
e.g. return a structure like:
[
    {
        'id': task_id,
        'hp1': value, 'hp2': value, 'hp3': value,
        'objective': dict(title='title', series='series', value=42),
    },
]
Hmm, check if this one works:
optimizer._get_child_tasks_ids(
    parent_task_id=optimizer._job_parent_id or optimizer._base_task_id,
    order_by=optimizer._objective_metric._get_last_metrics_encode_field(),
    additional_filters={'page_size': int(top_k), 'page': 0}
)
If it does, let's PR it as a dedicated function
DepressedChimpanzee34 something along the lines of:
from multiprocessing.pool import ThreadPool

p = ThreadPool()

def get_last_metric(t):
    return t.get_last_scalar_metrics()

task_scalars_list = p.map(get_last_metric, top_tasks)
p.close()
We parallelized the network connections, as I'm assuming the delay is in the fetching.
You mean this?
ids = [t.id for t in top_tasks]
Hi RipeGoose2
Just to clarify, the issue with the HTML stuck in the cache is a UI thing: basically the webapp needs to tell the browser not to cache the artifacts; it has nothing to do with how the artifacts are created.
Regardless, we love improvements, so feel free to mess around with the code and PR once you get something useful 😉
Specifically this is where the html conversion happens
https://github.com/allegroai/clearml/blob/9d108d855f784e1fe7f5691d3b7bf3be64576218/clearml/backend_in...
Hi SmilingFrog76
Great question, sadly multi-node is never simple 🙂
Let's start with the basics. Let's assume one worker is available and the other is not, what would you want to happen? (p.s. I'm not aware of flexible multi-node training frameworks, i.e. a framework that can detect another node is available and connect with it mid-training; that said, it might exist 🙂 )
Is it being used to ssh to the instance?
It is used for the SSH client, so it "knows" the SSH server (does that make sense)?
No worries, let's assume we have:
base_params = dict(
    field1=dict(param1=123, param2='text'),
    field2=dict(param1=123, param2='text'),
    ...
)
Now let's just connect field1:
task.connect(base_params['field1'], name='field1')
That's it 🙂
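As a follow-up (using the same base_params sketch from above), any other section can be connected the same way:
task.connect(base_params['field2'], name='field2')  # each named dict gets its own section/prefix in the UI, e.g. field2/param1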
Hi SourSwallow36
- The same docker image is used for all three jobs, just because it is easier to manage and faster to download. The full code is available on the trains-server GitHub. If you want to spin up the containers manually, check the docker-compose.yml in the main repo; it has all the commands there.
- Fork the trains-server, commit the changes and don't forget to PR them ;)
- Elasticsearch is a database; we use it to log all the experiment outputs, console logs, metrics, etc. This...
You should have the metric :monitor:gpu with the variant gpu_0_utilization
Since I see you have none of those, that points to a missing GPU driver ...
Could that be?
Hi @<1578555761724755968:profile|GrievingKoala83>
mount s3 as a cache folder
I'm not sure that would be fast enough for cache ...
How to override /root/.cache/pip path?
in your clearml.conf file:
None
then set it to your PV
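If it helps, a sketch of what that could look like (assuming the agent runs in docker mode; the path is a placeholder for your PV mount):
agent {
    # host folder mounted into the container as the pip cache (i.e. /root/.cache/pip inside the container)
    docker_pip_cache = /mnt/my-pv/pip-cache
}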
Weird issue, I'll make sure we fix compatibility with python 3.9
it is just a local copy, so you can rerun and reconfigure
import os
from trains import Task

os.environ['TRAINS_PROC_MASTER_ID'] = '1:da0606f2e6fb40f692f5c885f807902a'
os.environ['OMPI_COMM_WORLD_NODE_RANK'] = '1'
task = Task.init(project_name="examples", task_name="Manual reporting")
print(type(task))
Should be: <class 'trains.task.Task'>
Actually it cannot be deferred; long story short, when the agent is running the same code, we have to verify and pass arguments at import time. I have to wonder: I'm expecting the env variables to be preset (i.e. previously set for the entire environment), so how come they are manually set inside the code (and wouldn't that break when running with an agent)?
Hey GiganticTurtle0 ,
So basically the issue is that the pipeline function (prediction_service) is getting a dict as input, while it is expecting to get basic types... if you were to do the following, it would have worked as expected:
prediction_service(**default_config)
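To illustrate, a sketch with made-up names (assuming a signature like prediction_service(start, end)):
default_config = {'start': 0, 'end': 10}

# prediction_service(default_config)  -> the component receives a single dict argument
prediction_service(**default_config)  # -> the component receives basic-typed keyword arguments it can log individually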
I will make sure we flatten any dictionary so that we end up with config/start, instead of a serialized version of the dict.
wdyt?