I think you are correct: the first time you spin up the server it is not possible (I mean you need it up to get the access/secret, and only then can you insert them into the helm values) ... 😞
LOL 🙂
Make sure that when you train the model or create it manually you set the default "output_uri"
task = Task.init(..., output_uri=True)
or
task = Task.init(..., output_uri="s3://...")
GaudyPig83
I think there is some mismatch between the code creating the pipeline and the actual Task?! Could that somehow be the case? "relaunch_on_instance_failure" is a missing argument somehow
Can you try to launch the entire Pipeline with the latest RC?
pip3 install clearml==1.7.3rc0
I see now.
Let's assume you know which snapshot that was:
` prev_task = Task.get_task(task_id='the_first_training_task_id')
# get the second from last checkpoint
prev_task.models['output'][-2].url
prev_scalars = prev_task.get_reported_scalars()
new_task = Task.init('example', 'new task')
logger = new_task.get_logger()
# do some for loop and report the prev_scalars with logger.report_scalar
new_task.flush(wait_for_uploads=True)
new_task.set_initial_iteration(22000)
# start the training `
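The "for loop" step above could look roughly like the sketch below. It assumes `get_reported_scalars()` returns a nested dict shaped like `{title: {series: {"x": [...], "y": [...]}}}`, and that `report` is a callable taking `title`, `series`, `value` and `iteration` keyword arguments (for example `logger.report_scalar`). `replay_scalars` is a hypothetical helper name, not part of the SDK.

```python
def replay_scalars(prev_scalars, report):
    """Re-report previously logged scalars through `report`.

    Assumes prev_scalars looks like {title: {series: {"x": [...], "y": [...]}}}.
    Returns the number of points that were reported.
    """
    count = 0
    for title, series_dict in prev_scalars.items():
        for series, points in series_dict.items():
            # "x" holds the iterations, "y" the matching values (assumed layout)
            for iteration, value in zip(points["x"], points["y"]):
                report(title=title, series=series, value=value, iteration=int(iteration))
                count += 1
    return count
```

With the snippet above this would be called as `replay_scalars(prev_scalars, logger.report_scalar)` before `new_task.flush(...)`.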
I start the TaskScheduler, register a task, and stop the scheduler. How do I restart the TaskScheduler in a way that re-registers the tasks?
if it's aborted, just re-enqueue it?
(it serializes itself and stores its state on the Task object, so when re-launched it will deserialize from the last state)
I think you have it on the workers and queues page: when you click on the worker you have its details
HandsomeCrow5 I see, my bad.
BTW: Did you see this one?
https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
And the helper classes here: https://github.com/allegroai/trains/tree/master/trains/automation
HugeArcticwolf77 I think this issue was resolved with the latest version 1.8.0, can you try to rerun the entire pipeline with the latest version?
thought the agent created a new conda env and installed all packages
It does, but I was asking what is written on the original Task (the one created when you executed the code on your laptop, not when the agent was executing it). When the agent executes the Task, it writes back all the packages of the entire venv it created. When the Task is run manually, it lists only the packages you import directly (i.e. from package or import package; it actually analyses the code).
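To illustrate what "analyses the code" could mean in practice, here is a minimal sketch built on Python's standard `ast` module. It is not ClearML's actual implementation, only an assumption about the general idea, and `direct_imports` is a hypothetical helper name.

```python
import ast


def direct_imports(source):
    """Return the sorted top-level package names imported by `source`."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" -> top-level package "a"
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" -> "a"; relative imports (level > 0) are skipped
            packages.add(node.module.split(".")[0])
    return sorted(packages)
```

Packages that are installed in the environment but never imported by the script would not show up in such a list, which matches the difference described above.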
My point...
Hi DisturbedLizard6
the problem maybe in returning None in get_local_model_file()
This tracks, it means that the model file cannot be downloaded for some reason,
when you click on the model in the UI,
what does it say under "MODEL URL:"?
.replace('file://', '', 1)
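As a small illustration of that `.replace` call, assuming the model URL is a plain string that may start with a `file://` scheme (`strip_file_scheme` is a hypothetical helper name):

```python
def strip_file_scheme(url):
    """Drop a leading 'file://' so the result can be used as a local path."""
    if url.startswith("file://"):
        # count=1 removes only the first occurrence, i.e. the scheme prefix
        return url.replace("file://", "", 1)
    return url
```

Remote URLs such as `s3://...` are returned unchanged.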
I have mounted my S3 bucket at the location /opt/clearml/data/fileserver/ but I can see my data is not being stored in S3; it's being stored in EBS. How so?
I'm assuming the mount was not successful
What you should see is a link to the files server inside clearml, and actual files in your S3 bucket
Hi GrumpySeahorse51
Could you provide the full stack log?
this error seems to originate from psutil (which is used) but it lacks the clearml-session context
PompousBeetle71, the reason I'm asking is that the warning you see is due to the fact it cannot detect the filename you are saving your model to ... I'm trying to figure out how that actually happened.
BTW: in the next version we will probably remove this warning altogether, but I'm still curious on how to reproduce 🙂
JitteryCoyote63 correct, you could also use Task.create that creates a Task but does not do any automagic.
I also saw the PR for set_parent, will be merged shortly 🙂 thanks!
Now I see, the scenario is similar to the HyperParameter scenario , see the TrainsJob https://github.com/allegroai/trains/blob/master/trains/automation/job.py
I still don't see why you would change the type of the cloned Task, I'm assuming the original Task had the correct type, no?
BTW: what's the use case? Why do you need to open two Tasks in the same code/script ?
GiganticTurtle0 BTW, this mock example worked out of the box (python 3.6 on Ubuntu):
` from typing import Any, Dict, List, Tuple, Union

from clearml import Task
from dask.distributed import Client, LocalCluster


def start_dask_client(
    n_workers: int = None, threads_per_worker: int = None, memory_limit: str = "2Gb"
) -> Client:
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
        memory_limit=memory_limit,
    )
    client = Cli...
That said, the arguments are passed Inside the code executed (i.e. monkey patched into the frameworks). This allows it to log and change All the arguments, including the default ones , and allow you to edit them.
Does that make sense ?
Hi DiminutiveToad80
You mean the pipeline logic? It should autodetect the imports of the logic function (like any Task.init call)
You can however call Task.force_requirements_env_freeze and pass a local requirements.txt
Make sure to call it before creating the Pipeline object
What are you seeing in the Task that was cloned (i.e. the one the HPO created not the original training task)?
by that I mean, configuration section, do you have the Args there ? (seems like the pic you attached, but I just want to make sure)
Also in the train.py file, do you also have Task.init ?
Hi JitteryCoyote63 you can, but obviously you should be careful: they might both try to allocate more GPU memory than the HW actually has.
TRAINS_WORKER_NAME=machine_gpu0A trains-agent daemon --gpus 0 --queue default --detached
TRAINS_WORKER_NAME=machine_gpu0B trains-agent daemon --gpus 0 --queue default --detached
it certainly does not use tensorboard python lib
Hmm, yes I assume this is why the automagic is not working 😞
Does it have a pythonic interface for the metrics?
(with matplotlib 3.2+ I get no warning, let me check with 3.1)
ElegantCoyote26 I don't think Keras logs it anywhere unless you have TB, so nowhere to take the data from...
In short, yes you have to have TB :)
Sorry ScaryLeopard77 I missed the reply,
the tutorial in the readme of clearml-serving repo doesn't mention it though. Where should I set it?
oh dear ... you are right (I think it was there in previous versions)
clearml-serving --help
https://github.com/allegroai/clearml-serving/blob/ce6ec847b1e01c6f5bf35d638e6ceb8148db8a7a/clearml_serving/main.py#L142
This is the equivalent of what is created here in the example:
https://github.com/allegroai/clearml-serving/blob/ce6ec847b...
Hmm MiniatureHawk42 how many files in the zip ?