
ColossalAnt7 I would do the following:
1. Configure trains-server user/pass by mounting the API server configuration file, as described in the trains-server documentation (an intermediate, temporary step)
2. Start by providing the ML guys with VPN access that lets them reach the trains-server api/web/file ports directly (caveat: the IP/sub-domain issue still needs to be solved)
3. Configure a ConfigMap to do the routing/ingress (this solves the IP/sub-domain issue) and allow the VPN to access the single entrypoint...
I'm already at 300MB of usage with just 15 tasks
Wow, what do you have there? I would try to download the console logs and see what size you are getting; that is the only thing that makes sense here. wdyt?
BTW: to get the detailed size for scalars, maximize the plot (otherwise you are getting "subsampled" data)
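A quick hedged sketch for pulling the console log from code to check its size (assumes Task.get_reported_console_output; the task id is hypothetical):
from clearml import Task

task = Task.get_task(task_id="abc123")  # hypothetical task id
# assumption: returns the stored console log as a list of report strings
reports = task.get_reported_console_output(number_of_reports=10000)
print(sum(len(r) for r in reports), "characters of console log")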
but perhaps it is worth adding to the docs page a hint to avoid using the CLEARML_TASK_ID env variable, perhaps I am not the only one to ever try it
Good idea, any thoughts on where? I cannot find a trivial place to put these things
MysteriousBee56 what do you mean "save Scalars on the machine"? All metrics are sent to the trains server. You can later retrieve them from code, if you need.
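For example, a minimal sketch of pulling the scalars back from the server (the task id is hypothetical):
from clearml import Task

task = Task.get_task(task_id="abc123")  # hypothetical task id
# scalars come back as a nested dict, roughly {metric: {series: {"x": [...], "y": [...]}}}
scalars = task.get_reported_scalars()
print(list(scalars.keys()))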
I see, you can manually do that with add_step, i.e.:
for elem in map:
    pipeline.add_step(..., elem)
or you can do that with full logic:
from clearml import PipelineDecorator

@PipelineDecorator.component(...)
def square_num(num):
    return num**2

@PipelineDecorator.pipeline(...)
def map_flow(nums):
    res = []
    # This will run in parallel (each call launches a step)
    for num in nums:
        res.append(square_num(num))
    # this is where we actually wait for the results
    for r in res:
        print(r)

map_flow([1, 2, 3, 5, 8, 13])
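For the first, add_step-based approach, a minimal hedged sketch of how the loop above could look with PipelineController (project name, template task name, and parameter key are hypothetical placeholders):
from clearml import PipelineController

pipe = PipelineController(name="map-pipeline", project="examples", version="1.0")
# one step per element; assumes a template task "square_num_template"
# already exists in the "examples" project (hypothetical names)
for i, elem in enumerate([1, 2, 3, 5, 8, 13]):
    pipe.add_step(
        name=f"square_{i}",
        base_task_project="examples",
        base_task_name="square_num_template",
        parameter_override={"General/num": elem},
    )
pipe.start()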
TBH ClearML doesn't seem to be picking the model up so I need to do it manually
This is odd; clearml will pick up framework-level serialization, but not just any pickle call
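If you need to register it manually in the meantime, a minimal sketch using OutputModel (the model name and weights file are hypothetical):
from clearml import Task, OutputModel

task = Task.current_task()
# attach a manually serialized model to the task;
# "model.pkl" is a hypothetical local weights file
output_model = OutputModel(task=task, name="my-model")
output_model.update_weights(weights_filename="model.pkl")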
Why do I need an output_uri for the model saving? The dataset API can figure this out on its own
So that it knows where to upload it. If you are setting it to True, this will be the default files server; you can also set it to a shared file system, S3, GCP storage, etc.
If no value is passed, it will just log th...
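For example, a quick sketch of the two common forms (project/task names and the bucket path are hypothetical):
from clearml import Task

# upload saved models to the default files server
task = Task.init(project_name="examples", task_name="train", output_uri=True)

# or point it at object storage / a shared file system instead:
# task = Task.init(project_name="examples", task_name="train",
#                  output_uri="s3://my-bucket/models")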
My question is what happens if I launch multiple doit commands in parallel that create new Tasks.
Should work out of the box.
I would like to confirm that current_task ...
Correct.
Thank you @<1523720500038078464:profile|MotionlessSeagull22> always great to hear 🙂
btw, if you feel like sharing your thoughts with us, consider filling out our survey; it should not take more than 5min
SmallBluewhale13
And the Task.init registers 0.17.2, even though it prints (while running the same code from the same venv) 0.17.2?
Maybe we should do that automatically? wdyt?
CloudyHamster42 you mean that when you set sdk.metrics.tensorboard_single_series_per_graph to True and rerun the experiment, you are still getting multiple series on the same graph?
What's your Trains version?
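For reference, that flag lives under the sdk.metrics section of trains.conf (clearml.conf in newer versions); a minimal sketch of the relevant snippet:
sdk {
    metrics {
        # one graph per metric, all series overlaid on it
        tensorboard_single_series_per_graph: true
    }
}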
Hi DilapidatedDucks58
apologies, this thread slipped away.
I double checked, the server will not allow you to overwrite it (meaning fixing it would require releasing a new server version, which usually takes longer)
That said, maybe we can pass an argument to Task.init so it ignores it? wdyt?
Hi DilapidatedDucks58
is this something new?
usually copy-pasting directly from the UI parses everything, no?
Hi ConvolutedSealion94
You can archive / delete the SERVING-CONTROL-PLANE Task from the DevOps project in the UI.
Do notice you will need to make sure clearml-serving is updated with a new session ID, or remove it (i.e. take down the pods / docker-compose)
Make sense?
Were you able to interact with the service that was spun up? (how was it spun up?)
DilapidatedDucks58 use a full link, without the package name, i.e. git+...
delete logged images and texts though
logged images are also stored there?
Hi PanickyAnt52
hi, is there a way to get back the pipeline object when given a pipeline id?
Yes, basically this is a specific type of Task; anything you stored on it can be accessed via the Task object, i.e.
pipeline_task = Task.get_task(pipeline_id)
I'm curious, how would you use it?
BTW: since a pipeline is also a Task, you can have a pipeline launch a step that is a pipeline of its own
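A minimal sketch of pulling a pipeline back by id and inspecting it (the id is hypothetical):
from clearml import Task

# a pipeline is stored as a regular Task, so the Task API applies
pipeline_task = Task.get_task(task_id="abc123")  # hypothetical pipeline id
print(pipeline_task.name)
print(pipeline_task.get_parameters())    # the pipeline's arguments
print(list(pipeline_task.artifacts))     # artifacts stored on the pipeline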
Basically you create the Task and make sure the "Dataset" is attached to it:
task = Task.init(...)
dataset = Dataset.create(task=task)
dataset.add_files(...)
This will make sure the code is attached to the Dataset
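A fuller sketch of the same idea, assuming a recent clearml version where use_current_task=True is what attaches the dataset to the calling task (names and paths are hypothetical):
from clearml import Task, Dataset

task = Task.init(project_name="examples", task_name="build-dataset")
dataset = Dataset.create(
    dataset_name="my-dataset",       # hypothetical name
    dataset_project="examples",
    use_current_task=True,           # assumed: attaches the dataset to the task above
)
dataset.add_files(path="./data")     # hypothetical local folder
dataset.upload()
dataset.finalize()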
hmmm, somehow I have a bad feeling about it... Could you check the log? It should say something like "Collecting torch==1.6.0.dev20200421+cu101 from https://"
It should be right at the top of the installation. What do you have there?
Hi ElegantCoyote26
If there is, it will have to use docker mode, but I do not think this is actually possible because it is not a feature of docker. It is possible to do on k8s, but that's a different level of integration 🙂
EDIT:
FYI we do support k8s integration
Hi @<1560798754280312832:profile|AntsyPenguin90>
The image itself is uploaded in a background process; flush just triggers the start of that process.
Could it be that it is showing a few seconds after?
Thanks DefeatedOstrich93
Let me check if I can reproduce it.
well I do not think you set your pytorch lightning to use cuda:
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/code/.venv/lib/python3.9/site-packages/lightning/pytorch/trainer/setup.py:176: PossibleUserWarning: GPU available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='gpu', devices=1)`.
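Following the warning's own suggestion, a minimal sketch (other Trainer arguments omitted):
from lightning.pytorch import Trainer

# explicitly request the GPU, as the warning suggests
trainer = Trainer(accelerator="gpu", devices=1)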
Hi ZealousSeal58
What's the clearml version you are using ?
If there was a "debug mode" for viewing the stack trace before the crash that would've been most helpful...
import traceback
traceback.print_stack()
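If the goal is the trace at the point of the crash itself, a small standard-library sketch (run_training is a hypothetical stand-in for the failing call):
import traceback

try:
    run_training()          # hypothetical function that crashes
except Exception:
    traceback.print_exc()   # prints the full stack trace of the exception
    raise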
Hi GreasyLeopard35
I try to resume a stopped or aborted parameter optimization experiment,
How are you continuing the HPO? Are you running everything locally? Is this with an agent? Are you seeing the '[0, 0]' value in the configuration when launching the HPO, or when continuing it?
task.set_script(working_dir=dir, entry_point="my_script.py")
Why do you have this part? Isn't it the same code? The script entry point is auto-detected.
... or when I run my_script.py locally (in order to create and enqueue the task)?
The latter, when the script is running locally
So something like
os.path.join(os.path.dirname(__file__), "requirements.txt")
is the right way?
Sure this will work 🙂
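For context, a hedged sketch of how that path is typically used, assuming the intent is to force a specific requirements file via Task.add_requirements (which, as far as I know, also accepts a path to a requirements.txt); it must be called before Task.init:
import os
from clearml import Task

# must run before Task.init; assumption: add_requirements accepts
# a path to a requirements.txt file as well as a package name
Task.add_requirements(os.path.join(os.path.dirname(__file__), "requirements.txt"))
task = Task.init(project_name="examples", task_name="my-task")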