
Reputation
Badges 1
25 × Eureka!what's the error/reply ?
Can you also share the full log? the numbers seem of (and clearml cannot actually "invent" those numbers they are coming from somewhere...)
Is this a logging
issue, or clearml issue ?
Hi ColossalDeer61 ,
Xxx is the module where my main experiment script resides.
So I think there are two options,
Assuming you have a similar folder structure-main_folder
--package_folder
--script_folder
---script.py
Then if you set the "working directory" in the execution section to "." and the entry point to "script_folder/script.py", then your code could do:from package_folder import ABC
2. After cloning the original experiment, you can edit the "installed packages", and ad...
Task.enqueue will execute immediately, i need execute task to spesific time
Oh I see what you mean, trigger -> scheduled (cron alike) -> Task executed.
Is that correct?
Although I didn't understand why you mentioned
torch
in my case?
Just a guess 🙂 other frameworks do multi-process as well,
I would guess it relates to parallelization of Tasks execution of the
HyperParameterOptimizer
class?
Yes that might be it, it's basically by product of using python "Process" class for multiprocessing. we are working on a fix, not a trivial unfortunately
Hi FancyWhale93 , in your clear.conf configure default output uri, you can specify the file server as default, or any object storage:
https://github.com/allegroai/clearml-agent/blob/9054ea37c2ef9152f8eca18ee4173893784c5f95/docs/clearml.conf#L409
Hi @<1545216070686609408:profile|EnthusiasticCow4>
will ClearML remove the corresponding folders and files on S3?
Yes and it will ask you for credentials as well. I think there is a way to configure it so that the backend has access to it (somehow) but this breaks the "federated" approach
Hi DepressedChimpanzee34 , took me a while but I think there is a solution:
In your docker file, replace:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L5
withentrypoint: /bin/bash command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 4 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"
but I cannot compare between them
I think we noticed it, and this will be fixed in the next server update (again, some plotly.js issue there)
ReassuredTiger98 I guess this is a plotly feature, none the less I think you can shift the Y axis manually (click and drag)
Hi IcySwallow94
Are you deploying the clearml server with the helm chart ?
not really 😞
Why would you want to set it up manually ? makes sense to have it in the cache folder, no?
Hi WearyLeopard29
Yes 🙂 this is exactly how it should work
It just seems frozen at the place where it should be spinning up the tasks within the pipeline
And is there an agent for those ? usually there is one agent for running logic tasks (like pipelines) running with --services-mode
which means multiple Tasks can be executed by the same agent. And other agents for compute Tasks that are a signle Task per agent (but you can run multiple agents on the same machine)
Hi PanickyMoth78
Hmm yes, I think the StorageManager (i.e. the google storage pythonclinet) also needs a json file with the credentials.
Let me check something
EFS get downloaded to the k8 pod local volume?
EFS is an Amazon service that mounts a persistent FS into ec2 instances, I believe they have support for k8s as a service as well, which would make it kind of like a PV only as a service.
Does that make sense ?
Another (minor) issue is that all the packages that are installed using git+https are cloned and installed twice, immediately one after the other
Yes this is so that we can better log the installed package name, not a major issue, but we just fixed a bug with derivative packages from git packages.
https://github.com/allegroai/trains/issues/196
Hi BitterStarfish58
Where are you uploading it to?
Okay this seems correct:
pytorch=1.8.0=py3.7_cuda11.1_cudnn8.0.5_0
I can't seem to find what's the diff between the two.
Give me a second let me check if I can reproduce it somehow.
is some explanation of how functions become pipelines and components.
That is a good point! I'll make sure we mention it in the pipeline section of the docs
This whole experiment with random numbers started as my attempt at verifying that code in clearml.pipeline
Correct 🙂 BTW: this is why we added PipelineDecorator.run_locally()
so it is easier to debug, you can also use PipelineDecorator.debug_pipeline()
to run the entire pipeline in a single process as python ...
It's the same but done from outside, you want the same and "offline" as well right?
OutrageousSheep60
I found the task in the UI -
and in the
UNCOMMITTED CHANGES
execution section there is
No changes logged
This is the issue.
and then run the
session
via docker
clearml-session --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 \ --packages "clearml" "tensorflow>=2.2" "keras" \ --queue MY_QUEUE \ --verbose
Are you running the "cleamrl-session" from your machine? (i.e. not from inside a docker) ?...
Any chance your code needs more than the main script, but it is Not in a git repo? Because the agent supports either single script file, or a git repo with multiple files
clearml_agent: ERROR: Can not run task without repository or literalscript in
script.diff
This is odd ...
OutrageousSheep60 when you launch clearml-session
it tells you the session ID (which is also a Task ID), can you look for it in the UI and check there is something in the repo/uncommitted-changes section ?
That's not possible, right?
That's actually what the "start_locally" does, but the missing part is starting it on another machine without the agent (I mean it totally doable, and if important I can explain how, but this is probably not what you are after)
I really need to have a dummy experiment pre-made and have the agent clone the code, set up the env and run everything?
The agent caches everything, and actually can also just skip installing the env entirely. which would mean ...