sudo curl -L "…$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
Hi LudicrousParrot69
A bit of background:
A Task is a job executed in the system (sometimes it is a training experiment, sometimes a controller like the pipeline). Basically every process can be a Task.
Specifically, the pipeline controller itself (i.e. the process running the Bayesian optimization) is a Task in the system (i.e. a job running). What it does (using the HyperParameterOptimizer) is clone previously executed Tasks (e.g. training experiments), change their parameters and moni...
Would that go under
arguments
?
yes 🙂
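For example, a minimal sketch of such an optimizer controller (the base task id, queue and metric names below are placeholders); notice the cloned hyperparameters are referenced by their section, e.g. the script's argparse values live under Args:

from clearml import Task
from clearml.automation import (
    DiscreteParameterRange, HyperParameterOptimizer, RandomSearch, UniformParameterRange,
)

# the controller itself is a Task (i.e. a job running in the system)
task = Task.init(project_name="examples", task_name="HPO controller",
                 task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id="<training_task_id>",  # previously executed Task, cloned for every trial
    hyper_parameters=[
        UniformParameterRange("Args/lr", min_value=1e-4, max_value=1e-1),
        DiscreteParameterRange("Args/batch_size", values=[16, 32, 64]),
    ],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=RandomSearch,  # swap for e.g. the Optuna optimizer for Bayesian/TPE search
    execution_queue="default",
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop()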
Also what is the base path where the git repo is cloned? So if my repo is called myProject.git, what would the full path be?
For example https://github.com/<user>/myProject.git
btw: how come you do not have this field auto-populated from running the code locally or using the clearml-task CLI?
Can I assume that if we have two agents spinning the same experiment, your code will take it from there?
Is this true ?
I think it fails because it tries to install trains twice. Could you remove the trains package and test? I'm also curious, how do you have both installed?!
Hmm... any idea on what's different with this one ?
Hi DilapidatedDucks58 just making sure, is the link the pytorch nightly artifactory? Or is it a direct link to the package? Reason for asking, I was not aware they have a proper artifactory... When the task runs, the trains agent will update the installed packages section with all the packages it actually used. Could you verify you have the correct version?
Regarding the extra files, you are correct, the docker container is reset every run, so they will get lost. What are those files for? Could you add ...
GiganticTurtle0 BTW, this mock example worked out of the box (python 3.6 on Ubuntu):
from typing import Any, Dict, List, Tuple, Union
from clearml import Task
from dask.distributed import Client, LocalCluster

def start_dask_client(
    n_workers: int = None, threads_per_worker: int = None, memory_limit: str = "2Gb"
) -> Client:
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
        memory_limit=memory_limit,
    )
    client = Client(cluster)
    ...
What happened in the server configuration that all of a sudden you have zero ports open?
WackyRabbit7 you can configure the AWS autoscaler with two types of instances, with priority given to one of them. So in theory you do not need two autoscaler processes; with that in mind, I "think" a single IAM should suffice
And the agent section on this machine is:
api_server:
web_server:
files_server:
Is that correct?
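For reference, the api section of clearml.conf on a default self-hosted clearml-server looks roughly like this (the hosts below are the default ports; substitute your own addresses and credentials):

api {
    api_server: http://localhost:8008
    web_server: http://localhost:8080
    files_server: http://localhost:8081
    credentials {
        access_key: "<access_key>"
        secret_key: "<secret_key>"
    }
}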
Yep this will work. BTW, check the new pipeline, it might have a more flexible solution:
https://github.com/allegroai/clearml/blob/master/examples/pipeline/full_custom_pipeline.py
ReassuredTiger98
It seems like clearml is not able to fetch the dependencies correctly when importlib is used.
If you have an example please let me know we'll try to fix it :)
Is it possible to read the dependencies manually from a conda environment.yml?
You can set detect_with_conda_freeze: true
in clearml.conf, it will just use the entire conda env
https://github.com/allegroai/clearml/blob/28b85028fe4da3ab963b69e8ac0f7feef73cfcf6/docs/clearml.conf#L170
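i.e. something like this in clearml.conf (a sketch; per the linked file the flag sits under the sdk.development section):

sdk {
    development {
        # store the full conda environment instead of the analyzed imports
        detect_with_conda_freeze: true
    }
}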
When looking at the worker details, it says "No queues currently assigned to this worker"
Yes, I think we should have better information there. The "AWS service" is not directly pulling jobs from any specific queue, hence nothing is listed there. It is "listening" to queues and launching machines, and those machines will be listening to the queue. I wonder if it is just easier to also make sure it is listed as "assigned" to those queues. wdyt?
to add an init script or to expand its capacity,
@<1546665634195050496:profile|SolidGoose91> I seem to see it in the wizard here, what am I missing?
I solved the issue by implementing my own ClearML logger
This is awesome! any chance you want to PR it to transformers ?
You mean like a name of the artifact ?
Hi FunnyTurkey96
Any chance you can try to run with the latest from GitHub (I just tested your code and it seemed to work on my machine)?
pip install git+
Well, PipelineDecorator actually allows you to do the same thing, with the same abilities, i.e. clone / modify / enqueue.
(I mean, Pipeline with tasks is also great, I just want to clarify that they have the same capabilities in this respect).
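For example, a minimal decorator-based sketch (project/step names are placeholders):

from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["data"], cache=True)
def load_data(n: int = 10):
    # each component runs as its own Task when the pipeline executes
    return list(range(n))

@PipelineDecorator.component(return_values=["total"])
def process(data):
    return sum(data)

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="0.1")
def run_pipeline(n: int = 10):
    print(process(load_data(n)))

if __name__ == "__main__":
    # debug locally; drop this call to enqueue the steps on agents
    PipelineDecorator.run_locally()
    run_pipeline()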
Hi RoughTiger69
A. Yes, makes total sense. Basically you can use Task.export / Task.import to achieve this (notice we assume the dataset artifact links are accessible from both servers; usually this is the case)
B. The easiest way would be to use Process: one subprocess exports from dev, with the credentials and configuration passed via os environment variables. Another subprocess imports it to the prod server (again with the os environment pointing to the prod server). Make sense?
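A rough sketch of B (file and variable names are illustrative; it assumes the Task.export_task / Task.import_task calls mentioned above, with each script run as its own subprocess and CLEARML_API_HOST / CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY pointing at the respective server):

# export_from_dev.py -- run with the environment pointing at the dev server
import json
import sys
from clearml import Task

task = Task.get_task(task_id=sys.argv[1])
with open("exported_task.json", "w") as f:
    json.dump(task.export_task(), f)

# import_to_prod.py -- run with the environment pointing at the prod server
import json
from clearml import Task

with open("exported_task.json") as f:
    Task.import_task(json.load(f))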
I'm not sure TB supports confusion matrices regardless; from anywhere in your code you can do:
from trains import Task
Task.current_task().get_logger().report_confusion_matrix(...)
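For example (a minimal sketch; the labels and values are made up):

import numpy as np
from trains import Task

task = Task.init(project_name="examples", task_name="confusion matrix demo")
matrix = np.array([[50, 2, 1],
                   [3, 45, 4],
                   [0, 5, 40]])
task.get_logger().report_confusion_matrix(
    title="Validation",
    series="epoch 1",
    matrix=matrix,
    iteration=1,
    xlabels=["cat", "dog", "bird"],
    ylabels=["cat", "dog", "bird"],
)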
I'm wondering what happens if I were to host the instance and one of these were to go down from time to time in production, as the deployments provided by the helm chart are not redundant.
Long story short, it will break the clearml-server, please do not take them down, if you do need to do that, also take down the clearml-server. The python clients will wait until it is up again, so no session would be destroyed
Hi RotundSquirrel78
How did you end up with this command line?
/home/sigalr/.clearml/venvs-builds/3.8/code/unet_sindiff_1_level_2_resblk --dataset humanml --device 0 --arch unet --channel_mult 1 --num_res_blocks 2 --use_scale_shift_norm --use_checkpoint --num_steps 300000
the arguments passed are odd (there should be none, they are passed inside the execution) and I suspect this is the issue
Right, if this is the case, then just use 'title/name 001'
it should be enough (I think this is how TB separates title/series or metric/variant)
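e.g. a minimal TensorBoard sketch (names are illustrative); the part before the slash becomes the title/metric and the rest the series/variant:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
for step in range(10):
    # reported as title "title" and series "name 001"
    writer.add_scalar("title/name 001", 0.1 * step, global_step=step)
writer.close()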
no, I set the env variable CLEARML_TASK_ID myself
Do not, this is the issue 🙂
this variable is used internally, and setting it yourself messes up the internal state; basically it is one of the signals the SDK uses to know there is an agent taking care of things (for example, logging the entire console output)
Use any other variable, for example MY_CLEARML_TASK_ID
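i.e. something along these lines (a minimal sketch; project/task names are placeholders):

import os
from clearml import Task

task = Task.init(project_name="examples", task_name="demo")
# leave CLEARML_TASK_ID untouched (the SDK/agent uses it internally);
# expose the id under your own variable instead
os.environ["MY_CLEARML_TASK_ID"] = task.id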
Hmm, we could add an optional test for the python version, and then fail the Task if the python version is not found. wdyt?
Is there any better way to avoid the upload of some artifacts of pipeline steps?
How would you pass "huge datasets (some GBs)" between different machines without storing it somewhere?
(btw, I would also turn on component caching, so if this is the same code with the same arguments the pipeline step is reused instead of re-executed all over again)
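For a Task-based pipeline that is the cache_executed_step flag (a minimal sketch; the step function and names are hypothetical):

from clearml import PipelineController

def preprocess(ratio: float = 0.2):
    # hypothetical step body
    return ratio * 2

pipe = PipelineController(name="example pipeline", project="examples", version="0.1")
pipe.add_function_step(
    name="preprocess",
    function=preprocess,
    function_kwargs=dict(ratio=0.2),
    function_return=["result"],
    # identical code + arguments -> reuse the previously executed step instead of re-running it
    cache_executed_step=True,
)
pipe.start_locally(run_pipeline_steps_locally=True)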