And when you retrieve just this file, does it work?
(Maybe for some reason the file is corrupted?)
That didn't give any useful info; the issue was that docker was not installed on the agent machine x)
JitteryCoyote63 you mean "docker" was not installed and it did not throw an error ?
Hmm let me check something
Hi @<1730033904972206080:profile|FantasticSeaurchin8>
You mean in the UI, or when reporting via the SDK?
Hi ExasperatedCrocodile76
It seems like it is using the conda package manager. Were you using conda when you ran the code manually?
ERROR: This cross-compiler package contains no program /home/ivan/miniconda3/envs/clearML/bin/x86_64-conda_cos6-linux-gnu-gfortran
Why is it trying to install from source code?
BTW: can you test with the latest agent RC? (pip install clearml-agent==1.4.0rc4)
NICE! MoodyCentipede68 this is awesome 🙂
WickedElephant66 this seems like a general network issue, e.g. the docker service is missing your company's firewall certificate.
Can you pull any container from docker hub ?
Ok, but when nvcc is not available, the agent uses the output from nvidia-smi, right? On one of my machines, nvcc is not installed, and in the experiment logs of the agent running there, agent.cuda = is the version shown with nvidia-smi
Already added to the next agent's version 😉
Hi @<1523701304709353472:profile|OddShrimp85>
Do you mean Dataset.get_local_copy()?
Hmm good question, I'm actually not sure if you can pass 24GB (this is not a limit on the GPU memory, this affects the memblock size, I think)
I want the model to be stored in a way that clearml-serving can recognise it as a model
Then OutputModel or task.update_output_model(...)
You have to serialize it, in a way that later your code will be able to load it.
With XGBoost, when you call model.save clearml automatically picks it up and uploads it for you,
assuming you created the Task with Task.init(..., output_uri=True)
You can also manually upload the model with task.update_output_model or, equivalently, with the OutputModel class.
if you want to dis...
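To make it concrete, here is a minimal sketch of that flow (the project/task names and the model file name are just placeholders, and it assumes your clearml.conf is already set up):
` from clearml import Task
import numpy as np
import xgboost as xgb

# output_uri=True -> uploads model files to the default files server
task = Task.init(project_name='examples', task_name='xgb train', output_uri=True)

model = xgb.XGBClassifier(n_estimators=5)
model.fit(np.random.rand(20, 4), np.random.randint(0, 2, 20))

# with the xgboost integration this save is usually picked up automatically
model.save_model('model.json')

# or register/upload the serialized file explicitly
task.update_output_model(model_path='model.json', name='my xgb model') `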
I think the main difference is that I can see a value of having access to the raw format within the cloud vendor and not only have it as an archive
I see, it does make sense.
Two options: one, as you mentioned, use the ClearML StorageManager to upload the files, then register them as external links with Dataset.
Two, I know the enterprise tier has HyperDatasets, which are essentially what you describe, with version control over the "metadata" and "raw storage" on the GCP, including the ab...
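For option one, a rough sketch, assuming your clearml.conf already has credentials for the bucket (the bucket path, file name and project name are placeholders):
` from clearml import Dataset, StorageManager

# upload the raw file to your own cloud storage (stays in its native format there)
remote_url = StorageManager.upload_file(
    local_file='data/images/img_0001.png',
    remote_url='gs://my-bucket/raw/img_0001.png',
)

# register it in a dataset as an external link (ClearML stores only the link + metadata)
ds = Dataset.create(dataset_project='examples', dataset_name='raw images')
ds.add_external_files(source_url=remote_url)
ds.upload()
ds.finalize() `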
I would just add git+
None to your requirements (either in the requirements.txt or even better as part of the pipeline/component where you also specify the repo to be used)
The agent will automatically push the credentials when it installs the repo as a wheel.
wdyt?
btw: you might also get away with adding -e .
into the requirements.txt (but you will need to test that one)
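For the pipeline/component route, roughly something like this (the repo URL and package name are placeholders, not real projects):
` from clearml import PipelineDecorator

@PipelineDecorator.component(
    return_values=['result'],
    repo='https://github.com/your-org/your-repo.git',                  # repo the step runs from
    packages=['git+https://github.com/your-org/your-package.git'],     # extra package installed from git
)
def my_step(x):
    # import inside the component so it resolves on the agent machine
    from your_package import do_something
    return do_something(x) `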
If you cannot change the "TrainerState" (i.e. inherit and pass it into the code)
you could also monkey-patch it, something like
` class OurTrainerState(TrainerState):
    def __init__(self, *args, **kwargs):
        ...

    @classmethod
    def load_from_json(cls, json_path: str):
        state = super().load_from_json(json_path)
        Task.current_task().upload_artifact(name='state', artifact_object=json_path)
        return state

trainer.state = OurTrainerState(trainer.state) `
nice @<1724960458047229952:profile|EnergeticKoala33> !
The issue was that the agent was trying to start the docker container but had no credentials to do that; your solution is exactly what needed to be done
Hi MelancholyBeetle72 , that's a very interesting case. I can totally understand how storing a model and then immediately renaming it breaks the upload. A few questions: is there a way for pytorch lightning not to rename the model? Also I wonder if this scenario (storing a model and then changing it) happens a lot. I think the best solution is for Trains to create a copy of the file and upload it in the background. That said, the name will still end with .part. What do you think?
Hi WackyRabbit7
First, always check the functions on the Task object; they are the most straightforward access to the system.
Then, if you need general-purpose API calls, they are currently only documented in the doc-strings of the API schema (that said, it should be quite well documented)
You can check all the endpoints https://github.com/allegroai/trains/tree/master/trains/backend_api/services/v2_8
And finally, if you want to easily use the RestAPI:
` from trains.backend_api.session.client impo...
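A rough sketch of what that client usage could look like (the query arguments here are illustrative, not a prescription):
` from trains.backend_api.session.client import APIClient

client = APIClient()
# e.g. list the most recently updated tasks
tasks = client.tasks.get_all(order_by=['-last_update'], page_size=10)
for t in tasks:
    print(t.id, t.name) `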
No, after. Do you see the poetry lock removed in the uncommitted changes?
What's strange is that the remote jobs, as soon as they are launched, if I compare their configs while in state pending, they have the right all different configs, but later, while running,
Wait, I think I found it. Usually with hydra you configure everything from overrides / config, and when launched remotely it looks at those by default. But with the launch plugin it should be overwritten with the Task
` task = Task.init(...)
task.set_parameter(name="Hydra/_allow_omegaconf_ed...
The docker itself does not have the host configured.
hmmm, somehow I have a bad feeling about it... Could you check the log? It should say something like "Collecting torch==1.6.0.dev20200421+cu101 from https://"
It should be right at the top of the installation. What do you have there?
Hi ApprehensiveFox95
I think this is what you are looking for:
` step1 = Task.create(
    project_name='examples', task_name='pipeline step 1 dataset artifact',
    repo=' ',
    working_directory='examples/pipeline',
    script='step1_dataset_artifact.py',
    docker='nvcr.io/nvidia/pytorch:20.11-py3'
).id
step2 = Task.create(
    project_name='examples', task_name='pipeline step 2 process dataset',
    repo=' ',
    working_directory='examples/pipeline',
    script='step2_data_pr...
Hi RobustFlamingo1
The ClearML Orchestrator looks interesting. But the website suggests that K8S is required
No k8s is not a must, only an option 🙂
We have a Linux training box (LambdaBox) where we want to run training. Can we place the ClearML orchestrator agent on the machine without needing K8S?
Yes should be quite easy.
If you intend to use containers, make sure you have docker installed.
Then just pip install clearml-agent
and configure it:
https://clear.ml/doc...
/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py
Yep I see it now, could you simulate it locally (i.e. have the other folders in the path as well)?
could it be you also have a file somewhere that is called sfi or imagery or models or chip_classifier that it accidentally tries to import from first?
Hi @<1569858449813016576:profile|JumpyRaven4>
- The gunicorn logs do not show anything including any error or trace of the 502 only siege reports the 502 as well as the ALB.
Is this an ALB or an ELB?
What's the timeout it's configured with?
Do you have GPU instances as well? What's the clearml-serving-inference docker version?
Oh I see, basically a UI feature.
I'm assuming this is not just changing the x-axis in the UI, but somehow storing the x-axis as part of the reported scalars?
Could I use "register artifact"
I think this is somewhat deprecated and we should probably replace it with something similar to what you mentioned (i.e. watching a file for changes).
Right now the easiest way would be to manually upload the trainer_state.json on every checkpoint:
` Task.current_task().upload_artifact(name='state', artifact_object='trainer_state.json') `