
Try removing this magic environment variable that tells the sub-process there was already an initialized Task:
```
import os
env = dict(**os.environ)
env.pop('TRAINS_PROC_MASTER_ID', None)
```
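For context, a hedged sketch of how the cleaned environment could then be handed to the sub-process (the subprocess call and script name are assumptions for illustration, not from the original thread):

```
import os
import subprocess

# Copy the environment and drop the marker that tells the child process
# a Task was already initialized in the parent
env = dict(**os.environ)
env.pop('TRAINS_PROC_MASTER_ID', None)

# assumption: launching the sub-process with the cleaned environment (script name is a placeholder)
subprocess.Popen(["python", "my_subprocess_script.py"], env=env)
```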
Our remote machine is Windows 10
JumpyDragonfly13 seems like the Windows 10 + docker is the issue (that would explain the OCI error)
Is this relevant ?
https://github.com/microsoft/WSL/issues/5100
Hi @<1628565287957696512:profile|AloofBat92>
Yeah, the name is confusing, we should probably change that. The idea is that it is a low-code / high-code way to train your own LLM and deploy it. Not really a ChatGPT 1:1 comparison, more like GenAI for enterprises. Make sense ?
LOL love that approach.
Basically here is what I'm thinking,
```
from clearml import Task, InputModel, OutputModel

task = Task.init(...)

# run this part once
if task.running_locally():
    my_auxiliary_stuff = OutputModel()
    my_auxiliary_stuff.system_tags = ["DATA"]
    my_auxiliary_stuff.update_weights_package(weights_path="/path/to/additional/files")
    input_my_auxiliary = InputModel(model_id=my_auxiliary_stuff.id)
    task.connect(input_my_auxiliary, "my_auxiliary")

task.execute_remotely()
my_a...
```
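For completeness, a hedged sketch of how the remote run could pull the packaged auxiliary files back down, assuming the registered model id is available (the id below is a placeholder):

```
from clearml import InputModel

# Retrieve the auxiliary package by the model id that was registered locally
aux_model = InputModel(model_id="<model-id>")  # placeholder id
local_files_dir = aux_model.get_local_copy()   # returns the extracted folder path
print("auxiliary files extracted to:", local_files_dir)
```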
Ohh I see now, okay there are two entries on an artifact: the actual artifact (a link to a file somewhere) and the text preview of the artifact. I think the "preview" is the issue.
Could you remove it and test ?
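For reference, a hedged sketch of re-uploading the artifact with an empty text preview to test this (the project/task/artifact names and file path are placeholders, and `upload_artifact` is assumed to be how the artifact is registered):

```
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact preview test")  # placeholders
# Upload the file itself, explicitly passing an empty preview instead of a large text preview
task.upload_artifact(name="my_artifact", artifact_object="/path/to/file.csv", preview="")
```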
MysteriousBee56 not a different port, just not with "localhost" but with your machine's IP
Hi @<1523711619815706624:profile|StrangePelican34>
if I am trying to deploy 100 models on a GPU that can handle 5 concurrently,
The main limitation is Triton's ability to dynamically load / unload models. We know Nvidia is adding this capability, but I think it is still not out; once they support it, it should be transparent.
CLI? Code ?
It's the same but done from outside, you want the same and "offline" as well right?
Hi @<1551376687504035840:profile|StraightSealion9>
AWS Autoscaler to create a new instance when you enqueue a task to the relevant queue.
Does that mean that you were able to enqueue a Task and have it launch on the remote EC2 machine ?
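For context, a minimal sketch of the enqueue side, assuming an existing Task is cloned and pushed to the queue the autoscaler is watching (the project/task names and queue name are placeholders):

```
from clearml import Task

# Clone an existing (template) Task and enqueue it on the autoscaler's queue
template = Task.get_task(project_name="examples", task_name="my_training")  # placeholders
cloned = Task.clone(source_task=template)
Task.enqueue(cloned, queue_name="aws_autoscaler_queue")  # placeholder queue name
```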
Hi SarcasticSparrow10
You will need to have multiple trains-agents, but they will be sharing the same queue (i.e. pulling jobs from the same queue the HPO process is pushing to).
Make sense ?
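For illustration, a minimal sketch of the pushing side with the current clearml package (the base task id, parameter range, metric names and queue name are all placeholders):

```
from clearml.automation import HyperParameterOptimizer, UniformParameterRange

# All agents listening on the same queue will pull the Tasks this optimizer enqueues
optimizer = HyperParameterOptimizer(
    base_task_id="<base-task-id>",  # placeholder
    hyper_parameters=[UniformParameterRange("General/lr", min_value=0.001, max_value=0.1)],
    objective_metric_title="validation",
    objective_metric_series="loss",
    objective_metric_sign="min",
    execution_queue="default",  # the queue shared by all the agents
)
optimizer.start()
```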
I'm not sure if this was solved, but I am encountering a similar issue.
Yep, it was solved (I think v1.7+)
With `spawn` and `forkserver` (which is used in the script above) ClearML is not able to automatically capture PyTorch scalars and artifacts.
The "trick" is to have Task.init before you spawn your code, then (since your code will not start from the same state), you should call Task.current_task(), which would basically make sure everything is...
Hi @<1556812486840160256:profile|SuccessfulRaven86>
Please notice that clearml-serving is not designed for public exposure; it lacks a security layer and is designed for easy internal deployment. If you feel you need the extra security layer, I suggest either adding external JWT-like authentication, or talking to the ClearML people; their paid tiers include enterprise-grade security on top.
How so? Installing a local package should work, what am I missing?
@<1523712386849050624:profile|NastyFox63>
is there a limit to the search depth for this?
Yes, the Task.init auto package listing is only the first depth (i.e. directly imported packages);
the reason is that the derivative packages should be resolved by pip when the agent remotely executes that Task.
Now when the agent is installing the Task, the entire python environment is stored, so that it is always fully reproducible.
Make sense ?
Hi @<1545216070686609408:profile|EnthusiasticCow4>
The auto detection of clearml is based on the actual imported packages, not the requirements.txt of your entire python environment. This is why some of them are missing.
That said you can always manually add them
```
Task.add_requirements("hydra-colorlog")  # optionally pin a version, e.g. package_version="1.2.0"
task = Task.init(...)
```
(notice: call add_requirements before Task.init)
Hi @<1523701797800120320:profile|SteadySeagull18>
...the job -> requeue it from the GUI, then a different environment is installed
The way it works is: in the "originating" (i.e. first manual) execution, only the directly imported packages are listed (no derivative packages that are required by the original packages).
But when the agent is reproducing the job, it creates a whole clean venv for the experiment, installs the required packages, then pip resolves the derivatives, and ...
yes you are correct, I would expect the same.
Can you try manually importing pt, and maybe also moving the Task.init before darts?
In the installed packages section it includes `pywin32 == 303` even though that is not in my requirements.txt.
So for some reason it is being detected (meaning your code base actually imports it in code)
But you can just remove it, either by manually editing the cloned Task (right click, reset, then you can edit the section), or via code:
```
Task.ignore_requirements("pywin32")
task = Task.init(...)
```
If i point directly to the data.yaml the training starts without any problem
what do you mean? how do you know where the extracted file is?
basically:
```
data_path = Dataset.get(...).get_local_copy()
```
then you should be able to open your file with `open(data_path + "/data.yaml", "rt")`
does that work?
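Putting it together, a small sketch (the dataset project/name are placeholders, and pyyaml is assumed to be available):

```
import yaml
from clearml import Dataset

# Fetch a local (cached) copy of the dataset
data_path = Dataset.get(dataset_project="examples", dataset_name="my_dataset").get_local_copy()

# Open the yaml file that is part of the dataset
with open(data_path + "/data.yaml", "rt") as f:
    data_cfg = yaml.safe_load(f)
print(data_cfg)
```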
Hi AstonishingRabbit13
is there option to omit the task_id so the final output will be deterministic and know prior to the task run?
Actually no, the full path is unique for the run, so you do not end up overwriting models.
You can get the full path from the UI (Models Tab) or programmatically with Model.query_models or using the Task.get_task methods.
What's the idea behind a fixed location for the model?
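For illustration, a hedged sketch of the two programmatic options mentioned above (the project/model names and task id are placeholders):

```
from clearml import Model, Task

# Option 1: query models directly
models = Model.query_models(project_name="examples", model_name="my_model")  # placeholders
print(models[0].url)

# Option 2: go through the originating Task and take its latest output model
task = Task.get_task(task_id="<task-id>")  # placeholder
print(task.models["output"][-1].url)
```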
Yea the "-e ." seems to fit this problem the best.
It seems like whatever I add to `docker_bash_setup_script` is having no effect.
If this is running with the k8s glue, the console output of the `docker_bash_setup_script` is currently not logged into the Task (this bug will be solved in the next version), but the code is being executed. You can see the full logs with kubectl, or test with a simple export test in the `docker_bash_setup_script`: `export MY...`
But this is the clearml python package, it is not really related to the server. Could it be you also updated the clearml package ?
now it stopped working locally as well
At least this is consistent
How so ? Is the "main" Task still running ?
RC is out, SmugSnake6 please try with `pip install clearml==1.7.2rc1`
if so is there any doc/examples about this?
Good point, passing to docs
https://github.com/allegroai/clearml/blob/51af6e833ddc5a8ba1efaaf75980f58616b25e85/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py#L123
I mean it is mentioned, but we should highlight it better
Hmm yes, this is exactly what should not happen
Let me check it
Hi SmarmyDolphin68
I see this in between my training epochs, what could be causing this?
This is basically saying we are saving a second model on the same Task, and even though both are logged, only the last one is stored on the Task itself.
This will change, as in the next version a Task will be able to hold references to multiple models in the artifactory.
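As a pointer, a hedged sketch of how multiple output models can be listed on a Task in versions that support it (the task id is a placeholder):

```
from clearml import Task

task = Task.get_task(task_id="<task-id>")  # placeholder
# List every output model registered on the Task
for model in task.models["output"]:
    print(model.name, model.url)
```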