BattyLion34 when you run the script locally, you have this script ( ResNetFineTune.py ) on your machine, so it runs without any issue. But when running with the agent, the agent clones the repo, creates an env and runs the script. The issue is that the script can't be found in the cloned repo, because it only exists on your local machine, in your local copy of the git repo, and not in the repository the agent clones.
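Committing and pushing the script should solve it, something like this (assuming the script sits at the repo root and the task points to this branch):
git add ResNetFineTune.py
git commit -m "Add ResNet fine-tuning script"
git push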
If you are accessing a specific task artifact, you'll get an Artifact object ( trains.binding.artifacts.Artifact ).
Retrieving its content gives you one of the following objects: numpy.array, pandas.DataFrame, PIL.Image, dict (json), or pathlib2.Path.
Also, if you used pickle, the pickle.load return value is returned, and for strings a txt file (as it is stored).
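For example, a minimal sketch of reading an artifact back (assuming a task that uploaded a dict artifact named 'stats' - the task id and artifact name here are placeholders):
from trains import Task

task = Task.get_task(task_id='<YOUR TASK ID>')
artifact = task.artifacts['stats']        # trains.binding.artifacts.Artifact
stats = artifact.get()                    # the original dict
local_path = artifact.get_local_copy()    # or a local copy of the stored file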
Hi GleamingGrasshopper63 , can you share how you are running the clearml agent (venv or docker mode, and if docker, which image)?
To update the agent configuration after it has started running, you'll need to restart the agent 🙂
Hi WackyRabbit7 ,
If you only want to get the artifact object, you can use:
task_artifact = Task.get_task(task_id='<YOUR TASK ID>').artifacts['<YOUR ARTIFACT NAME>'].get()
👍 We'd appreciate it if you can open an issue at https://github.com/allegroai/trains/issues so we can consider this for future versions
Hey SarcasticSquirrel56 ,
are agents needed on the server installation? No, you can have an agent running on the same machine as the server (there is no limitation about it), but you can also run the clearml-agent on any other machine, and the clearml-server itself doesn't need any agents.
how many agents are recommended? It depends on your needs. Each agent runs a single task at a time, so if you have 2 agents running you can have 2 tasks training at the same time, and N agents will give you N running tasks.
what best practice...
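To make the agents/tasks relation concrete, a minimal sketch of starting two agents, each listening on its own queue (the queue names here are just examples):
clearml-agent daemon --queue default
clearml-agent daemon --queue gpu_queue --gpus 0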
Hi VexedCat68 , what is the dataset task's status?
Can you share the exception you get with --gpus "0,1" ?
In the task you cloned, do you have torch as part of the requirements?
SquareFish25 I'll try to reproduce it
Hi DefeatedCrab47
If you are referring to this example, examples/frameworks/tensorboardx/pytorch_tensorboardX.py, it only has test and train steps.
If you'd like to plot validation together with train, you can use the same prefix, for example writer.add_scalar('<prefix>/Test_Loss', ...)
, like in this example - https://demoapp.trains.allegro.ai/projects/bb21d73db5634e6d837724366397b6e2/experiments/f46160152ee54ff9863bb2efe554a6b1/output/metrics/scalar
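For instance, a minimal sketch that keeps the same prefix so train and validation scalars end up on the same graph (the scalar names and values here are just an example):
from tensorboardX import SummaryWriter

writer = SummaryWriter()
for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)   # placeholder values for the sketch
    val_loss = 1.2 / (epoch + 1)
    writer.add_scalar('Loss/Train', train_loss, epoch)        # same 'Loss' prefix
    writer.add_scalar('Loss/Validation', val_loss, epoch)     # so both appear on one graph
writer.close()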
Hi SubstantialElk6 ,
You can use any docker image you have access to.
Can you attach the logs with the error? virtualenv should be installed with the clearml-agent
Hi DeliciousBluewhale87
So now you don't have any failures, only a GPU usage issue? How about running the ClearML agent in docker mode? You can choose an Nvidia docker image, and all the CUDA installation and configuration will be part of the image.
What do you think?
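For example (the queue name and image are placeholders - any Nvidia CUDA image you have access to will do):
clearml-agent daemon --queue <your queue> --docker <your nvidia cuda image>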
There are many ways to do so, this is an example of a GitHub Action: https://github.com/allegroai/trains-actions-train-model
So to add a model to be served with an endpoint, you can use
clearml-serving triton --endpoint "<your endpoint>" --model-project "<your project>" --model-name "<your model name>"
When the model gets updated, it should use the new one
Regarding this one, as mentioned, you can have a full IAM role (without any credentials) in the higher tiers; in the regular tier you'll need the credentials to authenticate the boto3 calls used for spin up, spin down, tags and similar API commands. The app is currently hosted by us, so your IAM role won't really be available to it.
With it, the newly created instance will have the IAM role associated with it too
max_spin_up_time_min - the maximum time for an instance to spin up
max_idle_time_min - the maximum time for an instance to stay up with a worker on it (the time after the EC2 instance has finished running a task with the agent and is still up, with a worker running on it and listening to the queue)
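For illustration, a minimal sketch of how these could look in the autoscaler configuration (assuming the dict-style settings used by the AWS autoscaler example; the values are just examples):
hyper_params = {
    "max_spin_up_time_min": 30,   # fail if a new instance doesn't spin up within 30 minutes
    "max_idle_time_min": 15,      # spin the instance down after 15 minutes without a task from the queue
}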
Hi PunyBee36 ,
Thanks for reporting this, the log message will be fixed in the next clearml version, will update here about it 🙂
About the running task, I can see in the logs that a new instance was created (i-02fc8...). Can you check if you have a clearml agent running on it? If so, the agent will pull the task from the queue; if not, can you check the instance logs for errors and share them?
When uploading the files, a hash is calculated for every entry, and this is done on the local files, so currently clearml-data supports local files.
What would you like to do with the dataset? Why not use it directly from S3?
Hi TenseOstrich47 , currently the StorageManager supports uploading local files, what do you mean by memory?
Sure, with clearml
and clearml-agent
you get autoscaling for your machines (with monitoring) and automation for your tasks that will handle everything for you (docker images, managing the credentials, …).
There are many more parts in the system, so maybe you can share a use case so I can help you with it?
basically it runs pip freeze
in your execution env and creates the requirements according to it, without any analysis
you can use this description as the preview, can this help?
task.upload_artifact(name='artifact name', artifact_object=artifact_object, preview="object description")
I didn't get such an issue - are you using http://app.clear.ml or your own server?
Hi TenseOstrich47 , the StorageManager does use boto3 for those uploads (so if it's not supported by boto3, the same goes for the StorageManager :/ )
Maybe you can use 'wait_for_upload' and delete the local file afterwards?
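Something like this minimal sketch (the local path and bucket URL are just placeholders):
from clearml import StorageManager
import os

local_file = '/tmp/model.pkl'
StorageManager.upload_file(
    local_file=local_file,
    remote_url='s3://<your bucket>/models/model.pkl',
    wait_for_upload=True,   # block until the upload is done
)
os.remove(local_file)       # safe to delete the local copy once the upload finished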
Are you running the services with an agent?
Which version do you use?
Yep, this should solve the issue. Let me check the cause of it