
@<1523701070390366208:profile|CostlyOstrich36>
I've updated the instance type to t3a.large.
The issue persisted.
I have been rerunning it since yesterday. The error persists.
I can try one more time though.
I don't store anything on the ClearML server; everything is stored in S3 and referenced by ClearML.
Also, (without ClearML) the model artifacts are uploadable/downloadable.
That's a big context!
In general, I'm using standard functions; the script is running in a SageMaker pipeline.
The model, however, is a composite and consists of multiple primitive ones.
task = Task.init(
    project_name="icp",
    task_name=f"model_training_{client_name}",
    task_type=Task.TaskTypes.training,
    auto_connect_frameworks={'matplotlib': True, 'tensorflow': False,
                             'tensorboard': False,
                             'pytorch': False, 'xgboos...
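For reference, the full call looks roughly like this (the dict was cut off above at 'xgboos...', so the last entry is my best guess, and client_name is just a placeholder here):

from clearml import Task

client_name = "example_client"  # placeholder

task = Task.init(
    project_name="icp",
    task_name=f"model_training_{client_name}",
    task_type=Task.TaskTypes.training,
    auto_connect_frameworks={
        'matplotlib': True,    # keep plot auto-logging
        'tensorflow': False,   # framework auto-logging disabled;
        'tensorboard': False,  # the model is registered manually instead
        'pytorch': False,
        'xgboost': False,      # assumed -- this is where the message was cut off
    },
)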
I need to SSH into the instance, right?
I'll check it out.
it's:
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
fzd6tw0x46-algo-1-lswt4 | 2025-05-20 10:02:08,177 - urllib3.connectionpool - WARNING - Retrying (Retry(total=2, connect=2, read=5, redirect=5, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x76f683ccb670>, 'Connection to "" timed out. (connect timeout=300.0)')': /
fzd6tw0x46-algo-1-lswt4 | 2025-05-20 10:02:08,178 - ur...
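Since the host in that timeout is empty, a quick sanity check of the endpoints the SDK is actually using might help (the hosts below come from the CLEARML_* env vars if set, with placeholder fallbacks):

import os
import requests

endpoints = {
    "api_server": os.environ.get("CLEARML_API_HOST", "http://<server-ip>:8008"),
    "web_server": os.environ.get("CLEARML_WEB_HOST", "http://<server-ip>:8080"),
    "files_server": os.environ.get("CLEARML_FILES_HOST", "http://<server-ip>:8081"),
}
for name, url in endpoints.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: {url} -> HTTP {status}")
    except requests.RequestException as exc:
        print(f"{name}: {url} -> unreachable ({exc})")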
ok. Currently the EBS volume is 15 GB; is there a recommended size?
@<1722061389024989184:profile|ResponsiveKoala38> I'm looking at the logs now (used "docker logs clearml-elastic").
The status seemed to have transitioned, but the error isn't clear.
{"@timestamp":"2025-05-20T08:36:18.412Z", "log.level": "INFO", "message":"setting file [/usr/share/elasticsearch/config/operator/settings.json] not found, initializing [file_settings] as empty", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.nam...
ClearML Task: created new task id=f08b012bce42420dba7cd166668f5e4b
2025-05-20 09:54:59,251 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: /projects/184c6e8651d94b9088ae60ae3a9c8ace/experiments/f08b012bce42420dba7cd166668f5e4b/output/log
2025-05-20 12:55:02
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Starting the training.
....
ClearML Monitor: Could not detect iteration reporting, falling back...
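(On the iteration warning itself: I assume it shows up because nothing reports scalars with explicit iteration numbers; something like this in the training loop would give it real iterations -- just a sketch, the loop and loss value are placeholders:)

from clearml import Logger

logger = Logger.current_logger()
num_epochs = 10  # placeholder

for epoch in range(num_epochs):
    loss = 1.0 / (epoch + 1)  # stand-in for the real training step
    logger.report_scalar(title="train", series="loss", value=loss, iteration=epoch)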
it's behaving very strangely.
I'm trying to provision the instance, but something is off.
It's as if some functionality is missing.
green open events-log-d1bd92a3b039400cbafc60a7a5b1e52b Yh4BPGmgRZKU7STdCghmtw 1 0 96 0 175.1kb 175.1kb 175.1kb
so, the same ClearML monitor error, but another issue now.
btw, the task logs the configuration, artifacts, etc.
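Roughly what that looks like (simplified sketch; the config values and artifact here are placeholders, not the real ones):

config = task.connect({"model_type": "composite", "n_estimators": 100})
task.upload_artifact(name="training_summary", artifact_object={"auc": 0.9})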
I get this error at the end.
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "No shard was specified in the request which means the response should explain a randomly-chosen unassigned shard, but there are no unassigned shards in this cluster. To explain the allocation of an assigned shard you must specify the target shard in the request. See
        for more information."
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "N...
@<1722061389024989184:profile|ResponsiveKoala38> @<1523701070390366208:profile|CostlyOstrich36>
It's ClearML; I commented out the clearml lines, and it ran successfully!
no, it's something else.
I commented out the above two lines and I was still facing the issue.
ok, I'm recreating the EC2 instance to generate an SSH key pair, then I'll check the Elasticsearch logs.
curl -XGET "localhost:9200/_cluster/allocation/explain?pretty"
curl: (7) Failed to connect to localhost port 9200 after 0 ms: Couldn't connect to server
I tried deleting all the underlying resources (EC2 & EBS) and recreating them.
It's the entire error repeating.
And, this happens at the end of the script.
I'm using the recommended instance (t3.large).
I'm beginning to think that there is something going on besides ClearML. I'll execute the training script remotely on SageMaker, instead of in SageMaker local mode.
@<1722061389024989184:profile|ResponsiveKoala38> It's not resolved.
Also, it would be great if you could add a recommendation for EBS size in this guide ( None ). The Elasticsearch issue happened with 8 GB and was resolved with 15 GB.
To close this thread: the file server port wasn't configured.
I added
- IpProtocol: tcp
  FromPort: 8081
  ToPort: 8081
  CidrIp: 0.0.0.0/0
to the CloudFormation template, and it was resolved.
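A quick way to double-check the port from outside, if anyone hits the same thing (the host below is a placeholder):

import socket

with socket.create_connection(("<clearml-server-ip>", 8081), timeout=5):
    print("fileserver port 8081 is reachable")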
Thanks a bunch, guys
@<1722061389024989184:profile|ResponsiveKoala38> @<1523701070390366208:profile|CostlyOstrich36>
I tested that theory before; I commented out these two lines
output_model = OutputModel(task=task, name="trained_model")
output_model.update_weights(register_uri=s3_model_uri)
The issue, however, persisted.