
Commit 636c880

Merge pull request #143 from terhunej/master
zETL workshop content update
2 parents: 15ba0ee + 838243c

48 files changed: +251 -207 lines

content/dynamodb-opensearch-zetl/integrations/index.en.md

Lines changed: 3 additions & 1 deletion
@@ -4,4 +4,6 @@ menuTitle: "Integrations"
 date: 2024-02-23T00:00:00-00:00
 weight: 30
 ---
-In this section, you will configure integrations between services. You'll first set up ML and Pipeline connectors in OpenSearch Service followed by a zero ETL connector to move data written to DynamoDB to OpenSearch. Once these integrations are set up, you'll be able to write records to DynamoDB as your source of truth and then automatically have that data available to query in other services.
+In this section, you will configure integrations between services. First you will set up machine learning (ML) and Pipeline connectors in OpenSearch Service. Then you will set up a zero-ETL connector to move data stored in DynamoDB into OpenSearch for indexing. Once both of these integrations are set up, you'll be able to write records to DynamoDB as your source of truth and then automatically have that data available to query in the other services.
+
+![Integrations](/static/images/connectionsandpipelines.png)

content/dynamodb-opensearch-zetl/integrations/os-connectors.en.md

Lines changed: 33 additions & 24 deletions
@@ -4,19 +4,21 @@ menuTitle: "Load DynamoDB Data"
 date: 2024-02-23T00:00:00-00:00
 weight: 20
 ---
-In this section you'll configure ML and Pipeline connectors in OpenSearch Service. These configurations are set up by a series of POST and PUT requests that are authenticated with AWS Signature Version 4 (sig-v4). Sigv4 is the standard authentication mechanism used by AWS services. While in most cases an SDK abstracts away sig-v4 but in this case we will be building the requests ourselves with curl.
+In this section you'll configure OpenSearch so it will preprocess and enrich data as it is written to its indexes, by connecting to an externally hosted machine learning embeddings model. This is a simpler application design than having your application write the embeddings as an attribute on the item within DynamoDB. Instead, the data is kept as text in DynamoDB, and when it arrives in OpenSearch, OpenSearch will connect out to Bedrock to generate and store the embeddings.
 
-Building a sig-v4 signed request requires a session token, access key, and secret access key. You'll first retrieve these from your Cloud9 Instance metadata with the provided "credentials.sh" script which exports required values to environmental variables. In the following steps, you'll also export other values to environmental variables to allow for easy substitution into listed commands.
+More information on this design can be found at [ML and Pipeline connectors in OpenSearch Service](https://opensearch.org/docs/latest/ml-commons-plugin/remote-models/index/).
 
-1. Run the credentials.sh script to retrieve and export credentials. These credentials will be used to sign API requests to the OpenSearch cluster. Note the leading "." before "./credentials.sh", this must be included to ensure that the exported credentials are available in the currently running shell.
-```bash
-. ./credentials.sh
-```
-1. Next, export an environmental variable with the OpenSearch endpoint URL. This URL is listed in the CloudFormation Stack Outputs tab as "OSDomainEndpoint". This variable will be used in subsequent commands.
-```bash
-export OPENSEARCH_ENDPOINT="https://search-ddb-os-xxxx-xxxxxxxxxxxxx.us-west-2.es.amazonaws.com"
-```
-1. Execute the following curl command to create the OpenSearch ML model connector.
+We will perform these configurations using a series of POST and PUT requests made to OpenSearch endpoints. The calls will be made using the IAM role that was previously mapped to the OpenSearch "all_access" role.
+
+The calls are authenticated with AWS Signature Version 4 (SigV4). SigV4 is the standard authentication mechanism used by AWS services. In most cases an SDK abstracts away the SigV4 details, but in this case we will be building the requests ourselves with curl.
+
+Building a SigV4-signed request requires a session token, access key, and secret access key. These are available to your VS Code instance as metadata. These values were retrieved by the "credentials.sh" script you ran during setup, which pulled the required values and exported them as environment variables for your use. In the following steps, you'll also export other values to environment variables to allow for easy substitution into the various commands.
+
+If any of the following commands fail, try re-running the credentials.sh script in the :link[Environment Setup]{href="/setup/step1"} step.
+
+As you run these steps, be very careful about typos. Also remember the Copy icon in the corner of each code block.
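For reference, every signed request in the numbered steps below follows roughly the same shape. A minimal sketch is shown here, assuming a curl build with `--aws-sigv4` support, the us-west-2 region, and that credentials.sh exported the session token as `METADATA_AWS_SESSION_TOKEN` (an assumed variable name); the exact flags used in the workshop are the ones shown in each step's full command.

```bash
# Sketch of the SigV4-signed curl pattern used throughout this section.
# Assumes OPENSEARCH_ENDPOINT and the METADATA_AWS_* variables are exported.
curl --request GET \
  ${OPENSEARCH_ENDPOINT}'/_cluster/health' \
  --aws-sigv4 "aws:amz:us-west-2:es" \
  --header "x-amz-security-token: ${METADATA_AWS_SESSION_TOKEN}" \
  --user "${METADATA_AWS_ACCESS_KEY_ID}:${METADATA_AWS_SECRET_ACCESS_KEY}"
```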
+
+1. Execute the following curl command to **create the OpenSearch ML model connector**. You can use ML connectors to connect OpenSearch Service to a model hosted on Bedrock or a model hosted on a third-party platform. Here we are connecting to the Titan embedding model hosted on Bedrock.
 ```bash
 curl --request POST \
 ${OPENSEARCH_ENDPOINT}'/_plugins/_ml/connectors/_create' \
@@ -53,11 +55,11 @@ Building a sig-v4 signed request requires a session token, access key, and secre
 ]
 }'
 ```
-1. Note the "connector_id" returned in the previous command. Export it to an environmental variable for convenient substitution in future commands.
+1. Note the **"connector_id"** returned in the previous command. **Export it to an environment variable** for convenient substitution in future commands.
 ```bash
 export CONNECTOR_ID='xxxxxxxxxxxxxx'
 ```
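If you prefer not to copy the ID by hand, one option is to save the JSON response from the `_create` call to a file and pull the field out with `jq`. This is a sketch, assuming `jq` is installed and that the response was saved as `connector.json` (a hypothetical file name); the same pattern works for the model group and model IDs in the later steps.

```bash
# Extract "connector_id" from a saved _create response and export it.
export CONNECTOR_ID=$(jq -r '.connector_id' connector.json)
echo "CONNECTOR_ID=${CONNECTOR_ID}"
```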
-1. Run the next curl command to create the model group.
+1. Run the next curl command to **create the model group**.
 ```bash
 curl --request POST \
 ${OPENSEARCH_ENDPOINT}'/_plugins/_ml/model_groups/_register' \
@@ -71,7 +73,7 @@ Building a sig-v4 signed request requires a session token, access key, and secre
 "description": "This is an example description"
 }'
 ```
-1. Note the "model_group_id" returned in the previous command. Export it to an environmental variable for later substitution.
+1. Note the **"model_group_id"** returned in the previous command. **Export it to an environment variable** for later substitution.
 ```bash
 export MODEL_GROUP_ID='xxxxxxxxxxxxx'
 ```
@@ -92,15 +94,17 @@ Building a sig-v4 signed request requires a session token, access key, and secre
 "connector_id": "'${CONNECTOR_ID}'"
 }'
 ```
-1. Note the "model_id" and export it.
+1. Note the **"model_id"** (NOT the task_id) and export it.
 ```bash
 export MODEL_ID='xxxxxxxxxxxxx'
 ```
-1. Run the following command to verify that you have successfully exported the connector, model group, and model id.
+1. Run the following command to **verify that you have successfully exported the connector, model group, and model ID**.
 ```bash
 echo -e "CONNECTOR_ID=${CONNECTOR_ID}\nMODEL_GROUP_ID=${MODEL_GROUP_ID}\nMODEL_ID=${MODEL_ID}"
 ```
-1. Next, we'll deploy the model with the following curl.
+
+::alert[_Make sure all three environment variables are exported correctly. Otherwise, the next commands will fail._]
+
+1. Next, we'll **deploy the model** with the following curl.
 ```bash
 curl --request POST \
 ${OPENSEARCH_ENDPOINT}'/_plugins/_ml/models/'${MODEL_ID}'/_deploy' \
@@ -111,11 +115,13 @@ Building a sig-v4 signed request requires a session token, access key, and secre
 --user "${METADATA_AWS_ACCESS_KEY_ID}:${METADATA_AWS_SECRET_ACCESS_KEY}"
 ```
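Deployment runs asynchronously, so before moving on you may want to confirm the model actually reached a deployed state. Here is a sketch using the ML Commons get-model API, with the same signed-request flags assumed as in the sketch above; look for `"model_state": "DEPLOYED"` in the response.

```bash
# Fetch the model's metadata and check its model_state field.
curl --request GET \
  ${OPENSEARCH_ENDPOINT}'/_plugins/_ml/models/'${MODEL_ID} \
  --aws-sigv4 "aws:amz:us-west-2:es" \
  --header "x-amz-security-token: ${METADATA_AWS_SESSION_TOKEN}" \
  --user "${METADATA_AWS_ACCESS_KEY_ID}:${METADATA_AWS_SECRET_ACCESS_KEY}"
```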
 
-With the model created, OpenSearch can now use Bedrock's Titan embedding model for processing text. An embeddings model is a type of machine learning model that transforms high-dimensional data (like text or images) into lower-dimensional vectors, known as embeddings. These vectors capture the semantic or contextual relationships between the data points in a more compact, dense representation.
+With the model created, **OpenSearch can now use Bedrock's Titan embedding model** for processing text.
 
-The embeddings represent the semantic meaning of the input data, in this case product descriptions. Words with similar meanings are represented by vectors that are close to each other in the vector space. For example, the vectors for "sturdy" and "strong" would be closer to each other than to "warm".
+**An embeddings model** is a type of machine learning model that transforms high-dimensional data (like text or images) into lower-dimensional vectors, known as embeddings. These vectors capture the semantic or contextual relationships between the data points in a more compact, dense representation.
 
-1. Now we can test the model. If you recieve results back with a "200" status code, everything is working properly.
+The embeddings represent the semantic meaning of the input data, in this case product descriptions. Words with similar meanings are represented by vectors that are close to each other in the vector space. For example, the vectors for "sturdy" and "strong" would be closer to each other than to "stringy".
+
+1. Now we can **test the model**. With the command below, we send some text to OpenSearch and ask it to return the vector embeddings using the configured "MODEL_ID". If you receive results back with a "200" status code, everything is working properly.
 ```bash
 curl --request POST \
 ${OPENSEARCH_ENDPOINT}'/_plugins/_ml/models/'${MODEL_ID}'/_predict' \
@@ -130,7 +136,9 @@ Building a sig-v4 signed request requires a session token, access key, and secre
 }
 }'
 ```
-1. Next, we'll create the Details table mapping pipeline.
+::alert[_The output also contains the vector embeddings, so look for the status code field to confirm the call succeeded._]
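Because a successful response is dominated by the embedding vector itself, it can be hard to spot the status by eye. As an alternative check, you could print only the HTTP response code. This is a sketch: it reports the HTTP code rather than the status_code field inside the JSON body, assumes the same signed-request flags as the sketch above, and uses Titan's `inputText` parameter as an assumed request body; match whatever body the predict command in the step above uses.

```bash
# Re-run the predict call but print only the HTTP status code ("200" on success).
curl --silent --output /dev/null --write-out "%{http_code}\n" \
  --request POST \
  ${OPENSEARCH_ENDPOINT}'/_plugins/_ml/models/'${MODEL_ID}'/_predict' \
  --aws-sigv4 "aws:amz:us-west-2:es" \
  --header "x-amz-security-token: ${METADATA_AWS_SESSION_TOKEN}" \
  --header 'Content-Type: application/json' \
  --data '{"parameters": {"inputText": "Test sentence for the embedding model"}}' \
  --user "${METADATA_AWS_ACCESS_KEY_ID}:${METADATA_AWS_SECRET_ACCESS_KEY}"
```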
+
+1. Next, we'll create the **ProductDetails table mapping ingest pipeline**. An **ingest pipeline** is a sequence of processors that are applied to documents as they are ingested into an index. This pipeline uses the configured model to generate the embeddings. Once it is created, as new data arrives in OpenSearch from the DynamoDB "ProductDetails" table, the embeddings will be created and indexed.
 ```bash
 curl --request PUT \
 ${OPENSEARCH_ENDPOINT}'/_ingest/pipeline/product-en-nlp-ingest-pipeline' \
@@ -158,7 +166,8 @@ Building a sig-v4 signed request requires a session token, access key, and secre
 ]
 }'
 ```
-1. Followed by the Reviews table mapping pipeline. We won't use this in this version of the lab, but in a real system you will want to keep your embeddings indexes separate for different queries.
+::alert[_This creates the processor that takes the source text and generates an embedding, which is stored under 'product_embedding'._]
+1. Followed by the **Reviews table mapping pipeline**. We won't use this in this version of the lab, but in a real system you will want to keep your embeddings indexes separate for different queries. Note the different pipeline path in the endpoint.
 ```bash
 curl --request PUT \
 ${OPENSEARCH_ENDPOINT}'/_ingest/pipeline/product-reviews-nlp-ingest-pipeline' \
@@ -177,7 +186,7 @@ Building a sig-v4 signed request requires a session token, access key, and secre
 },
 {
 "text_embedding": {
-"model_id": "m6jIgowBXLzE-9O0CcNs",
+"model_id": "'${MODEL_ID}'",
 "field_map": {
 "combined_field": "product_reviews_embedding"
 }
@@ -187,4 +196,4 @@ Building a sig-v4 signed request requires a session token, access key, and secre
 }'
 ```
 
-These pipelines allow OpenSearch to preprocess and enrich data as it is written to the index by adding embeddings through the Bedrock connector.
+**These pipelines allow OpenSearch to preprocess and enrich data as it is written to the index by adding embeddings through the Bedrock connector**.
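Before moving on to the zero-ETL setup, you can confirm both ingest pipelines were registered by fetching them back from OpenSearch. This is a sketch using the standard get-ingest-pipeline API, with the same signed-request flags assumed as above; a 404 here means the corresponding PUT did not succeed.

```bash
# Fetch both pipelines created above by name.
curl --request GET \
  ${OPENSEARCH_ENDPOINT}'/_ingest/pipeline/product-en-nlp-ingest-pipeline,product-reviews-nlp-ingest-pipeline' \
  --aws-sigv4 "aws:amz:us-west-2:es" \
  --header "x-amz-security-token: ${METADATA_AWS_SESSION_TOKEN}" \
  --user "${METADATA_AWS_ACCESS_KEY_ID}:${METADATA_AWS_SECRET_ACCESS_KEY}"
```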

content/dynamodb-opensearch-zetl/integrations/zetl.en.md

Lines changed: 80 additions & 96 deletions
Original file line numberDiff line numberDiff line change
@@ -6,112 +6,96 @@ weight: 30
 ---
 Amazon DynamoDB offers a zero-ETL integration with Amazon OpenSearch Service through the DynamoDB plugin for OpenSearch Ingestion. Amazon OpenSearch Ingestion offers a fully managed, no-code experience for ingesting data into Amazon OpenSearch Service.
 
-1. Open [OpenSearch Service Ingestion Pipelines](https://us-west-2.console.aws.amazon.com/aos/home?region=us-west-2#opensearch/ingestion-pipelines)
-1. Click "Create pipeline"
-
-![Create pipeline](/static/images/ddb-os-zetl13.jpg)
-
-1. Name your pipeline, and include the following for your pipeline configuration. The configuration contains multiple values that need to be updated. The needed values are provided in the CloudFormation Stack Outputs as "Region", "Role", "S3Bucket", "DdbTableArn", and "OSDomainEndpoint".
-```yaml
-version: "2"
-dynamodb-pipeline:
-source:
-dynamodb:
-acknowledgments: true
-tables:
-# REQUIRED: Supply the DynamoDB table ARN
-- table_arn: "{DDB_TABLE_ARN}"
-stream:
-start_position: "LATEST"
-export:
-# REQUIRED: Specify the name of an existing S3 bucket for DynamoDB to write export data files to
-s3_bucket: "{S3BUCKET}"
-# REQUIRED: Specify the region of the S3 bucket
-s3_region: "{REGION}"
-# Optionally set the name of a prefix that DynamoDB export data files are written to in the bucket.
-s3_prefix: "pipeline"
-aws:
-# REQUIRED: Provide the role to assume that has the necessary permissions to DynamoDB, OpenSearch, and S3.
-sts_role_arn: "{ROLE}"
-# REQUIRED: Provide the region
-region: "{REGION}"
-sink:
-- opensearch:
-hosts:
-# REQUIRED: Provide an AWS OpenSearch endpoint, including https://
-[
-"{OS_DOMAIN_ENDPOINT}"
-]
-index: "product-details-index-en"
-index_type: custom
-template_type: "index-template"
-template_content: |
-{
-"template": {
-"settings": {
-"index.knn": true,
-"default_pipeline": "product-en-nlp-ingest-pipeline"
-},
-"mappings": {
-"properties": {
-"ProductID": {
-"type": "keyword"
-},
-"ProductName": {
-"type": "text"
-},
-"Category": {
-"type": "text"
-},
-"Description": {
-"type": "text"
-},
-"Image": {
-"type": "text"
-},
-"combined_field": {
-"type": "text"
-},
-"product_embedding": {
-"type": "knn_vector",
-"dimension": 1536,
-"method": {
-"engine": "nmslib",
-"name": "hnsw",
-"space_type": "l2"
-}
-}
-}
-}
-}
-}
-aws:
-# REQUIRED: Provide the role to assume that has the necessary permissions to DynamoDB, OpenSearch, and S3.
-sts_role_arn: "{ROLE}"
-# REQUIRED: Provide the region
-region: "{REGION}"
-```
-1. Under Network, select "Public access", then click "Next".
-
-![Create pipeline](/static/images/ddb-os-zetl14.jpg)
-
-1. Click "Create pipeline".
+Please follow these steps to set up zero-ETL. Here we use the AWS Console instead of curl commands:
+
+1. Open [OpenSearch Service](https://us-west-2.console.aws.amazon.com/aos/home?region=us-west-2#opensearch) within the Console.
+
+2. Select **Pipelines** from the left pane and click on **"Create pipeline"**.
+![Create pipeline](/static/images/ddb-os-zetl13.jpg)
+
+3. Select **"Blank"** from the Ingestion pipeline blueprints.
+![Blueprint selection](/static/images/CreatePipeline.png)
+
+4. Configure the source by selecting **"Amazon DynamoDB"** as the source and filling in the details as shown below. Once done, click "Next".
+![Configure source](/static/images/configure_source.png)
+
+5. Skip the **Processor** configuration.
+
+![Skip processor](/static/images/processor_blank.png)
+
+6. Configure the sink by filling in the OpenSearch details as shown below:
+![Configure Sink](/static/images/configure_sink.png)
+
+7. Use the following content under **Schema mapping**:
+
+```json
+{
+"template": {
+"settings": {
+"index.knn": true,
+"default_pipeline": "product-en-nlp-ingest-pipeline"
+},
+"mappings": {
+"properties": {
+"ProductID": {
+"type": "keyword"
+},
+"ProductName": {
+"type": "text"
+},
+"Category": {
+"type": "text"
+},
+"Description": {
+"type": "text"
+},
+"Image": {
+"type": "text"
+},
+"combined_field": {
+"type": "text"
+},
+"product_embedding": {
+"type": "knn_vector",
+"dimension": 1536
+}
+}
+}
+}
+}
+```
+
+Once done, click on **"Next"**.
+
+8. Configure the pipeline and then click "Next".
+
+![Configure pipeline](/static/images/ddb-os-zetl14.jpg)
+
+
+9. Click "Create pipeline".
 
 ![Create pipeline](/static/images/ddb-os-zetl15.jpg)
 
-1. **Wait until the pipeline has finished creating**. This will take 5 minutes or more.
+10. **Wait until the pipeline has finished creating and its status is "Active"**. This will take 5 minutes or more.
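If you prefer to poll from the terminal instead of refreshing the console, the OpenSearch Ingestion CLI can report the pipeline status. This is a sketch, assuming the AWS CLI is available and that you named the pipeline `dynamodb-pipeline` (substitute whatever name you chose):

```bash
# Print the current status of the ingestion pipeline (e.g. CREATING, ACTIVE).
aws osis get-pipeline \
  --pipeline-name dynamodb-pipeline \
  --region us-west-2 \
  --query 'Pipeline.Status' \
  --output text
```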
 
 
-After the pipeline is created, it will take some additional time for the initial export from DynamoDB and import into OpenSearch Service. After you have waited several more minutes, you can check if items have replicated into OpenSearch by making a query in Dev Tools in the OpenSearch Dashboards.
+After the pipeline is created, it will take some additional time for the initial export from DynamoDB and import into OpenSearch Service. After you have waited several more minutes, you can check if items have replicated into OpenSearch by making a query using the OpenSearch Dashboards feature called Dev Tools.
 
-To open Dev Tools, click on the menu in the top left of OpenSearch Dashboards, scroll down to the `Management` section, then click on `Dev Tools`. Enter the following query in the left pane, then click the "play" arrow.
+- To open Dev Tools, click on the menu in the top left of OpenSearch Dashboards, scroll down to the `Management` section, then click on `Dev Tools`.
+
+![Devtools](/static/images/Devtools.png)
+
+- Enter the following query in the left pane, then click the "play" arrow to execute it.
 
 ```text
 GET /product-details-index-en/_search
 ```
-You may encounter a few types of results:
-- If you see a 404 error of type *index_not_found_exception*, then you need to wait until the pipeline is `Active`. Once it is, this exception will go away.
-- If your query does not have results, wait a few more minutes for the initial replication to finish and try again.
+
+- The output will be the list of documents containing all of the fields defined in the zero-ETL pipeline mapping.
+
+You may encounter a few types of results:
+- If you see a 404 error of type *index_not_found_exception*, then you need to wait until the pipeline is `Active`. Once it is, this exception will go away.
+- If your query does not have results, wait a few more minutes for the initial replication to finish and try again.
 
 ![Create pipeline](/static/images/ddb-os-zetl16.jpg)
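If you would rather verify from the terminal than from Dev Tools, the same search can be issued as a signed request from the shell session used in the previous section. This is a sketch, assuming OPENSEARCH_ENDPOINT and the METADATA_AWS_* variables are still exported and that curl has `--aws-sigv4` support.

```bash
# Search the index created by the zero-ETL pipeline. An empty hit list means the
# initial export/import has not finished; a 404 means the pipeline is not Active yet.
curl --request GET \
  ${OPENSEARCH_ENDPOINT}'/product-details-index-en/_search' \
  --aws-sigv4 "aws:amz:us-west-2:es" \
  --header "x-amz-security-token: ${METADATA_AWS_SESSION_TOKEN}" \
  --user "${METADATA_AWS_ACCESS_KEY_ID}:${METADATA_AWS_SECRET_ACCESS_KEY}"
```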
