65 changes: 32 additions & 33 deletions notebooks/classifier.ipynb
@@ -7,9 +7,9 @@
"# Azure AI Content Understanding - Classifier and Analyzer Demo\n",
"\n",
"This notebook demonstrates how to use the Azure AI Content Understanding service to:\n",
"1. Create a classifier for document categorization\n",
"2. Create a custom analyzer to extract specific fields\n",
"3. Combine the classifier and analyzers to classify, optionally split, and analyze documents within a flexible processing pipeline\n",
"1. Create a classifier for document categorization.\n",
"2. Create a custom analyzer to extract specific fields.\n",
"3. Combine the classifier and analyzers to classify, optionally split, and analyze documents within a flexible processing pipeline.\n",
"\n",
Collaborator Author

  • categories: [Grammar, Consistency]
    • change: Added periods at the end of numbered list items.
    • rationale: To conform to standard grammatical rules for complete sentences and ensure consistency across list entries.
    • impact: Enhances the professionalism and readability of the documentation by presenting a polished and uniform list format.

"For more detailed information before getting started, please refer to the official documentation:\n",
"[Understanding Classifiers in Azure AI Services](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/classifier)\n",
@@ -36,7 +36,7 @@
"\n",
"> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that provides functions to interact with the Content Understanding API. Prior to the official release of the Content Understanding SDK, it serves as a lightweight SDK.\n",
">\n",
 Fill in the constants">
-"> Fill in the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with the details from your Azure AI Service.\n",
+"> Please fill in the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with the details from your Azure AI Service.\n",
"\n",
Collaborator Author

  • categories: [Grammar, Clarity]
    • change: Changed the sentence from "> Fill in the constants ..." to "> Please fill in the constants ..."
    • rationale: Adding "Please" makes the instruction more polite and clearer as a request rather than a command.
    • impact: Improves the tone and readability of the documentation, making it more user-friendly.
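
As a minimal sketch of what "filling in the constants" could look like, the three values might be read from environment variables instead of being hard-coded. The helper name `load_settings` and the fail-fast behavior are assumptions for illustration; only the three constant names come from the notebook text in this hunk.

```python
import os

# Constant names taken from the notebook text; everything else is illustrative.
REQUIRED = ("AZURE_AI_ENDPOINT", "AZURE_AI_API_VERSION", "AZURE_AI_API_KEY")


def load_settings(env=None):
    """Return the three constants from `env`, failing fast if any is missing or blank."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name, "").strip()]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))
    return {name: env[name].strip() for name in REQUIRED}
```

If a different authentication method is used, as the notebook's warning suggests, the API key would likely drop out of the required list.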

"> ⚠️ Important:\n",
"You must update the code below to use your preferred Azure authentication method.\n",
@@ -110,20 +110,20 @@
"source": [
"## Configure Model Deployments for Prebuilt Analyzers\n",
"\n",
 **💡 Note:** This step is only">
-"> **💡 Note:** This step is only required **once per Azure Content Understanding resource**, unless the GPT deployment has been changed. You can skip this section if:\n",
-"> - This configuration has already been run once for your resource, or\n",
-"> - Your administrator has already configured the model deployments for you\n",
+"> **💡 Note:** This step is required **only once per Azure Content Understanding resource**, unless the GPT deployment has been changed. You may skip this section if:\n",
+"> - This configuration has already been completed for your resource, or\n",
+"> - Your administrator has already set up the model deployments for you.\n",
"\n",
Collaborator Author

  • categories: [Grammar, Clarity]

    • change: Changed "only required once per Azure Content Understanding resource" to "required only once per Azure Content Understanding resource"
    • rationale: Corrected word order for smoother, more natural English phrasing.
    • impact: Improves sentence readability and comprehension.
  • categories: [Clarity, Consistency]

    • change: Replaced "You can skip this section if:" with "You may skip this section if:"
    • rationale: "May" implies permission more clearly and is slightly more formal and consistent for documentation tone.
    • impact: Enhances the professional tone and clarity of instructions.
  • categories: [Clarity, Grammar]

    • change: Changed "- This configuration has already been run once for your resource," to "- This configuration has already been completed for your resource,"
    • rationale: "Completed" is a clearer and more standard term when referring to finishing a configuration step.
    • impact: Reduces potential ambiguity, making instructions easier to understand.
  • categories: [Grammar, Consistency]

    • change: Revised "- Your administrator has already configured the model deployments for you" to "- Your administrator has already set up the model deployments for you." (added a period)
    • rationale: "Set up" is a more common and clear phrase for initial configuration; added punctuation for sentence completeness.
    • impact: Improves clarity and ensures consistent formatting across list items.

"Before using prebuilt analyzers, you need to configure the default model deployment mappings. This tells Content Understanding which model deployments to use.\n",
"\n",
"**Model Requirements:**\n",
"- **GPT-4.1** - Required for most prebuilt analyzers (e.g., `prebuilt-invoice`, `prebuilt-receipt`, `prebuilt-idDocument`)\n",
"- **GPT-4.1-mini** - Required for RAG analyzers (e.g., `prebuilt-documentSearch`, `prebuilt-audioSearch`, `prebuilt-videoSearch`)\n",
"- **text-embedding-3-large** - Required for all prebuilt analyzers that use embeddings\n",
"- **GPT-4.1** - Required for most prebuilt analyzers (e.g., `prebuilt-invoice`, `prebuilt-receipt`, `prebuilt-idDocument`).\n",
"- **GPT-4.1-mini** - Required for RAG analyzers (e.g., `prebuilt-documentSearch`, `prebuilt-audioSearch`, `prebuilt-videoSearch`).\n",
"- **text-embedding-3-large** - Required for all prebuilt analyzers that use embeddings.\n",
"\n",
"**Prerequisites:**\n",
"1. Deploy **GPT-4.1**, **GPT-4.1-mini**, and **text-embedding-3-large** models in Azure AI Foundry\n",
"2. Set `GPT_4_1_DEPLOYMENT`, `GPT_4_1_MINI_DEPLOYMENT`, and `TEXT_EMBEDDING_3_LARGE_DEPLOYMENT` in your `.env` file with the deployment names"
"1. Deploy **GPT-4.1**, **GPT-4.1-mini**, and **text-embedding-3-large** models in Azure AI Foundry.\n",
"2. Set `GPT_4_1_DEPLOYMENT`, `GPT_4_1_MINI_DEPLOYMENT`, and `TEXT_EMBEDDING_3_LARGE_DEPLOYMENT` in your `.env` file with the deployment names."
]
Collaborator Author

  • categories: [Grammar, Consistency]

    • change: Added periods at the end of list items in bulleted points describing model requirements.
    • rationale: To ensure proper sentence punctuation and maintain consistency across list items.
    • impact: Improves readability and professionalism of the documentation by adhering to standard grammar rules.
  • categories: [Grammar, Consistency]

    • change: Added periods at the end of numbered prerequisite list items.
    • rationale: To maintain uniform punctuation and formal writing style within lists.
    • impact: Enhances clarity and consistency, making the instructions easier to follow and visually aligned.
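
The second prerequisite can be checked before any service call is made. The sketch below assumes the three variable names from the `.env` instructions above; the helper `missing_deployments` is hypothetical and is not the notebook's own code.

```python
import os

# Environment variable name -> model it should point at (names from the notebook text).
DEPLOYMENT_VARS = {
    "GPT_4_1_DEPLOYMENT": "GPT-4.1",
    "GPT_4_1_MINI_DEPLOYMENT": "GPT-4.1-mini",
    "TEXT_EMBEDDING_3_LARGE_DEPLOYMENT": "text-embedding-3-large",
}


def missing_deployments(env=None):
    """Return the deployment variables that are unset or empty, in declaration order."""
    env = os.environ if env is None else env
    return [name for name in DEPLOYMENT_VARS if not env.get(name)]
```

An empty list means all three deployment names are present; it does not confirm that the deployments actually exist in Azure AI Foundry.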

},
{
@@ -152,12 +152,12 @@
" print(f\" - {deployment}\")\n",
" print(\"\\n Prebuilt analyzers require GPT-4.1, GPT-4.1-mini, and text-embedding-3-large deployments.\")\n",
" print(\" Please:\")\n",
" print(\" 1. Deploy all three models in Azure AI Foundry\")\n",
" print(\" 1. Deploy all three models in Azure AI Foundry.\")\n",
" print(\" 2. Add the following to notebooks/.env:\")\n",
Collaborator Author

  • categories: [Grammar]
    • change: Added a period at the end of the printed sentence "Deploy all three models in Azure AI Foundry."
    • rationale: To complete the sentence properly with correct punctuation.
    • impact: Enhances the professionalism and readability of the printed message in the code.

" print(\" GPT_4_1_DEPLOYMENT=<your-gpt-4.1-deployment-name>\")\n",
" print(\" GPT_4_1_MINI_DEPLOYMENT=<your-gpt-4.1-mini-deployment-name>\")\n",
" print(\" TEXT_EMBEDDING_3_LARGE_DEPLOYMENT=<your-text-embedding-3-large-deployment-name>\")\n",
" print(\" 3. Restart the kernel and run this cell again\")\n",
" print(\" 3. Restart the kernel and run this cell again.\")\n",
"else:\n",
Collaborator Author

  • categories: [Grammar]
    • change: Added a period at the end of the printed instruction "Restart the kernel and run this cell again."
    • rationale: The addition of the period completes the sentence grammatically, making it a proper instruction.
    • impact: Enhances the professionalism and readability of the output message by adhering to proper sentence punctuation.

" print(f\"📋 Configuring default model deployments...\")\n",
" print(f\" GPT-4.1 deployment: {GPT_4_1_DEPLOYMENT}\")\n",
@@ -179,8 +179,8 @@
" except Exception as e:\n",
" print(f\"❌ Failed to configure defaults: {e}\")\n",
" print(f\" This may happen if:\")\n",
" print(f\" - One or more deployment names don't exist in your Azure AI Foundry project\")\n",
" print(f\" - You don't have permission to update defaults\")\n",
" print(f\" - One or more deployment names don't exist in your Azure AI Foundry project.\")\n",
" print(f\" - You don't have permission to update defaults.\")\n",
" raise\n"
Collaborator Author

  • categories: [Grammar, Consistency]
    • change: Added periods at the end of the two printed sentences.
    • rationale: The original messages lacked terminal punctuation; adding periods improves grammatical completeness and maintains consistency in message formatting.
    • impact: Enhances the professionalism and readability of the printed error messages, making them clearer to the user.
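
The failure path in this hunk prints likely causes and then re-raises. A minimal sketch of that pattern in isolation follows, assuming a hypothetical callable `update_defaults` that stands in for whatever client call the notebook makes (its real name and signature are not visible in this diff).

```python
def configure_with_hints(update_defaults, deployments):
    """Call `update_defaults(deployments)`; on failure, print hints and re-raise.

    `update_defaults` is a hypothetical stand-in for the real client call.
    """
    try:
        return update_defaults(deployments)
    except Exception as error:
        print(f"Failed to configure defaults: {error}")
        print("   This may happen if:")
        print("   - One or more deployment names don't exist in your Azure AI Foundry project.")
        print("   - You don't have permission to update defaults.")
        raise  # bare raise preserves the original exception and traceback
```

The bare `raise` matters here: the hints help the reader, while the original exception still reaches the caller unchanged.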

]
},
@@ -189,19 +189,19 @@
"metadata": {},
"source": [
"## Create a Basic Classifier\n",
"Classify document from URL using begin_classify API.\n",
"Classify a document from a URL using the `begin_classify` API.\n",
"\n",
"High-level steps:\n",
"1. Create a custom classifier\n",
"2. Classify a document from a remote URL\n",
"3. Save the classification result to a file\n",
"4. Clean up the created classifier\n",
"1. Create a custom classifier.\n",
"2. Classify a document from a remote URL.\n",
"3. Save the classification result to a file.\n",
"4. Clean up the created classifier.\n",
"\n",
"In Azure AI Content Understanding, classification is integrated directly into the analyzer operation rather than requiring a separate API. To create a classifier, you define **`contentCategories`** within the analyzer's configuration, specifying up to 200 category names and descriptions that the service will use to categorize your input files. \n",
"In Azure AI Content Understanding, classification is integrated directly into the analyzer operation rather than requiring a separate API. To create a classifier, you define **`contentCategories`** within the analyzer's configuration, specifying up to 200 category names and descriptions that the service will use to categorize your input files.\n",
"\n",
"The **`enableSegment`** parameter controls how the classifier handles multi-document files: when set to `true`, it automatically splits and classifies different document types within a single file (useful for processing combined documents like a loan application package containing multiple forms), while setting it to `false` treats the entire file as a single document. \n",
"The **`enableSegment`** parameter controls how the classifier handles multi-document files: when set to `true`, it automatically splits and classifies different document types within a single file (useful for processing combined documents like a loan application package containing multiple forms). When set to `false`, it treats the entire file as a single document.\n",
"\n",
"For more detailed information about classification capabilities, best practices, and advanced scenarios, see the [Content Understanding classification documentation](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/classifier)."
"For more detailed information about classification capabilities, best practices, and advanced scenarios, please see the [Content Understanding classification documentation](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/classifier)."
]
Collaborator Author

  • categories: [Clarity, Grammar]

    • change: Changed the phrase "Classify document from URL using begin_classify API." to "Classify a document from a URL using the `begin_classify` API."
    • rationale: Added the missing articles ("a", "the") and wrapped the API name in backticks to improve readability and highlight the API reference.
    • impact: Improves clarity and makes the API name visually distinct for readers.
  • categories: [Consistency, Grammar]

    • change: Added periods at the end of each numbered high-level step.
    • rationale: Ensures sentence completeness and maintains consistent punctuation across the list items.
    • impact: Enhances professionalism and readability of the documentation.
  • categories: [Formatting]

    • change: Removed trailing spaces after sentences in the paragraph about classification integration.
    • rationale: Eliminates unnecessary whitespace for cleaner formatting.
    • impact: Produces a more polished and consistent document format.
  • categories: [Clarity, Grammar]

    • change: Replaced a long sentence about the enableSegment parameter with two shorter, clearer sentences.
    • rationale: Splitting the sentence improves readability and better explains the parameter’s behavior in both cases.
    • impact: Makes it easier for users to understand the function of the enableSegment parameter.
  • categories: [Clarity, Tone]

    • change: Added "please" before "see the [Content Understanding classification documentation]" in the final sentence.
    • rationale: Makes the reference to additional resources more polite and inviting.
    • impact: Enhances user engagement and maintains a friendly tone in the documentation.
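
To make the `contentCategories` and `enableSegment` description concrete, here is a minimal sketch of a classifier definition built as a plain dictionary. The placement of both keys under `"config"`, the per-category `"description"` field, and the helper name `build_classifier` are assumptions based on the prose in this hunk, not a confirmed schema; `prebuilt-document` is the base analyzer ID used elsewhere in the notebook.

```python
MAX_CATEGORIES = 200  # upper bound quoted in the notebook text


def build_classifier(categories, enable_segment=False):
    """Build an analyzer definition that classifies input into `categories`.

    `categories` maps a category name to its description. The dictionary
    layout is an assumed shape, not a verified service schema.
    """
    if not categories:
        raise ValueError("At least one category is required.")
    if len(categories) > MAX_CATEGORIES:
        raise ValueError(
            f"At most {MAX_CATEGORIES} categories are supported, got {len(categories)}."
        )
    return {
        "baseAnalyzerId": "prebuilt-document",
        "description": "Basic document classifier",
        "config": {
            # True: split a combined file and classify each part.
            # False: treat the whole file as one document.
            "enableSegment": enable_segment,
            "contentCategories": {
                name: {"description": description}
                for name, description in categories.items()
            },
        },
    }
```

For a combined loan package, `build_classifier({...}, enable_segment=True)` would express the splitting behavior described above.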

},
{
@@ -319,9 +319,9 @@
" print(f\" Segment ID: {segment.get('segmentId', 'N/A')}\")\n",
" print(\"=\" * 50)\n",
" else:\n",
" print(\"No contents available in analysis result\")\n",
" print(\"No contents available in analysis result.\")\n",
"else:\n",
" print(\"No analysis result available\")"
" print(\"No analysis result available.\")"
]
Collaborator Author

  • categories: [Grammar, Consistency]
    • change: Added a period at the end of the printed messages "No contents available in analysis result" and "No analysis result available".
    • rationale: Completing the sentences with a period improves grammatical correctness and ensures consistency in message formatting.
    • impact: Enhances the professionalism and readability of the output messages, making them clearer and more polished.
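
The two fallback messages in this hunk guard a nested result structure. A minimal sketch of that guard as a pure function is shown below. The key names `contents`, `segments`, and `category` are assumptions about the result shape; only `segmentId` and the two messages appear in the diff.

```python
def describe_contents(analysis_result):
    """Return one printable line per segment, or a single fallback message."""
    if not analysis_result:
        return ["No analysis result available."]
    contents = analysis_result.get("contents") or []
    if not contents:
        return ["No contents available in analysis result."]
    lines = []
    for content in contents:
        # A content entry without segments simply contributes no lines.
        for segment in content.get("segments") or []:
            category = segment.get("category", "N/A")
            segment_id = segment.get("segmentId", "N/A")
            lines.append(f"Category: {category} | Segment ID: {segment_id}")
    return lines
```

Returning lines instead of printing keeps the logic easy to test; the notebook cell can still print each line in a loop.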

},
{
@@ -348,7 +348,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean up the created analyzer \n",
"## Clean up the created analyzer\n",
"After the demo completes, the classifier is automatically deleted to prevent resource accumulation."
Collaborator Author

  • categories: [Formatting]
    • change: Removed the extra space before the newline character at the end of the heading line "## Clean up the created analyzer"
    • rationale: Eliminating trailing whitespace ensures consistent formatting and adheres to best practices in code and documentation formatting.
    • impact: Improves the cleanliness and professionalism of the documentation by avoiding unnecessary trailing spaces.
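
Trailing whitespace like this can be found mechanically. The sketch below scans the markdown cells of a notebook (nbformat 4 stores `cells`, each with a `cell_type` and a `source` that is either a string or a list of strings). The function name is hypothetical, and the check is a reviewer's aid, not part of this PR.

```python
import json


def trailing_whitespace_lines(notebook_text):
    """Return (cell_index, line_index) pairs for markdown lines ending in spaces or tabs."""
    notebook = json.loads(notebook_text)
    hits = []
    for cell_index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", [])
        if isinstance(source, str):
            source = source.splitlines(keepends=True)
        for line_index, line in enumerate(source):
            body = line.rstrip("\n")
            if body != body.rstrip(" \t"):
                hits.append((cell_index, line_index))
    return hits
```

In Markdown, two or more trailing spaces before a line break request a hard line break, so the output is a list of candidates to review, not lines that are always safe to strip.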

]
},
@@ -386,7 +386,7 @@
"# Define custom analyzer as a dictionary\n",
"custom_analyzer = {\n",
" \"baseAnalyzerId\": \"prebuilt-document\",\n",
" \"description\": \"Loan application analyzer - extracts key information from loan applications\",\n",
" \"description\": \"Loan application analyzer - extracts key information from loan applications.\",\n",
" \"config\": {\n",
Collaborator Author

  • categories: [Grammar]
    • change: Added a period at the end of the description sentence.
    • rationale: The sentence lacked proper punctuation, and adding a period completes the sentence grammatically.
    • impact: Improves professionalism and readability of the description.

" \"returnDetails\": True,\n",
" \"enableLayout\": True,\n",
@@ -447,7 +447,7 @@
"source": [
"## Create an Enhanced Classifier with Custom Analyzer\n",
"\n",
"Now create a new classifier that uses the prebuilt invoice analyzer for invoices and the custom analyzer for loan application documents.\n",
"Now, create a new classifier that uses the prebuilt invoice analyzer for invoices and the custom analyzer for loan application documents.\n",
"This combines document classification with field extraction in one operation."
Collaborator Author

  • categories: [Grammar]
    • change: Added a comma after "Now" in the sentence.
    • rationale: The introductory word "Now" should be followed by a comma for correct punctuation and readability.
    • impact: Improves the grammatical correctness and flow of the documentation, making it easier to read and understand.
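
The routing this cell describes (invoices to the prebuilt invoice analyzer, loan applications to the custom analyzer) can be sketched as a category table in which each entry optionally names an analyzer. The `analyzerId` key, the helper name, and the custom analyzer ID used in the usage note are assumptions for illustration; `prebuilt-invoice` is the ID named earlier in the notebook.

```python
def build_routed_categories(routes):
    """Build a category table from `routes`.

    `routes` maps a category name to a (description, analyzer_id) pair.
    An `analyzer_id` of None means "classify only, do not extract fields".
    The "analyzerId" key is an assumed field name, not a verified schema.
    """
    categories = {}
    for name, (description, analyzer_id) in routes.items():
        entry = {"description": description}
        if analyzer_id is not None:
            entry["analyzerId"] = analyzer_id
        categories[name] = entry
    return categories
```

A call such as `build_routed_categories({"Invoice": ("Billing documents", "prebuilt-invoice"), "Loan application": ("Loan forms", "my-loan-analyzer")})` would express the combination described above, with `my-loan-analyzer` standing in for whatever ID the custom analyzer was created under.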

]
},
@@ -573,7 +573,6 @@
" else:\n",
" print(f\" (No custom fields extracted for this category)\")\n",
" \n",
" \n",
" print(\"\\n\" + \"=\" * 80)\n",
" \n",
" # Display document information for the first segment\n",
@@ -586,9 +585,9 @@
" unit = first_content.get(\"unit\", \"units\")\n",
" print(f\"Page dimensions: {pages[0].get('width')} x {pages[0].get('height')} {unit}\")\n",
" else:\n",
" print(\"No contents available in enhanced analysis result\")\n",
" print(\"No contents available in enhanced analysis result.\")\n",
"else:\n",
" print(\"No enhanced analysis result available\")"
" print(\"No enhanced analysis result available.\")"
]
Collaborator Author

  • categories: [Grammar, Consistency]
    • change: Added missing periods at the end of print statement strings.
    • rationale: To ensure proper sentence punctuation and maintain consistency in messaging style.
    • impact: Enhances readability and professionalism of output messages, providing a consistent user experience.

},
{
@@ -673,4 +672,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
-}
\ No newline at end of file
+}
Collaborator Author

  • categories: [Formatting]
    • change: Added the missing newline at the end of the file, after the closing brace of the notebook JSON.
    • rationale: Ensuring proper formatting and adherence to coding style guidelines that typically require a newline at the end of files or code blocks.
    • impact: Improves readability and prevents potential issues with tools or compilers that expect a final newline.
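
For reference, the end-of-file fix amounts to the normalization sketched below. The function name is hypothetical, and many editors and formatters apply the same rule automatically on save.

```python
def normalize_final_newline(text):
    """Return `text` ending in exactly one newline character."""
    return text.rstrip("\n") + "\n"
```

Applied to a notebook whose last character is the closing `}`, this appends the single newline that the hunk above adds.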
