- Introduction
- Key Features
- Frameworks
- Pipeline Flowchart
- Setup Instructions
- Google Sheets Integration
- Setting up Environment Variables
- Dashboard UI
This is an end-to-end tool that allows users to automate data retrieval from the web, preprocess and filter results, scrape content, extract relevant context, and structure the data in a user-friendly format. The dashboard integrates various AI-powered and web scraping capabilities, and it allows users to define custom search queries to retrieve the most relevant data from online sources.
- Web Scraping and Filtering: Automates Google search, URL filtering, and content scraping.
- Webpage Parsing: Parses both HTML and PDF content to extract relevant context.
- Contextual Data Retrieval: Uses embeddings to retrieve and structure relevant data.
- Asynchronous Processing: Improves efficiency for large datasets.
- Google Sheets Integration: Supports importing spreadsheets from Google Sheets.
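As an illustration of the asynchronous processing feature, the following is a minimal sketch of running many scrape calls concurrently with a concurrency cap. It is not the project's actual code: `process_all`, `fake_fetch`, and `max_concurrency` are hypothetical names, and `fake_fetch` only simulates a scraper.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def process_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> dict[str, Optional[str]]:
    """Run `fetch` over every URL concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> tuple[str, Optional[str]]:
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception:
                # One failing page should not abort the whole batch.
                return url, None

    pairs = await asyncio.gather(*(guarded(u) for u in urls))
    return dict(pairs)


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real scraper; sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError("simulated scrape failure")
    return f"content of {url}"


if __name__ == "__main__":
    urls = ["https://a.example", "https://bad.example"]
    print(asyncio.run(process_all(urls, fake_fetch)))
```

The semaphore keeps a large input from opening hundreds of connections at once, and failed URLs map to `None` so the rest of the batch still completes.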
- Backend: Python
- API: FastAPI
- UI: Streamlit
- Data Handling: Pandas
- Google Sheets Integration: gspread
- Web Search: Custom Google Search Module
- Web Scraping: Beautiful Soup, PyPDF2, Newspaper4k (Newspaper4k yields better scraped data but takes more time)
- LLM: OpenAI (gpt-4o-mini) (use "gpt-4o" for better consistency)
- Agents: Langchain
Prerequisites:
- Python 3.10 or higher
- Pip
Clone the repository:
git clone https://github.com/suryanshgupta9933/breakoutai-assesment.git
cd breakoutai-assesment
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate
Install the required dependencies:
pip install -r requirements.txt
Start the Application
- Run the FastAPI Backend
python routes.py
- Run the Streamlit Dashboard
streamlit run dashboard.py
This setup uses Docker Compose and requires minimal configuration.
Prerequisites:
- Docker
Build and run the Docker containers:
docker-compose up --build
Access the Streamlit dashboard at http://localhost:8501.
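To confirm that both services came up, a quick check along these lines may help. It is a sketch only: `is_listening` is a hypothetical helper, port 8501 comes from the dashboard URL above, and port 8000 is inferred from the endpoint URLs in `.env.example` rather than stated as the backend's configured port.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Assumed ports: 8000 (from the .env.example endpoints), 8501 (Streamlit URL).
    for name, port in [("FastAPI backend", 8000), ("Streamlit dashboard", 8501)]:
        status = "up" if is_listening("localhost", port) else "not reachable"
        print(f"{name} (port {port}): {status}")
```

An open port only shows that something is accepting connections there; it does not verify that the pipeline itself works.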
The current implementation supports Google Sheets integration through a service account. A service account is a special type of Google account that represents a non-human user that needs to authenticate and be authorized to access data in Google APIs.
Note: Because it is a separate account, it has no access to any spreadsheet until you share the spreadsheet with it, just as with any other Google account.
- Head over to the Google Cloud Console and create a new project (or select an existing one).
- Search for APIs & Services in the search bar and click on Enable APIs and Services.
- Search for Google Sheets API and click on Enable.
- Search for Google Drive API and click on Enable.
You have successfully enabled the Google Sheets API and Google Drive API for your project.
- Go to APIs & Services > Credentials and click on Create Credentials.
- Select Service Account and fill in the details.
Note: Copy the Service Account Email ID. You will share your Google Sheets with this account.
- Click the ⋮ menu next to the newly created service account, select Keys, and then click on Add Key > Create new key.
- Select JSON key and press Create.
- Your Service Account JSON Key will be downloaded.
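Once the key is downloaded, reading a shared sheet with gspread could look roughly like the sketch below. This is an illustration under stated assumptions, not the project's own loader: `service_account_email` and `read_sheet_records` are hypothetical helpers, and `client_factory` exists only so the function can be exercised without real credentials. The gspread calls used (`gspread.service_account`, `open_by_url`, `sheet1`, `get_all_records`) are part of its public API, and the `client_email` field is part of the standard service account key file.

```python
import json
from typing import Any, Callable, Optional


def service_account_email(key_path: str) -> str:
    """Read the service account's email from the downloaded JSON key file."""
    with open(key_path, encoding="utf-8") as fh:
        return json.load(fh)["client_email"]


def read_sheet_records(
    sheet_url: str,
    key_path: str,
    client_factory: Optional[Callable[[str], Any]] = None,
) -> list[dict]:
    """Return the first worksheet of a shared spreadsheet as a list of row dicts."""
    if client_factory is None:
        # Imported lazily so the helper above works without gspread installed.
        import gspread

        client = gspread.service_account(filename=key_path)
    else:
        # Hypothetical hook for tests: build a client some other way.
        client = client_factory(key_path)
    worksheet = client.open_by_url(sheet_url).sheet1
    return worksheet.get_all_records()
```

`service_account_email("path/to/your/service-account-key.json")` returns the address to share the spreadsheet with; until the sheet is shared with that address, opening it will fail with a permission error.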
- Rename the .env.example file to .env and update the environment variables.
UPLOAD_ENDPOINT="http://localhost:8000/upload-csv"
PIPELINE_ENDPOINT="http://localhost:8000/pipeline"
OPENAI_API_KEY="your-openai-api-key"
SERVICE_ACCOUNT_KEY="path/to/your/service-account-key.json"
- Add your OpenAI API Key and the path to your Service Account JSON Key to the .env file.
- You have successfully set up Google Sheets Integration. Share your Google Sheets with the Service Account Email ID and you are good to go.
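A small startup check can catch a missing or empty variable before the pipeline runs. The sketch below is an assumption about how such a check might look, not code from the repository: `load_settings` and `REQUIRED_VARS` are hypothetical names, and it reads the process environment, so the `.env` file must already have been loaded by whatever mechanism the project uses.

```python
import os
from typing import Mapping, Optional

# Variable names as listed in .env.example.
REQUIRED_VARS = (
    "UPLOAD_ENDPOINT",
    "PIPELINE_ENDPOINT",
    "OPENAI_API_KEY",
    "SERVICE_ACCOUNT_KEY",
)


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Collect the required variables, failing fast if any is unset or empty."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not source.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: source[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` with no argument checks the real environment; passing a plain dict makes the function easy to test.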



