Repository for the Lighthouse sensor bot project.
This project originated from the pre-study "Large Language Models (LLMs) in Maritime Data Analysis and Decision Support." Read more about it here: Large Language Models (LLMs) can solve operational challenges
Lighthouse Sensor Bot is a data analysis application that answers natural language queries about maritime ferry data using agentic RAG. The system consists of:
- Frontend: Interface for submitting queries and viewing results
- Backend: Flask server that processes queries with the help of an LLM acting as an agent (an illustrative sketch of this flow follows the list)
- Database: PostgreSQL database for storing query results and model evaluations
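To make the flow concrete, here is a purely illustrative sketch of how such a backend endpoint could hand a question to an agent and return the answer together with the SQL it executed. This is not the project's actual code: the endpoint path, the run_agent helper, and the response fields are hypothetical.

```python
# Hypothetical sketch of the query flow (not the project's actual code):
# a question comes in, an LLM agent produces and runs SQL, and the answer plus
# the executed queries go back to the frontend.
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_agent(question: str) -> dict:
    """Placeholder for the LLM agent: generate SQL, execute it, summarize the result."""
    sql = "SELECT AVG(speed_knots) FROM ferry_trips WHERE ferry_name = 'Jupiter';"  # illustrative only
    return {"answer": "Illustrative answer goes here.", "sql_queries": [sql]}

@app.post("/api/query")  # hypothetical endpoint name
def query():
    question = (request.get_json() or {}).get("question", "")
    return jsonify(run_agent(question))
```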
- Docker installed on your system.
- OpenRouter API key, for accessing language model functionality.
- OpenAI API key, used when evaluating model responses with RAGAS metrics.
- DeepSeek API key (optional), used when evaluating model responses with RAGAS metrics. (Much cheaper than OpenAI.)
(You can choose between OpenAI and DeepSeek for evaluation; however, an OpenAI key is required in either case for creating embeddings. If a DeepSeek key is provided in the backend .env file, it will be used for evaluation.)
- RAGAS App token (optional), used for uploading evaluation results to the RAGAS app dashboard.
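Before starting the stack, it can help to confirm that the keys listed above are actually present in the environment. The variable names in this sketch are assumptions; the authoritative names are whatever backend-example.env defines.

```python
# Sketch: warn about missing API keys before starting the stack.
# Variable names are assumed from the prerequisite list above, not read from backend-example.env.
import os

REQUIRED = ["OPENROUTER_API_KEY", "OPENAI_API_KEY"]   # queries + embeddings
OPTIONAL = ["DEEPSEEK_API_KEY", "RAGAS_APP_TOKEN"]    # cheaper evaluation / dashboard upload

for name in REQUIRED:
    if not os.environ.get(name):
        print(f"Missing required key: {name}")
for name in OPTIONAL:
    if not os.environ.get(name):
        print(f"Optional key not set: {name}")
```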
To run the application as Docker containers:
- Create a .env file in the backend directory: copy backend-example.env to .env and update the variables.
- For the frontend, create a .env.development.local file: copy frontend-example.env.development.local to .env.development.local and update the variable if needed. (This can be omitted if you don't want to run the frontend in development mode from the terminal.)
- Create a .env file in the root directory: copy root-example.env to .env and update the variables.
- If you are on an arm64 machine (e.g., Apple Silicon), change the Dockerfile in the backend directory to use the arm64 Python image.
- In the root directory, run:
docker-compose up -d --build
The PostgreSQL database will be automatically initialized with the necessary schema and data. The frontend will be available at http://localhost:3000.
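If you script the startup, a small readiness check can poll the services until they respond. This is a convenience sketch rather than part of the project, and it assumes the frontend on port 3000 and the backend API on port 5001, as used elsewhere in this README.

```python
# Sketch: poll the frontend and backend after `docker-compose up` until they respond.
# Ports are the ones used in this README (frontend 3000, backend API 5001).
import time
import urllib.error
import urllib.request

def wait_for(url: str, timeout: float = 120.0) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            return True      # the server answered, even if with an error status
        except OSError:
            time.sleep(2)    # container still starting; try again
    return False

for url in ("http://localhost:3000", "http://localhost:5001/api/model-performance"):
    print(url, "is up" if wait_for(url) else "is not responding")
```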
- Select a language model from the dropdown
- Enter your question about ferry data (e.g., "What is the average speed of ferry Jupiter?")
- View the response, including SQL queries executed
To evaluate a model, click the "Evaluate" button. This will run predefined queries and evaluate the model's performance using RAGAS metrics. You cannot submit your own queries when evaluating, because the RAGAS evaluation requires a ground truth and reference context.
In the Evaluation tab, you can see graphs of the average RAGAS scores for each model.
If you encounter any issues, please do the following:
- Ensure that the Docker containers are running properly.
- Check the logs for any error messages.
- Ensure that the environment variables are correctly set.
The application includes powerful command-line tools for running and analyzing evaluation tests. These tools allow you to evaluate model performance and view results without using the web interface.
Navigate to the backend directory and use the evaluation test runner:
cd backend
python run_evaluation_tests.py
Interactive Model Selection:
python run_evaluation_tests.py
# Lists available models and prompts for selection
Run Tests for Specific Model:
python run_evaluation_tests.py --model "google/gemini-2.5-flash-preview"
python run_evaluation_tests.py --model "anthropic/claude-3.7-sonnet"
python run_evaluation_tests.py --model "openai/gpt-4o-2024-11-20"
Run Specific Test Cases:
python run_evaluation_tests.py --tests 1,3,5 # Run specific test numbers
Advanced Options:
python run_evaluation_tests.py --model "gpt-4o" --runs 3 --max-retries 2 --tests 1,2,3--runs: Number of evaluation runs per test case (default: 1)--max-retries: Maximum retry attempts for failed tests (default: 1)--test-cases: Specific test cases to run (default: all 10 tests)
$ python run_evaluation_tests.py --model "google/gemini-2.5-flash-preview" --tests 1,2,3
Starting Evaluation Tests
========================================
Model: google/gemini-2.5-flash-preview
Test Cases: 1-3 (3 tests)
Runs per test: 1
Max retries: 1
Test Progress:
Test 1/3: What is the average speed of ferry Jupiter? - PASSED
Test 2/3: How many passengers did Vaxholmsleden carry in total? - PASSED
Test 3/3: Which route has the highest carbon emissions? - RETRY 1
Test 3/3: Which route has the highest carbon emissions? - PASSED
Evaluation Complete!
Total Tests: 3
Passed: 3
Failed: 0
Success Rate: 100%
Use the test results viewer to analyze evaluation data:
cd backend
python view_test_results.py
Recent Test Results (Default):
python view_test_results.py # Show last 20 results
python view_test_results.py --limit 50 # Show last 50 results
Model Performance Summary:
python view_test_results.py --summary # Aggregated metrics by model
Detailed Results with Full Text:
python view_test_results.py --detailed # Show queries and responses
python view_test_results.py --detailed --limit 5 # Limit to 5 detailed results
Filter by Specific Model:
python view_test_results.py --model "google/gemini-2.5-flash-preview"
python view_test_results.py --model "openai/gpt-4o-2024-11-20" --limit 10
Overall Statistics:
python view_test_results.py --stats # Test statistics and averages
python view_test_results.py --list-models # List all tested models
$ python view_test_results.py --summary
Model Performance Summary (3 entries):
+------------------------------+-------------+------------------------+-------------------------+
| model_name                   | model_type  | query_evaluation_count | avg_factual_correctness |
+==============================+=============+========================+=========================+
| google/gemini-2.5-flash-prev | proprietary | 18                     | 0.321                   |
| anthropic/claude-3.7-sonnet  | proprietary | 5                      | 0.456                   |
| openai/gpt-4o-2024-11-20     | proprietary | 12                     | 0.678                   |
+------------------------------+-------------+------------------------+-------------------------+
The system evaluates models using RAGAS metrics (a minimal scoring sketch follows this list):
- Factual Correctness: How factually accurate the response is
- Semantic Similarity: How semantically similar the response is to the ground truth
- Context Recall: How well the retrieved context covers the ground truth answer
- Faithfulness: How faithful the response is to the provided context
- Token Usage: Prompt, completion, and total tokens used
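For reference, scoring a single query/response pair with these metrics looks roughly like the sketch below. It assumes the ragas 0.2-style API and an OPENAI_API_KEY in the environment; the project's own evaluation code, class names, and dataset fields may differ between ragas versions.

```python
# Minimal sketch of scoring one query/response pair with the RAGAS metrics listed above.
# Assumes the ragas 0.2-style API; sample values are illustrative placeholders.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import EvaluationDataset, evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import FactualCorrectness, Faithfulness, LLMContextRecall, SemanticSimilarity

dataset = EvaluationDataset.from_list([
    {
        "user_input": "What is the average speed of ferry Jupiter?",  # query
        "response": "<model answer goes here>",                       # agent's answer
        "retrieved_contexts": ["<reference context goes here>"],      # context given to the agent
        "reference": "<ground truth answer goes here>",               # expected answer
    }
])

result = evaluate(
    dataset=dataset,
    metrics=[FactualCorrectness(), SemanticSimilarity(), LLMContextRecall(), Faithfulness()],
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),      # judge LLM (assumption)
    embeddings=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),     # embeddings for semantic similarity
)
print(result)
```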
You can also access evaluation functionality via API:
# Run evaluation via API
curl -X POST http://localhost:5001/api/evaluate \
-H "Content-Type: application/json" \
-d '{
"model_id": "google/gemini-2.5-flash-preview",
"number_of_runs": 1,
"max_retries": 1
}'
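The same calls can be made from Python; here is a minimal sketch using the requests library, with the endpoints and payload shown in this section.

```python
# Sketch: trigger an evaluation run and fetch performance data from Python,
# using the same endpoints and payload as the curl examples in this section.
import requests

BASE_URL = "http://localhost:5001"

response = requests.post(
    f"{BASE_URL}/api/evaluate",
    json={
        "model_id": "google/gemini-2.5-flash-preview",
        "number_of_runs": 1,
        "max_retries": 1,
    },
    timeout=600,  # evaluation runs can take several minutes
)
response.raise_for_status()
print(response.json())

performance = requests.get(f"{BASE_URL}/api/model-performance", timeout=30)
print(performance.json())
```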
# Get performance data
curl http://localhost:5001/api/model-performance
Test cases are defined in backend/app/ragas/test_cases/synthetic_test_cases.json and include:
- Ferry speed analysis questions
- Passenger traffic queries
- Route comparison questions
- Environmental impact analysis
- Operational efficiency metrics
- Time-based traffic patterns
- Cross-route comparisons
- Data aggregation queries
- Complex analytical questions
- Multi-criteria analysis
Each test case includes:
- Query: The question to ask the model
- Ground Truth: Expected answer for evaluation
- Contexts: Reference information for context evaluation
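To inspect the test cases outside the runner, the file can be loaded directly. The key names in this sketch (query, ground_truth, contexts) are assumptions derived from the field list above; check synthetic_test_cases.json for the actual schema.

```python
# Sketch: list the synthetic test cases used for evaluation.
# Key names ("query", "ground_truth", "contexts") are assumptions based on the
# field list above; adjust them to match the actual JSON schema.
import json
from pathlib import Path

path = Path("backend/app/ragas/test_cases/synthetic_test_cases.json")
test_cases = json.loads(path.read_text())

for number, case in enumerate(test_cases, start=1):
    print(f"Test {number}: {case.get('query')}")
    print(f"  ground truth: {case.get('ground_truth')}")
    print(f"  contexts: {len(case.get('contexts', []))} reference snippet(s)")
```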
Database Connection Errors:
- Ensure the PostgreSQL container is running:
docker-compose ps
- Check the database logs:
docker-compose logs postgres
Evaluation Test Failures:
- Verify API keys are set correctly in the .env files
- Check model availability with:
python view_test_results.py --list-models
- Review test logs for specific error messages
Performance Issues:
- Large models may take several minutes per test
- Consider using --test-cases to run smaller test subsets
- Monitor token usage with the --detailed results view
Every now and then, a model will try to use a tool that doesn't exist, or other errors from the LLM will occur. Currently, this results in a 500 status code in the response. We are working on a more graceful solution.