
Cached CSV Search Engine

Table of Contents

  1. Problem Overview
  2. Core Functions
  3. Architecture Highlights
  4. Real Performance Data
  5. Test Dataset

Problem Overview

Efficiently process millions of CSV records to find specific data points and calculate weighted statistics, optimized for repeated queries and minimal memory usage.

Core Functions

task1(search, data) - Single Record Search

Purpose: Find the first record matching all search criteria and return its 'value' field.

Input:

  • search: Dictionary of column-value pairs to match (e.g., {'a': 123, 'b': 456})
  • data: Raw CSV string with millions of records

Output:

  • Returns the value field from the first matching record
  • Returns '-1' if no match is found

Example:

search = {'a': 938089, 'b': 213662, 'c': 979447, 'd': 164203, 'e': 4557}
result = task1(search, csv_data)
# Returns: "866097" (the value from first matching row)

Key Optimizations:

  • LRU Caching: Identical searches return instantly (0.0000s)
  • Streaming Processing: Processes line-by-line without loading entire dataset
  • Header Caching: Avoids re-parsing CSV structure on repeated calls
  • Early Termination: Stops at first match found
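
Taken together, these optimizations suggest an implementation along the following lines. This is a minimal sketch, not the repository's actual code: the helper name _cached_search, the _data_by_hash registry, and the use of hashlib.md5 as the dataset fingerprint are illustrative assumptions, and per-query header caching is folded into the LRU cache for brevity.

import hashlib
import io
from functools import lru_cache

_data_by_hash = {}   # illustrative registry: dataset fingerprint -> raw CSV string

def task1(search, data):
    """Return the 'value' field of the first matching record, or '-1'."""
    # Fingerprint the dataset so a changed dataset invalidates cached answers.
    # (md5 over the full string is shown for clarity; a cheaper fingerprint,
    # e.g. length plus a sampled prefix, would avoid rescanning 400MB.)
    data_hash = hashlib.md5(data.encode()).hexdigest()
    _data_by_hash[data_hash] = data
    # A sorted tuple makes the cache key hashable and order-independent.
    key = tuple(sorted(search.items()))
    return _cached_search(key, data_hash)

@lru_cache(maxsize=1000)   # the 1000-entry LRU cache described above
def _cached_search(key, data_hash):
    lines = io.StringIO(_data_by_hash[data_hash])
    header = next(lines).rstrip('\n').split(',')
    value_idx = header.index('value')
    # Resolve each search column to its index once per query.
    criteria = [(header.index(col), str(val)) for col, val in key]
    for line in lines:   # streaming: one line in memory at a time
        fields = line.rstrip('\n').split(',')
        if all(fields[i] == want for i, want in criteria):
            return fields[value_idx]   # early termination at first match
    return '-1'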

task2(search_list, data) - Weighted Average Calculator

Purpose: Search for multiple records and calculate their weighted average using a specific odd/even weighting scheme.

Input:

  • search_list: List of search dictionaries to process
  • data: Raw CSV string with millions of records

Output:

  • Weighted average as a string with 1 decimal place (e.g., "543687.0")
  • Returns "0.0" if no matches found

Weighting Logic:

  • Odd values: Weight = 10
  • Even values: Weight = 20

Example:

search_list = [
    {'a': 938089, 'b': 213662, 'c': 979447, 'd': 164203, 'e': 4557},
    {'a': 21528, 'b': 434740, 'c': 253023, 'd': 60558, 'e': 616279}
]
result = task2(search_list, csv_data)
# Returns: "543687.0" (weighted average of found values)

Calculation:

Found values: [543687, 789123]  (example)
Weights: [10, 10]  (543687 is odd=10, 789123 is odd=10)
Weighted Average = (543687×10 + 789123×10) / (10 + 10) = 666405.0

Key Features:

  • Leverages task1 caching: Each search benefits from the LRU cache
  • Skips missing records: Ignores searches that return '-1'
  • Precise calculations: Maintains accuracy for large numbers
  • Handles edge cases: Returns "0.0" for no valid results
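
A sketch of the weighted-average logic, assuming the task1 sketch above; integer arithmetic is kept until the final division so large values stay exact:

def task2(search_list, data):
    """Weighted average of found values, formatted with one decimal place."""
    total = 0
    weight_sum = 0
    for search in search_list:
        result = task1(search, data)       # each lookup shares task1's LRU cache
        if result == '-1':
            continue                       # skip searches with no matching record
        value = int(result)
        weight = 10 if value % 2 else 20   # odd -> 10, even -> 20
        total += value * weight
        weight_sum += weight
    if weight_sum == 0:
        return "0.0"                       # no valid results
    return f"{total / weight_sum:.1f}"

On the worked calculation above, 543687 and 789123 are both odd, so both get weight 10 and the function returns "666405.0".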

Architecture Highlights

Smart Caching System

  • Cache Size: 1000 entries, enough to cover the repeated queries of a typical test workload without unbounded growth
  • Cache Key: Sorted tuple of search criteria for consistency
  • Cache Invalidation: Automatically detects dataset changes via hash
  • Memory Efficient: ~230KB cache overhead regardless of dataset size
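
The order-independence of the cache key is easy to see: two dictionaries holding the same criteria in different insertion order normalize to the same sorted tuple.

k1 = tuple(sorted({'a': 123, 'b': 456}.items()))
k2 = tuple(sorted({'b': 456, 'a': 123}.items()))
assert k1 == k2 == (('a', 123), ('b', 456))   # one cache entry, not two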

Performance Characteristics

  • Cache Hits: 0.0000s response time
  • Cache Misses: 1-7s to scan a 10M+ record dataset, depending on where the first match occurs
  • Memory Usage: <3MB additional memory regardless of dataset size
  • Scalability: additional memory stays flat as record counts grow, since a scan materializes only one line at a time

Real Performance Data

Test Cases on 10 Million+ Record Dataset:
============================================================
Test     Result       Time (s)   Δ Mem (MB)   Before (MB)  After (MB)
------------------------------------------------------------
Test 1   543687.0     1.6341     2.770        1379.617     1382.387
Test 2   666172.0     4.2195     -0.082       1382.387     1382.305
Test 3   714033.7     0.0000     0.000        1382.305     1382.305
Test 4   543687.0     0.0000     0.000        1382.305     1382.305
Test 5   396519.0     6.7825     0.016        1382.305     1382.320
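
Numbers like those above can be reproduced with a small harness. This sketch uses time.perf_counter and psutil to capture the same columns; the repository's actual measurement code is not shown, so treat the details as assumptions.

import os
import time
import psutil

def benchmark(label, fn, *args):
    """Report result, elapsed time, and RSS delta, mirroring the table columns."""
    proc = psutil.Process(os.getpid())
    before = proc.memory_info().rss / 1024 ** 2   # resident memory in MB
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    after = proc.memory_info().rss / 1024 ** 2
    print(f"{label:<8} {result:<12} {elapsed:<10.4f} "
          f"{after - before:<12.3f} {before:<12.3f} {after:<10.3f}")
    return result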

Test Dataset

The performance benchmarks use a 400MB CSV file with millions of records:

  • Download: find_match_average.dat (400MB)
  • Format: CSV with columns a,b,c,d,e,value
  • Size: 10+ million records
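
find_match_average.dat itself is not reproduced here, but a synthetic file in the same a,b,c,d,e,value format can be generated for local testing. This is a sketch; the row count and value ranges are arbitrary choices, not properties of the real dataset.

import random

def make_dataset(rows=1_000_000, seed=42):
    """Build a synthetic CSV string in the documented a,b,c,d,e,value format."""
    rng = random.Random(seed)
    lines = ['a,b,c,d,e,value']
    for _ in range(rows):
        lines.append(','.join(str(rng.randrange(1_000_000)) for _ in range(6)))
    return '\n'.join(lines)

csv_data = make_dataset(rows=100_000)   # small enough for a quick smoke test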

This solution demonstrates production-ready code with intelligent caching, memory efficiency, and performance optimization for large-scale data processing tasks.
