Data Sources & Architecture
Every fact in Civic Spiegel traces back to a primary government source. No third-party analysis, no aggregator bias — raw official records, indexed and made searchable.
Coverage & live index
Reference figures for the architecture, plus counts from the production database (this page refetches about every minute).
Reference metrics
Live from database
How the Pipeline Works
Scrape & Fetch
GitHub Actions triggers daily. Python scrapers hit Legistar, NYS Senate Open Legislation, and NYC Open Data endpoints. TypeScript scrapers fetch member HTML from council, assembly, and senate sites.
Extract & Clean
Raw HTML is stripped. PDF transcripts are parsed for plain text. Junk content (empty strings, procedural placeholders) is filtered out in base_scraper.py before any storage occurs.
Classify
tag_classifier.py uses keyword matching and spaCy NER to tag each document with policy areas (Housing, Transit, Budget…), affected demographics, jurisdiction level, and policy stage.
Chunk & Embed
Documents are split into ~500-word sentence-aware chunks with 50-word overlap. Each chunk is passed through BAAI/bge-small-en-v1.5 to produce a 384-dim float32 vector.
Store
Chunks and their embeddings are inserted into Neon Postgres. Deduplication checks source_url before insert. The politicians.json cache is committed back to the repo.
Retrieve
At query time, the user's message (augmented with borough/ZIP context) is embedded and compared via pgvector cosine distance. Falls back to lexical search, then recency, if vector recall is empty.
Generate
Top chunks are passed to Groq (Llama 3.1 8B) with the user's demographic context. The LLM returns either a structured JSON briefing or plain markdown depending on the interface.
NYC City Council
New York State Legislature
Federal Representation
Geospatial & Boundary Data
Technical Infrastructure
Next.js 14 + Tailwind CSS
App Router with server components, edge-compatible API routes, and a 24-hour CDN cache on the politicians route. Static GeoJSON boundary files are served from /public, with dynamic imports for glassmorphism/animation.
localStorage (browser-native)
Three separate localStorage keys for user demographics, accessibility settings (7 toggle states), and theme (light/dark).
FastAPI + Python
Async REST API hosted on Render (free tier). Handles RAG orchestration, pgvector similarity search, and LLM generation. SlowAPI rate-limiting at 10 req/min per IP. CORS restricted to production and localhost:3000.
Web Speech API (browser-native)
SpeechSynthesisUtterance used by AccessibilityWidget for text-to-speech. Reads selected text or full page main content. No external dependency — uses browser's built-in voices.
Cheerio + BeautifulSoup
Cheerio (Node/TypeScript) handles council.nyc.gov, nyassembly.gov, and house.gov scraping. BeautifulSoup (Python) handles RSS and supplementary HTML sources in the pipeline.
react-simple-maps + Leaflet.js
react-simple-maps renders the NYC Council district choropleth (GeoJSON projection). Leaflet.js (dynamically imported) renders the Civic Hub map with pins and toggleable boundary layers.
Groq Cloud — Llama 3.1 8B
RAG-grounded responses and structured briefings. Sub-second inference via Groq's LPU hardware. Handles structured JSON (Policy Synthesis sections) and plain markdown (Ask Spiegel chat).
OpenAI GPT-4o-mini
Used when Groq/RAG returns insufficient context (retrieval_tier='none'). Two paths: /api/llm/chat (direct, for /chat page) and /api/civic/floating-chat orchestration (fallback when RAG fails).
Neon Serverless Postgres + pgvector
Serverless Postgres with pgvector extension. Stores 384-dimension float32 embeddings in DocumentChunk rows alongside plain text, enabling cosine-similarity search without a separate vector store.
FastEmbed — BAAI/bge-small-en-v1.5
ONNX-based embedding model generating 384-dim vectors at ingest time (pipeline) and query time (backend). Falls back to HuggingFace Inference API when local model is unavailable.
spaCy NLP Pipeline
en_core_web_sm model used in tag_classifier.py to extract named entities (politicians, agencies, locations). Also used for keyword classification of documents into policy areas and affected demographics.
GitHub Actions (×2 workflows)
Two scheduled workflows run daily at 06:00 UTC: run_pipeline.yml ingests new legislative documents into Neon; refresh-politicians.yml scrapes fresh member data and commits politicians.json to the repo.
Civic Resources for Residents
These aren't data sources we ingest — they're tools we recommend to users who want to go deeper.
mygovnyc.org
Enter any NYC address to see every elected rep at the city, state, and federal levels.
NYC Boundary Explorer (BetaNYC)
View overlapping civic boundaries — council districts, community boards, school districts, and precincts — on one map.
NYC Council Portal
Browse bills, votes, hearings, and member profiles. The primary public interface for NYC legislation.
Legistar Calendar
Full Council and committee schedule with agendas, minutes, and vote records.
NYC Rules Calendar
Official rulemaking calendar detailing proposed rule modifications, public hearings, and newly adopted regulations across NYC departments.
NYC Community Boards
Directory of all 59 community boards and their meeting schedules.
NYS Senate Find My Senator
Official address-based lookup for your State Senator.
Find Your U.S. Representative
Locate your federal House representative by address or ZIP.
BallotReady NY
Nonpartisan voter guide covering every race on your ballot in New York.
Database Model
A single Neon Postgres instance handles relational data and vector search via pgvector — no separate vector store needed.
PoliticianName, party, role, district, borough, bio URL
DistrictZIP codes, NTAs, and jurisdiction per district number
LegislationEventBills, resolutions, status, date, URL
VoteRecordLinks politician ↔ event with vote_cast value
PolicyDocumentScraped source with metadata_tags JSON
DocumentChunkText chunk + 384-dim pgvector embedding
Retrieval Strategy
Three-tier fallback ensures answers even when vector recall is sparse.
pgvector cosine similarity on embedded query (augmented with borough/ZIP context)
ILIKE full-text keyword match across chunk text when vector recall is empty
Most recently ingested chunks returned if both vector and lexical fail
Automated Refresh
Two GitHub Actions workflows keep data current without manual intervention.
run_pipeline.ymlScrapes Legistar, NYS Senate bills, meeting records; embeds and upserts into Neon.
refresh-politicians.ymlScrapes all 290+ member pages, builds politicians.json, commits back to the repo.
No ads, no sponsored content. All data is sourced from official government APIs and public HTML pages. Civic Spiegel does not sell or share user data. The representative directory is non-partisan and pulls directly from official government sites.