Civic Spiegel

Data Sources & Architecture

Every fact in Civic Spiegel traces back to a primary government source. No third-party analysis, no aggregator bias — raw official records, indexed and made searchable.

Coverage & live index

Reference figures for the architecture, plus counts from the production database (this page refetches about every minute).

Reference metrics

290+
Representatives tracked
6
Official data tiers
384
Embedding dimensions
Daily
Refresh cadence

Live from database

Updated Not synced yet
0
Civic records indexed
0
New this month
0
Unique sources indexed
0
Boroughs covered

How the Pipeline Works

1

Scrape & Fetch

GitHub Actions triggers daily. Python scrapers hit Legistar, NYS Senate Open Legislation, and NYC Open Data endpoints. TypeScript scrapers fetch member HTML from council, assembly, and senate sites.

2

Extract & Clean

Raw HTML is stripped. PDF transcripts are parsed for plain text. Junk content (empty strings, procedural placeholders) is filtered out in base_scraper.py before any storage occurs.

3

Classify

tag_classifier.py uses keyword matching and spaCy NER to tag each document with policy areas (Housing, Transit, Budget…), affected demographics, jurisdiction level, and policy stage.

4

Chunk & Embed

Documents are split into ~500-word sentence-aware chunks with 50-word overlap. Each chunk is passed through BAAI/bge-small-en-v1.5 to produce a 384-dim float32 vector.

5

Store

Chunks and their embeddings are inserted into Neon Postgres. Deduplication checks source_url before insert. The politicians.json cache is committed back to the repo.

6

Retrieve

At query time, the user's message (augmented with borough/ZIP context) is embedded and compared via pgvector cosine distance. Falls back to lexical search, then recency, if vector recall is empty.

7

Generate

Top chunks are passed to Groq (Llama 3.1 8B) with the user's demographic context. The LLM returns either a structured JSON briefing or plain markdown depending on the interface.

Steps 1-5 run automatically every day at 06:00 UTC via GitHub Actions

NYC City Council

Representative Directory

NYC Council Districts & Members

Live HTML scrape of all 51 council members, their districts, neighborhoods, party affiliations, committee memberships, and office contact info. Refreshed daily via GitHub Actions.

Official Legislative Records

NYC Council Legistar API

Official REST API for NYC Intro bills, Resolutions, committee meeting records, and voting history. Authenticated requests via NYC_COUNCIL_API_KEY.

Meeting Calendar & Transcripts

NYC Council Legistar Portal

Public calendar of all Council and committee sessions with agendas, minutes, and full hearing transcripts. PDFs are extracted, chunked, and indexed into the vector database.

Structured Meeting Records

NYC Open Data — Council Meetings

Socrata endpoint (m48u-yjt8) serving structured metadata for all finalized NYC Council meetings 1999-present. No API key required; used to build the meeting corpus.

Member Roster

NYC Open Data — Council Members

Socrata endpoint (uvw5-9znb) for current member roster including term dates and office IDs. Used as a secondary seed source for the database.

New York State Legislature

State Legislative Data

NYS Assembly Member Directory

Live scrape of all 150 Assembly members from nyassembly.gov. Party affiliation enriched via OpenStates REST API; committee assignments scraped from each member's /comm/ page.

State Legislative API

NYS Senate Open Legislation API

Official API for NY State Senate members, bills, resolutions, and floor transcripts. Used to populate senator profiles, session-year bills, and hearing transcripts going back to 2021.

Senator Profiles & Events

NYS Senate Portal

Official portal for senator bios, committee assignments, event listings, and town hall schedules. Photo URLs and bio slugs are resolved from this source.

Party & Committee Enrichment

OpenStates API (REST + GraphQL)

Third-party aggregator used to enrich NY legislators with party affiliation (via district-keyed REST) and committee memberships (via GraphQL). Results are merged with live scrape data.

Geospatial Boundary Data

NYS Board of Elections — Interactive Map

ArcGIS-hosted map of Congressional, State Senate, and Assembly districts across New York State. Embedded in the NYS Explorer tab.

Boundary Shapefiles

NYS Open GIS Portal

Official NYS Department of State GIS datasets. Source for NYS Senate, Assembly, and Congressional district boundary GeoJSON files served locally.

Federal Representation

Federal Directory

U.S. House of Representatives

Official member directory. NY representatives are scraped from the state table including district, party, and committee assignments.

Federal Legislative Tracking

GovTrack — NY Congressional Delegation

Independent tracker for New York's 26 House members and 2 Senators, including voting records and sponsorship data.

U.S. Senate

Senator Schumer — Official Site

Official profile for Chuck Schumer including committee memberships, caucus roles, and constituency contact info.

U.S. Senate

Senator Gillibrand — Official Site

Official profile for Kirsten Gillibrand including subcommittee assignments across Armed Services, Appropriations, and Intelligence.

Geospatial & Boundary Data

GeoJSON — City

NYC Council District Boundaries

51-district council boundary file served at /boundaries-districts.geojson. Powers the NYC Explorer choropleth map and the geocode-to-district lookup.

GeoJSON — City

NYC Borough Boundaries

5-borough boundary file (MIT-licensed via codeforgermany/click_that_hood, original data: NYC Open Data DCP). Used as an optional map overlay layer.

GeoJSON — City

NYC Neighborhood Tabulation Areas

195 NYC Planning NTAs (2020) served at /boundaries-neighborhoods.geojson. Used to derive neighborhood-to-district crosswalks for the RAG location context.

GeoJSON — City

NYC MODZCTA ZIP Crosswalk

Modified ZCTA shapefile used to compute ZIP-code-to-council-district crosswalks via polygon intersection (Shapely). Stored per district in the database.

GeoJSON — Federal

NYS Congressional District Boundaries

All 26 NY Congressional district boundaries served at /boundaries-congressional.geojson. Sourced from NYS ITS GIS.

GeoJSON — State

NYS Senate & Assembly Boundaries

63 Senate and 150 Assembly district boundaries served locally. Used as optional overlay layers in the Civic Hub map.

Address Resolution

NYC Planning Labs Geocoder

Free, no-key-required geocoding API for NYC addresses. Used in the NYC Explorer to convert street addresses into lat/lng for district lookup.

Technical Infrastructure

Frontend Framework

Next.js 14 + Tailwind CSS

App Router with server components, edge-compatible API routes, and a 24-hour CDN cache on the politicians route. Static GeoJSON boundary files are served from /public, with dynamic imports for glassmorphism/animation.

Client Persistence

localStorage (browser-native)

Three separate localStorage keys for user demographics, accessibility settings (7 toggle states), and theme (light/dark).

Backend Framework

FastAPI + Python

Async REST API hosted on Render (free tier). Handles RAG orchestration, pgvector similarity search, and LLM generation. SlowAPI rate-limiting at 10 req/min per IP. CORS restricted to production and localhost:3000.

Accessibility

Web Speech API (browser-native)

SpeechSynthesisUtterance used by AccessibilityWidget for text-to-speech. Reads selected text or full page main content. No external dependency — uses browser's built-in voices.

HTML Scraping

Cheerio + BeautifulSoup

Cheerio (Node/TypeScript) handles council.nyc.gov, nyassembly.gov, and house.gov scraping. BeautifulSoup (Python) handles RSS and supplementary HTML sources in the pipeline.

Mapping

react-simple-maps + Leaflet.js

react-simple-maps renders the NYC Council district choropleth (GeoJSON projection). Leaflet.js (dynamically imported) renders the Civic Hub map with pins and toggleable boundary layers.

Inference Engine

Groq Cloud — Llama 3.1 8B

RAG-grounded responses and structured briefings. Sub-second inference via Groq's LPU hardware. Handles structured JSON (Policy Synthesis sections) and plain markdown (Ask Spiegel chat).

Fallback LLM

OpenAI GPT-4o-mini

Used when Groq/RAG returns insufficient context (retrieval_tier='none'). Two paths: /api/llm/chat (direct, for /chat page) and /api/civic/floating-chat orchestration (fallback when RAG fails).

Vector Database

Neon Serverless Postgres + pgvector

Serverless Postgres with pgvector extension. Stores 384-dimension float32 embeddings in DocumentChunk rows alongside plain text, enabling cosine-similarity search without a separate vector store.

Embedding Engine

FastEmbed — BAAI/bge-small-en-v1.5

ONNX-based embedding model generating 384-dim vectors at ingest time (pipeline) and query time (backend). Falls back to HuggingFace Inference API when local model is unavailable.

Classification Layer

spaCy NLP Pipeline

en_core_web_sm model used in tag_classifier.py to extract named entities (politicians, agencies, locations). Also used for keyword classification of documents into policy areas and affected demographics.

Automation & CI

GitHub Actions (×2 workflows)

Two scheduled workflows run daily at 06:00 UTC: run_pipeline.yml ingests new legislative documents into Neon; refresh-politicians.yml scrapes fresh member data and commits politicians.json to the repo.

Civic Resources for Residents

These aren't data sources we ingest — they're tools we recommend to users who want to go deeper.

Database Model

A single Neon Postgres instance handles relational data and vector search via pgvector — no separate vector store needed.

Politician

Name, party, role, district, borough, bio URL

District

ZIP codes, NTAs, and jurisdiction per district number

LegislationEvent

Bills, resolutions, status, date, URL

VoteRecord

Links politician ↔ event with vote_cast value

PolicyDocument

Scraped source with metadata_tags JSON

DocumentChunk

Text chunk + 384-dim pgvector embedding

Retrieval Strategy

Three-tier fallback ensures answers even when vector recall is sparse.

Vector

pgvector cosine similarity on embedded query (augmented with borough/ZIP context)

Lexical

ILIKE full-text keyword match across chunk text when vector recall is empty

Recency

Most recently ingested chunks returned if both vector and lexical fail

Automated Refresh

Two GitHub Actions workflows keep data current without manual intervention.

run_pipeline.yml
Daily 06:00 UTC

Scrapes Legistar, NYS Senate bills, meeting records; embeds and upserts into Neon.

refresh-politicians.yml
Daily 06:00 UTC

Scrapes all 290+ member pages, builds politicians.json, commits back to the repo.

No ads, no sponsored content. All data is sourced from official government APIs and public HTML pages. Civic Spiegel does not sell or share user data. The representative directory is non-partisan and pulls directly from official government sites.