[Draft][Epic] Extract and normalize GTFS feed providers #1715

Description

@davidgamez

Summary

Extract provider information from GTFS agency.txt and feed_info.txt, store unique raw file records, and allow humans to link them to canonical feed_provider records.

Problem

GTFS files include agencies and publishers, but names and URLs are not always consistent.

Example: one agency may appear under several names:

  • TTC
  • Toronto Transit Commission
  • Toronto Transit Commission (TTC)

The system should not guess. It should store the raw values and support manual normalization.

Goals

  • Extract agency.txt and feed_info.txt
  • Avoid duplicate raw records when the file values are identical
  • Link datasets to the extracted records
  • Add canonical feed_provider records
  • Store provider names and contacts
  • Link raw records to a feed_provider
  • Generate feed-level provider roles
  • Reuse known mappings only on exact name and URL match

Proposed model

Feed provider tables

  • feed_provider

    • id
    • name
    • organization_id
    • status: wip, published, not_published
    • created_at
    • updated_at
  • feed_provider_name_alias

    • id
    • feed_provider_id
    • value
  • feed_provider_contact

    • id
    • feed_provider_id
    • contact_type
    • value

Contact types:

  • website
  • email
  • phone
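
The three provider tables and the contact types could be sketched as plain Python dataclasses. This is only an illustration of the proposed shape: the string ID types, the `wip` default status, and the UTC timestamps are assumptions, not part of the proposal.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ProviderStatus(str, Enum):
    WIP = "wip"
    PUBLISHED = "published"
    NOT_PUBLISHED = "not_published"


class ContactType(str, Enum):
    WEBSITE = "website"
    EMAIL = "email"
    PHONE = "phone"


def _now() -> datetime:
    # Assumption: timestamps are stored in UTC.
    return datetime.now(timezone.utc)


@dataclass
class FeedProviderNameAlias:
    id: str
    feed_provider_id: str
    value: str


@dataclass
class FeedProviderContact:
    id: str
    feed_provider_id: str
    contact_type: ContactType
    value: str


@dataclass
class FeedProvider:
    id: str
    name: str
    organization_id: Optional[str] = None  # optional per the acceptance criteria
    status: ProviderStatus = ProviderStatus.WIP  # assumed default
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)


# Illustrative values only.
ttc = FeedProvider(id="fp-1", name="Toronto Transit Commission")
alias = FeedProviderNameAlias(id="a-1", feed_provider_id=ttc.id, value="TTC")
```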

Raw GTFS file tables

  • agency_txt

    • unique rows from agency.txt
    • includes agency_id, agency_name, agency_url, agency_timezone, agency_phone, agency_email, agency_fare_url
    • includes optional feed_provider_id
    • includes content_hash
  • gtfs_dataset_agency_txt

    • links GTFS datasets to agency_txt rows
  • feed_info_txt

    • unique rows from feed_info.txt
    • includes feed_publisher_name, feed_publisher_url, feed_lang, feed_start_date, feed_end_date, feed_version, feed_contact_email, feed_contact_url
    • includes optional feed_provider_id
    • includes content_hash
  • gtfs_dataset_feed_info_txt

    • links GTFS datasets to feed_info_txt rows
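
The proposal does not say how content_hash is computed. One possible approach, sketched below, hashes the stored columns in a fixed order so that identical values always produce the same hash. The choice of SHA-256, the JSON canonical form, and treating a missing column as an empty string are all assumptions.

```python
import hashlib
import json
from typing import Mapping, Sequence

# Columns stored for agency_txt, using their GTFS field names.
AGENCY_FIELDS = (
    "agency_id",
    "agency_name",
    "agency_url",
    "agency_timezone",
    "agency_phone",
    "agency_email",
    "agency_fare_url",
)


def content_hash(row: Mapping[str, str], fields: Sequence[str]) -> str:
    """Return a stable hex digest for the given columns of one raw row."""
    # Fixed column order makes the hash independent of dict ordering.
    # Assumption: a missing or empty column hashes the same as "".
    canonical = [[name, row.get(name) or ""] for name in fields]
    payload = json.dumps(canonical, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Values are hashed as-is, without trimming or case folding, so "TTC" and "Toronto Transit Commission" stay separate raw records.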

Feed-level role table

  • feed_provider_feed_role
    • feed_id
    • feed_provider_id
    • role

Supported roles:

  • agency
  • publisher

Process

  1. Parse agency.txt and feed_info.txt when a GTFS dataset is processed.
  2. Create or reuse agency_txt and feed_info_txt records using content_hash.
  3. Link the dataset to those records.
  4. If an exact known mapping exists, assign feed_provider_id.
  5. If not, create a new feed_provider and leave it unlinked for human review.
  6. When linked to a provider, add missing names and contacts.
  7. Generate feed-level roles from the latest dataset.
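
Steps 2 and 3 amount to a get-or-create keyed by content_hash, followed by a dataset link. The sketch below uses an in-memory store to show the idea. `RawRecord`, `RawStore`, and `ingest` are hypothetical names, and a real implementation would presumably rely on a unique constraint in the database rather than a dict.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class RawRecord:
    content_hash: str
    values: Dict[str, str]
    feed_provider_id: Optional[str] = None  # stays None until linked


@dataclass
class RawStore:
    # One raw record per distinct content_hash.
    records: Dict[str, RawRecord] = field(default_factory=dict)
    # (dataset_id, content_hash) pairs, standing in for the link table.
    dataset_links: Set[Tuple[str, str]] = field(default_factory=set)

    def ingest(
        self, dataset_id: str, content_hash: str, values: Dict[str, str]
    ) -> RawRecord:
        """Create or reuse the raw record, then link it to the dataset."""
        record = self.records.get(content_hash)
        if record is None:
            record = RawRecord(content_hash, dict(values))
            self.records[content_hash] = record
        self.dataset_links.add((dataset_id, content_hash))
        return record


# Two datasets with an identical row share one raw record.
store = RawStore()
row = {"agency_name": "TTC"}
first = store.ingest("dataset-1", "hash-a", row)
second = store.ingest("dataset-2", "hash-a", row)
```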

Matching rules

Reuse a known mapping only when both values in a pair match exactly:

  • agency_name and agency_url
  • feed_publisher_name and feed_publisher_url
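
A minimal sketch of the exact-match lookup, assuming known mappings are held as a dictionary keyed by the (name, URL) pair. The function name, the dictionary layout, and the example URL are hypothetical. Returning no match when either value is missing is also an assumption.

```python
from typing import Dict, Optional, Tuple

# (name, url) -> feed_provider_id, built from previously reviewed links.
KnownMappings = Dict[Tuple[str, str], str]


def find_known_provider(
    mappings: KnownMappings, name: Optional[str], url: Optional[str]
) -> Optional[str]:
    """Return a feed_provider_id only when name and URL both match exactly."""
    if not name or not url:
        return None
    # Plain dictionary lookup: no trimming, case folding, or fuzzy matching.
    return mappings.get((name, url))


known = {("Toronto Transit Commission", "https://example.org/ttc"): "fp-1"}

exact = find_known_provider(known, "Toronto Transit Commission", "https://example.org/ttc")
name_differs = find_known_provider(known, "TTC", "https://example.org/ttc")
```

Here `exact` resolves to the stored provider, while `name_differs` is `None` and would go to human review even though the URL is the same.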

API proposal

  • GET /v1/gtfs/datasets/{dataset_id}/agency-txt
  • GET /v1/gtfs/datasets/{dataset_id}/feed-info-txt
  • GET /v1/feeds/{feed_id}/feed-providers?role=agency|publisher
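
The helpers below sketch how a client might build these three request paths. The function names are hypothetical, response shapes are not defined in this proposal, and treating `role` as an optional filter is an assumption.

```python
from typing import Optional
from urllib.parse import quote, urlencode

SUPPORTED_ROLES = ("agency", "publisher")


def agency_txt_path(dataset_id: str) -> str:
    return f"/v1/gtfs/datasets/{quote(dataset_id, safe='')}/agency-txt"


def feed_info_txt_path(dataset_id: str) -> str:
    return f"/v1/gtfs/datasets/{quote(dataset_id, safe='')}/feed-info-txt"


def feed_providers_path(feed_id: str, role: Optional[str] = None) -> str:
    path = f"/v1/feeds/{quote(feed_id, safe='')}/feed-providers"
    if role is None:
        # Assumption: omitting role returns providers for all roles.
        return path
    if role not in SUPPORTED_ROLES:
        raise ValueError(f"role must be one of {SUPPORTED_ROLES}, got {role!r}")
    return f"{path}?{urlencode({'role': role})}"
```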

Acceptance criteria

  • agency.txt and feed_info.txt are extracted.
  • Identical raw records are stored once and linked to datasets.
  • Raw records can be linked to a feed_provider.
  • feed_provider can have multiple names and contacts.
  • feed_provider.organization_id is optional.
  • Feed-level provider roles are generated from linked records.
  • Exact known mappings are reused.
  • No fuzzy matching or confidence score is added.
