Summary
Extract provider information from GTFS agency.txt and feed_info.txt, store unique raw file records, and allow humans to link them to canonical feed_provider records.
Problem
GTFS files identify agencies and publishers, but the same provider can appear under different names and URLs.
Example (one agency, three names):
- TTC
- Toronto Transit Commission
- Toronto Transit Commission (TTC)
The system should not guess. It should store the raw values and support manual normalization.
Goals
- Extract agency.txt and feed_info.txt
- Avoid duplicate raw records when the file values are identical
- Link datasets to the extracted records
- Add canonical feed_provider records
- Store provider names and contacts
- Link raw records to a feed_provider
- Generate feed-level provider roles
- Reuse known mappings only on exact name and URL match
Proposed model
Feed provider tables
- feed_provider
  - id
  - name
  - organization_id
  - status: wip, published, not_published
  - created_at
  - updated_at
- feed_provider_name_alias
  - id
  - feed_provider_id
  - value
- feed_provider_contact
  - id
  - feed_provider_id
  - contact_type
  - value
Contact types:
- website
- email
- phone
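Raw GTFS file tables
- agency_txt (unique rows from agency.txt)
  - agency_id, agency_name, agency_url, timezone, phone, email, fare URL
  - feed_provider_id
  - content_hash
- gtfs_dataset_agency_txt (links datasets to agency_txt rows)
- feed_info_txt (unique rows from feed_info.txt)
  - feed_provider_id
  - content_hash
- gtfs_dataset_feed_info_txt (links datasets to feed_info_txt rows)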
Feed-level role table
- feed_provider_feed_role
  - feed_id
  - feed_provider_id
  - role
Supported roles:
- agency
- publisher
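As a rough illustration of the provider and role tables, here is a minimal SQLite sketch. The column types, defaults, CHECK constraints, and the composite primary key on the role table are assumptions made for the example; the proposal only names the tables, columns, and allowed values.

```python
import sqlite3

# Hypothetical DDL: types, defaults, and constraints are assumptions,
# only the table names, column names, and allowed values come from the proposal.
SCHEMA = """
CREATE TABLE feed_provider (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    organization_id INTEGER,  -- optional, may stay NULL
    status TEXT NOT NULL DEFAULT 'wip'
        CHECK (status IN ('wip', 'published', 'not_published')),
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE feed_provider_name_alias (
    id INTEGER PRIMARY KEY,
    feed_provider_id INTEGER NOT NULL REFERENCES feed_provider(id),
    value TEXT NOT NULL
);
CREATE TABLE feed_provider_contact (
    id INTEGER PRIMARY KEY,
    feed_provider_id INTEGER NOT NULL REFERENCES feed_provider(id),
    contact_type TEXT NOT NULL
        CHECK (contact_type IN ('website', 'email', 'phone')),
    value TEXT NOT NULL
);
CREATE TABLE feed_provider_feed_role (
    feed_id TEXT NOT NULL,
    feed_provider_id INTEGER NOT NULL REFERENCES feed_provider(id),
    role TEXT NOT NULL CHECK (role IN ('agency', 'publisher')),
    PRIMARY KEY (feed_id, feed_provider_id, role)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the four tables on the given connection."""
    conn.executescript(SCHEMA)
```

With this shape, a provider can carry any number of aliases and contacts, and an unknown status or role is rejected at insert time.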
Process
- Parse agency.txt and feed_info.txt when a GTFS dataset is processed.
- Create or reuse agency_txt and feed_info_txt records using content_hash.
- Link the dataset to those records.
- If an exact known mapping exists, assign feed_provider_id.
- If not, create a new feed_provider and leave it unlinked for human review.
- When linked to a provider, add missing names and contacts.
- Generate feed-level roles from the latest dataset.
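The create-or-reuse step could look roughly like the sketch below. The proposal does not say how content_hash is computed, so the SHA-256 over a key-sorted JSON encoding is an assumption, as are the helper names `content_hash` and `create_or_reuse` and the in-memory `store` standing in for the agency_txt table.

```python
import hashlib
import json


def content_hash(row: dict) -> str:
    """Hash a raw row so that identical file values give the same digest."""
    # Assumption: serialize with sorted keys so column order does not matter.
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def create_or_reuse(store: dict, row: dict) -> str:
    """Return the hash key for `row`, storing the row only if it is new."""
    key = content_hash(row)
    if key not in store:
        # New raw record: stored unlinked until a mapping or a human links it.
        store[key] = {"row": dict(row), "feed_provider_id": None}
    return key


# Usage: two datasets with identical values share one raw record.
store = {}
first = create_or_reuse(store, {"agency_name": "TTC", "agency_url": "https://example.org"})
second = create_or_reuse(store, {"agency_url": "https://example.org", "agency_name": "TTC"})
```

Because the values are hashed as written, "TTC" and "Toronto Transit Commission" remain two separate raw records, which matches the intent of not guessing.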
Matching rules
Reuse a known mapping only when both values match exactly:
- agency.txt: agency_name and agency_url
- feed_info.txt: feed_publisher_name and feed_publisher_url
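A minimal sketch of the lookup, assuming (as the Goals state) that both the name and the URL must match. The function name `find_known_mapping` and the dictionary of known mappings are hypothetical; the point is that the comparison is a plain equality test with no trimming, case folding, or URL normalization.

```python
from typing import Optional


def find_known_mapping(known: dict, name: str, url: str) -> Optional[int]:
    """Return the feed_provider_id mapped to this exact (name, url) pair, if any."""
    # Exact, case-sensitive match on the pair; anything else is left for review.
    return known.get((name, url))


# Usage with a placeholder URL: only the identical pair is reused.
known = {("TTC", "https://example.org"): 1}
hit = find_known_mapping(known, "TTC", "https://example.org")
miss = find_known_mapping(known, "ttc", "https://example.org")
```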
API proposal
GET /v1/gtfs/datasets/{dataset_id}/agency-txt
GET /v1/gtfs/datasets/{dataset_id}/feed-info-txt
GET /v1/feeds/{feed_id}/feed-providers?role=agency|publisher
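The role filter on the last endpoint could be backed by something like the sketch below. It assumes that agency_txt records yield the agency role and feed_info_txt records yield the publisher role, and that unlinked records produce no role; `RawRecord`, `generate_roles`, and `filter_roles` are hypothetical names, not part of the proposal.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed mapping from raw table to feed-level role.
ROLE_BY_SOURCE = {"agency_txt": "agency", "feed_info_txt": "publisher"}


@dataclass(frozen=True)
class RawRecord:
    source: str                      # "agency_txt" or "feed_info_txt"
    feed_provider_id: Optional[int]  # None while awaiting human review


def generate_roles(feed_id: str, latest_records: Iterable[RawRecord]) -> set:
    """Build (feed_id, feed_provider_id, role) rows from the latest dataset's records."""
    return {
        (feed_id, rec.feed_provider_id, ROLE_BY_SOURCE[rec.source])
        for rec in latest_records
        if rec.feed_provider_id is not None  # unlinked records yield no role
    }


def filter_roles(roles: set, role: Optional[str] = None) -> list:
    """Apply the ?role= query parameter; None returns every role."""
    if role is None:
        return sorted(roles)
    if role not in ROLE_BY_SOURCE.values():
        raise ValueError(f"unsupported role: {role}")
    return sorted(r for r in roles if r[2] == role)
```

Using a set means a provider that appears in several agency rows of the same dataset still produces a single role row per feed.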
Acceptance criteria
- agency.txt and feed_info.txt are extracted.
- Identical raw records are stored once and linked to datasets.
- Raw records can be linked to a feed_provider.
- feed_provider can have multiple names and contacts.
- feed_provider.organization_id is optional.
- Feed-level provider roles are generated from linked records.
- Exact known mappings are reused.
- No fuzzy matching or confidence score is added.