Add Python tool scaffold for PokeDB data import

Set up tools/import-pokedb/ with CLI, JSON loader, and output models. Replaces the Go/PokeAPI approach with local PokeDB.org JSON processing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 09:49:51 +01:00
parent 5151be785b
commit 1aa67665ff
11 changed files with 522 additions and 23 deletions
--- a/.beans/nuzlocke-tracker-bs05--build-pokedborg-encounter-data-scraper.md
+++ b/.beans/nuzlocke-tracker-bs05--build-pokedborg-encounter-data-scraper.md
@@ -1,17 +1,22 @@
 ---
 # nuzlocke-tracker-bs05
 title: Build PokeDB.org data import tool
-status: draft
-type: task
+status: in-progress
+type: feature
 priority: normal
 created_at: 2026-02-10T14:04:11Z
-updated_at: 2026-02-10T14:31:08Z
+updated_at: 2026-02-11T08:44:03Z
 parent: nuzlocke-tracker-rzu4
 blocking:
    - nuzlocke-tracker-spx3
 ---

-Build a Go tool that converts PokeDB.org's JSON data export into our existing seed JSON format. This replaces PokeAPI as the single source of truth for ALL games (Gen 1-9).
+Build a standalone Python tool that converts PokeDB.org's JSON data export into our existing seed JSON format. This replaces PokeAPI as the single source of truth for ALL games (Gen 1-9).
+
+Python was chosen over Go because:
+- The backend is already Python, so the team is familiar with it
+- We're processing local JSON files — no need for Go's concurrency
+- Remains a standalone tool in `tools/import-pokedb/`, not part of the backend

 ## Data source

@@ -64,26 +69,15 @@ Each encounter record has:
 - `visible` — overworld vs hidden encounter
 - Max Raid and Tera Raid fields for special encounters

-## Implementation approach
+## Subtasks

-### Checklist
-- [ ] Set up project structure in `tools/import-pokedb/`
-- [ ] Download and cache PokeDB JSON export files
-- [ ] Parse PokeDB encounters, locations, location_areas, versions, pokemon_forms
-- [ ] Build lookup maps: pokemon_form_identifier → pokeapi_id (using existing `pokemon.json`)
-- [ ] Build lookup maps: location_area_identifier → location name + region
-- [ ] Filter encounters by target game version
-- [ ] Map PokeDB encounter methods to our seed format methods (73 → simplified set)
-- [ ] Parse level strings ("2 - 4" → min_level: 2, max_level: 4)
-- [ ] Handle rate variants per game generation:
-  - For now, flatten time/weather/season rates into `encounter_rate` (use the max or average)
-  - Preserve raw variant data for future use (see nuzlocke-tracker-oqfo)
-- [ ] Group encounters by location area → route output
-- [ ] Apply route ordering (use existing route_order.json or generate from location data)
-- [ ] Output in existing `{game}.json` seed format
-- [ ] Generate seed data for ALL games, replacing PokeAPI as the single source of truth
-- [ ] Compare output against existing PokeAPI-sourced data to validate accuracy
-- [ ] Run for all games and verify output
+Work is broken into child task beans:
+
+- [ ] **Set up Python tool scaffold** — project structure, CLI entry point, PokeDB JSON file loading
+- [ ] **Build reference data mappings** — pokemon_form → pokeapi_id, location_area → name/region, encounter method mapping
+- [ ] **Core encounter processing** — filter by game version, parse levels, handle rate variants, group by location area
+- [ ] **Output seed JSON** — produce per-game JSON in existing format, integrate route ordering + special encounters
+- [ ] **Validation & full generation** — compare against existing data, run for all games, fix discrepancies

 ## Encounter method mapping (draft)

--- a/.beans/nuzlocke-tracker-dqyb--set-up-python-tool-scaffold.md
+++ b/.beans/nuzlocke-tracker-dqyb--set-up-python-tool-scaffold.md
@@ -0,0 +1,30 @@
+---
+# nuzlocke-tracker-dqyb
+title: Set up Python tool scaffold
+status: in-progress
+type: task
+priority: normal
+created_at: 2026-02-11T08:42:58Z
+updated_at: 2026-02-11T08:44:03Z
+parent: nuzlocke-tracker-bs05
+blocking:
+    - nuzlocke-tracker-zno2
+---
+
+Set up the standalone Python tool project in `tools/import-pokedb/`.
+
+## Checklist
+
+- [x] Create `tools/import-pokedb/` directory structure
+- [x] Set up `pyproject.toml` with dependencies (just stdlib should suffice for JSON processing, maybe `click` for CLI)
+- [x] Create CLI entry point (`__main__.py` or similar) that accepts:
+  - Path to directory containing PokeDB JSON export files
+  - Target output directory (default: `backend/src/app/seeds/data/`)
+  - Optional: specific game version to generate (default: all)
+- [x] Load and parse all PokeDB JSON files: `encounters.json`, `locations.json`, `location_areas.json`, `encounter_methods.json`, `versions.json`, `pokemon_forms.json`
+- [x] Basic validation that all expected files are present and parseable
+
+## Notes
+- Keep it as a standalone tool, not part of the backend
+- The PokeDB JSON files are downloaded manually from https://pokedb.org/data-export — no need to automate the download
+- Model the CLI similarly to how `tools/fetch-pokeapi/` works (cd into dir, run the tool)
--- a/.beans/nuzlocke-tracker-gkcy--output-seed-json.md
+++ b/.beans/nuzlocke-tracker-gkcy--output-seed-json.md
@@ -0,0 +1,31 @@
+---
+# nuzlocke-tracker-gkcy
+title: Output seed JSON
+status: todo
+type: task
+priority: normal
+created_at: 2026-02-11T08:43:21Z
+updated_at: 2026-02-11T08:43:33Z
+parent: nuzlocke-tracker-bs05
+blocking:
+    - nuzlocke-tracker-vdks
+---
+
+Generate the final per-game JSON files in the existing seed format.
+
+## Checklist
+
+- [ ] **Apply route ordering**: Use the existing `backend/src/app/seeds/route_order.json` to assign `order` values to routes. Handle aliases (e.g. "red-blue" → "firered-leafgreen"). Log warnings for routes not in the order file.
+- [ ] **Merge special encounters**: Integrate starters, gifts, fossils, and trades from `backend/src/app/seeds/special_encounters.json` into the appropriate routes.
+- [ ] **Output per-game JSON**: Write `{game-slug}.json` files matching the existing format:
+  ```json
+  [{"name": "Route 1", "order": 3, "encounters": [...], "children": []}]
+  ```
+- [ ] **Output games.json**: Generate the global games list from `version_groups.json` (this may already be handled by existing config, verify).
+- [ ] **Output pokemon.json**: Generate the global pokemon list including all pokemon referenced in any encounter. Include pokeapi_id, national_dex, name, types, sprite_url.
+- [ ] **Handle version exclusives**: Ensure encounters specific to one version in a version group only appear in that game's JSON file (e.g. FireRed exclusives vs LeafGreen exclusives).
+
+## Notes
+- The output must be a drop-in replacement for the existing files in `backend/src/app/seeds/data/`
+- Boss data (`{game}-bosses.json`) is NOT generated by this tool — it's manually curated
+- Evolutions data is also separate (currently from PokeAPI) — out of scope for this task
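Review note: the route-ordering item in the checklist above could be sketched roughly as follows. This is only a sketch under assumptions: `apply_route_order` is a hypothetical helper, routes are plain dicts in the seed format, `order` values are assumed to be 1-based, and placing unlisted routes after the ordered ones is one possible policy that the bean does not specify.

```python
import json
import sys


def apply_route_order(routes: list[dict], order: list[str]) -> list[dict]:
    """Assign 1-based `order` values from an ordered list of route names.

    Hypothetical helper: routes missing from `order` trigger a warning and are
    appended after the known routes, sorted by name (an assumed policy).
    """
    position = {name: i + 1 for i, name in enumerate(order)}
    known: list[dict] = []
    unknown: list[dict] = []
    for route in routes:
        if route["name"] in position:
            known.append({**route, "order": position[route["name"]]})
        else:
            print(f"Warning: route {route['name']!r} not in route order", file=sys.stderr)
            unknown.append(route)
    known.sort(key=lambda r: r["order"])
    next_order = len(order) + 1
    for offset, route in enumerate(sorted(unknown, key=lambda r: r["name"])):
        known.append({**route, "order": next_order + offset})
    return known


# Made-up sample data for illustration only.
routes = [
    {"name": "Viridian Forest", "encounters": [], "children": []},
    {"name": "Route 1", "encounters": [], "children": []},
    {"name": "Mystery Cave", "encounters": [], "children": []},
]
ordered = apply_route_order(routes, ["Route 1", "Viridian Forest"])
print(json.dumps(ordered, indent=2))
```

Returning new dicts rather than mutating the input keeps the step easy to test in isolation; whether unlisted routes should be kept at all is a decision for the gkcy task.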
--- a/.beans/nuzlocke-tracker-rfg0--core-encounter-processing.md
+++ b/.beans/nuzlocke-tracker-rfg0--core-encounter-processing.md
@@ -0,0 +1,34 @@
+---
+# nuzlocke-tracker-rfg0
+title: Core encounter processing
+status: todo
+type: task
+priority: normal
+created_at: 2026-02-11T08:43:12Z
+updated_at: 2026-02-11T08:43:33Z
+parent: nuzlocke-tracker-bs05
+blocking:
+    - nuzlocke-tracker-gkcy
+---
+
+Implement the core logic that transforms raw PokeDB encounter records into our internal format.
+
+## Checklist
+
+- [ ] **Filter by game version**: Given a target game slug, select only encounters where `version_identifiers` includes that game
+- [ ] **Parse level strings**: Convert "2 - 4" → min_level=2, max_level=4; "67" → min_level=67, max_level=67
+- [ ] **Handle rate variants per generation**:
+  - Gen 1/3/6: use `rate_overall` directly as `encounter_rate`
+  - Gen 2/4: `rate_morning`, `rate_day`, `rate_night` — flatten to max or average for `encounter_rate`
+  - Gen 5: `rate_spring` through `rate_winter` — flatten similarly
+  - Gen 8 Sw/Sh: `weather_*_rate` fields — flatten to max
+  - Gen 8 Legends Arceus: `during_*` / `while_*` booleans — convert to a presence-based rate
+  - Gen 9 Sc/Vi: `probability_*` fields (spawn weights, not percentages) — normalize to percentages
+  - Preserve raw variant data in a way that nuzlocke-tracker-oqfo can use later
+- [ ] **Aggregate encounters**: Group by (pokemon, method, location_area) and merge level ranges / rates where appropriate (same logic as the Go tool's aggregation)
+- [ ] **Group by location area**: Collect all encounters for a location area into a route structure
+- [ ] **Handle parent/child routes**: Multi-area locations (e.g. Safari Zone) should produce parent routes with children, matching the existing hierarchical format
+
+## Notes
+- Rate parsing needs to handle percentage strings like "40%" as well as bare numbers
+- The Go tool aggregates encounters with the same pokemon+method at a location into a single entry with merged level ranges — replicate this
--- a/.beans/nuzlocke-tracker-vdks--validation-and-full-generation.md
+++ b/.beans/nuzlocke-tracker-vdks--validation-and-full-generation.md
@@ -0,0 +1,29 @@
+---
+# nuzlocke-tracker-vdks
+title: Validation and full generation
+status: todo
+type: task
+created_at: 2026-02-11T08:43:29Z
+updated_at: 2026-02-11T08:43:29Z
+parent: nuzlocke-tracker-bs05
+---
+
+Validate the new tool's output against existing data and generate seed data for all games.
+
+## Checklist
+
+- [ ] **Diff against existing data**: For games we already have PokeAPI-sourced data for, compare the PokeDB output. Identify and investigate discrepancies:
+  - Missing routes or encounters
+  - Different encounter rates
+  - Different level ranges
+  - Missing or extra pokemon
+- [ ] **Fix discrepancies**: Adjust mappings, parsing, or aggregation logic to resolve legitimate differences. Document cases where PokeDB provides better/different data than PokeAPI.
+- [ ] **Generate for all games**: Run the tool for every game version in `version_groups.json`. Verify output is valid JSON and structurally correct.
+- [ ] **New game coverage**: For games not previously supported (or with incomplete PokeAPI data), verify the output looks reasonable by spot-checking a few routes.
+- [ ] **Update route_order.json**: Add route orderings for any new games that didn't have entries. This may require manual curation.
+- [ ] **Update special_encounters.json**: Add special encounters for any new games. This may require manual curation.
+
+## Notes
+- This is the final validation step before we can replace PokeAPI as the data source
+- Some discrepancies are expected — PokeDB may have more complete data than PokeAPI
+- Route ordering for new games will likely need manual work
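Review note: the "diff against existing data" item above could start from something like the following sketch. The function names are hypothetical; the field names (`name`, `encounters`, `children`, `pokeapi_id`, `method`, `encounter_rate`, `min_level`, `max_level`) are taken from `models.py` in this commit. Keying encounters by (route name, pokeapi_id, method) is an assumption about what counts as "the same encounter".

```python
def index_encounters(routes: list[dict]) -> dict[tuple[str, int, str], dict]:
    """Flatten routes, including children, into {(route, pokeapi_id, method): encounter}."""
    index: dict[tuple[str, int, str], dict] = {}
    stack = list(routes)
    while stack:
        route = stack.pop()
        stack.extend(route.get("children", []))
        for enc in route.get("encounters", []):
            index[(route["name"], enc["pokeapi_id"], enc["method"])] = enc
    return index


def diff_seed_data(old: list[dict], new: list[dict]) -> dict[str, list]:
    """Report encounters missing from `new`, extra in `new`, and changed fields."""
    old_idx, new_idx = index_encounters(old), index_encounters(new)
    report: dict[str, list] = {
        "missing": sorted(old_idx.keys() - new_idx.keys()),
        "extra": sorted(new_idx.keys() - old_idx.keys()),
        "changed": [],
    }
    for key in sorted(old_idx.keys() & new_idx.keys()):
        for field in ("encounter_rate", "min_level", "max_level"):
            before, after = old_idx[key][field], new_idx[key][field]
            if before != after:
                report["changed"].append((key, field, before, after))
    return report


def enc(pokeapi_id: int, name: str, rate: int, lo: int, hi: int) -> dict:
    """Build a sample encounter dict (illustration only)."""
    return {"pokeapi_id": pokeapi_id, "pokemon_name": name, "method": "walk",
            "encounter_rate": rate, "min_level": lo, "max_level": hi}


# Made-up sample data for illustration only.
old = [{"name": "Route 1", "order": 1,
        "encounters": [enc(16, "Pidgey", 50, 2, 5), enc(19, "Rattata", 50, 2, 4)]}]
new = [{"name": "Route 1", "order": 1,
        "encounters": [enc(16, "Pidgey", 45, 2, 5), enc(21, "Spearow", 5, 3, 5)]}]
report = diff_seed_data(old, new)
```

A per-game report of this shape would make it easier to separate expected differences (PokeDB being more complete) from mapping or parsing bugs.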
--- a/.beans/nuzlocke-tracker-zno2--build-reference-data-mappings.md
+++ b/.beans/nuzlocke-tracker-zno2--build-reference-data-mappings.md
@@ -0,0 +1,26 @@
+---
+# nuzlocke-tracker-zno2
+title: Build reference data mappings
+status: todo
+type: task
+priority: normal
+created_at: 2026-02-11T08:43:02Z
+updated_at: 2026-02-11T08:43:33Z
+parent: nuzlocke-tracker-bs05
+blocking:
+    - nuzlocke-tracker-rfg0
+---
+
+Build the lookup maps needed to translate PokeDB identifiers into our seed format.
+
+## Checklist
+
+- [ ] **Pokemon form mapping**: Map `pokemon_form_identifier` (e.g. "pidgey-default", "mr-mime-default") to `pokeapi_id` using the existing `backend/src/app/seeds/data/pokemon.json` as reference. Handle naming convention differences between PokeDB and PokeAPI (may need fuzzy matching or a manual override table).
+- [ ] **Location area mapping**: Map `location_area_identifier` to human-readable location names and regions using `locations.json` and `location_areas.json`. Produce names matching our existing format (e.g. "Route 1", "Viridian Forest").
+- [ ] **Encounter method mapping**: Map PokeDB's 73 encounter methods to our simplified set. See the draft mapping in the parent bean. Implement as a dictionary/config that's easy to extend.
+- [ ] **Version mapping**: Map PokeDB `version_identifiers` to our game slugs (should mostly be 1:1 but verify).
+
+## Notes
+- The pokemon form mapping is the trickiest part — PokeDB uses identifiers like "mr-mime-default" while our pokemon.json uses names like "Mr. Mime" and pokeapi IDs
+- Log warnings for any unmapped identifiers so we can add overrides
+- The `pokemon_forms.json` from PokeDB may help bridge the gap
--- a/tools/import-pokedb/import_pokedb/__init__.py
+++ b/tools/import-pokedb/import_pokedb/__init__.py
--- a/tools/import-pokedb/import_pokedb/__main__.py
+++ b/tools/import-pokedb/import_pokedb/__main__.py
@@ -0,0 +1,115 @@
+"""CLI entry point for the PokeDB import tool.
+
+Usage:
+    # From repo root (requires the package to be installed or on PYTHONPATH):
+    python -m import_pokedb ./pokedb-export/
+
+    # With options:
+    python -m import_pokedb ./pokedb-export/ --output backend/src/app/seeds/data/ --game firered
+"""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from pathlib import Path
+
+from .loader import load_pokedb_data, load_seed_config
+
+SEEDS_DIR_CANDIDATES = [
+    Path("backend/src/app/seeds"),                # from repo root
+    Path("../../backend/src/app/seeds"),           # from tools/import-pokedb/
+]
+
+
+def find_seeds_dir() -> Path:
+    """Locate the backend seeds directory."""
+    for candidate in SEEDS_DIR_CANDIDATES:
+        if (candidate / "version_groups.json").exists():
+            return candidate.resolve()
+    # Fallback
+    return Path("backend/src/app/seeds").resolve()
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(
+        prog="import-pokedb",
+        description="Convert PokeDB.org JSON data exports into nuzlocke-tracker seed format.",
+    )
+    parser.add_argument(
+        "pokedb_dir",
+        type=Path,
+        help="Path to directory containing PokeDB JSON export files",
+    )
+    parser.add_argument(
+        "--output",
+        type=Path,
+        default=None,
+        help="Output directory for seed JSON files (default: backend/src/app/seeds/data/)",
+    )
+    parser.add_argument(
+        "--game",
+        type=str,
+        default=None,
+        help="Generate data for a specific game slug only (default: all games)",
+    )
+    return parser
+
+
+def main(argv: list[str] | None = None) -> None:
+    parser = build_parser()
+    args = parser.parse_args(argv)
+
+    pokedb_dir: Path = args.pokedb_dir
+    if not pokedb_dir.is_dir():
+        print(f"Error: {pokedb_dir} is not a directory", file=sys.stderr)
+        sys.exit(1)
+
+    seeds_dir = find_seeds_dir()
+    output_dir: Path = args.output or (seeds_dir / "data")
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    print(f"PokeDB data:  {pokedb_dir.resolve()}")
+    print(f"Seeds config: {seeds_dir}")
+    print(f"Output:       {output_dir.resolve()}")
+    print()
+
+    # Load PokeDB export data
+    pokedb = load_pokedb_data(pokedb_dir)
+    print(pokedb.summary())
+    print()
+
+    # Load existing seed configuration
+    config = load_seed_config(seeds_dir)
+    print(f"Loaded {len(config.version_groups)} version groups")
+    print(f"Loaded route order for {len(config.route_order)} version groups")
+    if config.special_encounters:
+        se_count = len(config.special_encounters.get("encounters", {}))
+        print(f"Loaded special encounters for {se_count} version groups")
+    print()
+
+    # Determine which games to process
+    target_game = args.game
+    if target_game:
+        found = False
+        for vg_info in config.version_groups.values():
+            if target_game in vg_info.get("versions", []):
+                found = True
+                break
+        if not found:
+            print(f"Error: Game '{target_game}' not found in version_groups.json", file=sys.stderr)
+            sys.exit(1)
+        print(f"Target: {target_game}")
+    else:
+        total_games = sum(
+            len(vg.get("versions", []))
+            for vg in config.version_groups.values()
+        )
+        print(f"Target: all {total_games} games")
+
+    # TODO: Processing pipeline (subtasks zno2, rfg0, gkcy)
+    print("\nScaffold loaded successfully. Processing pipeline not yet implemented.")
+
+
+if __name__ == "__main__":
+    main()
--- a/tools/import-pokedb/import_pokedb/loader.py
+++ b/tools/import-pokedb/import_pokedb/loader.py
@@ -0,0 +1,150 @@
+"""Load and validate PokeDB JSON export files."""
+
+from __future__ import annotations
+
+import json
+import sys
+from pathlib import Path
+from typing import Any
+
+REQUIRED_FILES = [
+    "encounters.json",
+    "locations.json",
+    "location_areas.json",
+    "encounter_methods.json",
+    "versions.json",
+    "pokemon_forms.json",
+]
+
+
+class PokeDBData:
+    """Container for all loaded PokeDB export data."""
+
+    def __init__(
+        self,
+        encounters: list[dict[str, Any]],
+        locations: list[dict[str, Any]],
+        location_areas: list[dict[str, Any]],
+        encounter_methods: list[dict[str, Any]],
+        versions: list[dict[str, Any]],
+        pokemon_forms: list[dict[str, Any]],
+    ) -> None:
+        self.encounters = encounters
+        self.locations = locations
+        self.location_areas = location_areas
+        self.encounter_methods = encounter_methods
+        self.versions = versions
+        self.pokemon_forms = pokemon_forms
+
+    def summary(self) -> str:
+        return (
+            f"PokeDB data loaded:\n"
+            f"  encounters:        {len(self.encounters):,}\n"
+            f"  locations:         {len(self.locations):,}\n"
+            f"  location_areas:    {len(self.location_areas):,}\n"
+            f"  encounter_methods: {len(self.encounter_methods):,}\n"
+            f"  versions:          {len(self.versions):,}\n"
+            f"  pokemon_forms:     {len(self.pokemon_forms):,}"
+        )
+
+
+def load_pokedb_data(data_dir: Path) -> PokeDBData:
+    """Load all PokeDB JSON export files from a directory.
+
+    Exits with an error message if any required files are missing or unparseable.
+    """
+    missing = [f for f in REQUIRED_FILES if not (data_dir / f).exists()]
+    if missing:
+        print(
+            f"Error: Missing required PokeDB files in {data_dir}:",
+            file=sys.stderr,
+        )
+        for f in missing:
+            print(f"  - {f}", file=sys.stderr)
+        print(
+            "\nDownload the JSON export from https://pokedb.org/data-export",
+            file=sys.stderr,
+        )
+        sys.exit(1)
+
+    def _load(filename: str) -> list[dict[str, Any]]:
+        path = data_dir / filename
+        try:
+            with open(path, encoding="utf-8") as f:
+                data = json.load(f)
+        except ValueError as e:  # covers json.JSONDecodeError and UnicodeDecodeError
+            print(f"Error: Failed to parse {path}: {e}", file=sys.stderr)
+            sys.exit(1)
+
+        if not isinstance(data, list):
+            print(
+                f"Error: Expected a JSON array in {path}, got {type(data).__name__}",
+                file=sys.stderr,
+            )
+            sys.exit(1)
+
+        return data
+
+    return PokeDBData(
+        encounters=_load("encounters.json"),
+        locations=_load("locations.json"),
+        location_areas=_load("location_areas.json"),
+        encounter_methods=_load("encounter_methods.json"),
+        versions=_load("versions.json"),
+        pokemon_forms=_load("pokemon_forms.json"),
+    )
+
+
+class SeedConfig:
+    """Container for existing seed configuration files."""
+
+    def __init__(
+        self,
+        version_groups: dict[str, Any],
+        route_order: dict[str, list[str]],
+        special_encounters: dict[str, Any] | None,
+    ) -> None:
+        self.version_groups = version_groups
+        self.route_order = route_order
+        self.special_encounters = special_encounters
+
+
+def load_seed_config(seeds_dir: Path) -> SeedConfig:
+    """Load existing seed configuration files (version_groups, route_order, etc.).
+
+    Exits with an error message if required config files are missing.
+    """
+    vg_path = seeds_dir / "version_groups.json"
+    if not vg_path.exists():
+        print(f"Error: version_groups.json not found at {vg_path}", file=sys.stderr)
+        sys.exit(1)
+
+    with open(vg_path, encoding="utf-8") as f:
+        version_groups = json.load(f)
+
+    # Load route_order.json and resolve aliases
+    ro_path = seeds_dir / "route_order.json"
+    if not ro_path.exists():
+        print(f"Error: route_order.json not found at {ro_path}", file=sys.stderr)
+        sys.exit(1)
+
+    with open(ro_path, encoding="utf-8") as f:
+        ro_raw = json.load(f)
+
+    route_order: dict[str, list[str]] = dict(ro_raw.get("routes", {}))
+    for alias, target in ro_raw.get("aliases", {}).items():
+        if target in route_order:
+            route_order[alias] = route_order[target]
+
+    # Load special_encounters.json (optional)
+    se_path = seeds_dir / "special_encounters.json"
+    special_encounters = None
+    if se_path.exists():
+        with open(se_path, encoding="utf-8") as f:
+            special_encounters = json.load(f)
+
+    return SeedConfig(
+        version_groups=version_groups,
+        route_order=route_order,
+        special_encounters=special_encounters,
+    )
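Review note: the alias handling in `load_seed_config` may be easier to follow with a small worked example. The sketch below lifts the same loop into a standalone function (the name `resolve_route_order` is hypothetical) and runs it on made-up data shaped like the `routes` / `aliases` keys the loader reads.

```python
def resolve_route_order(ro_raw: dict) -> dict[str, list[str]]:
    """Copy `routes`, then point each alias at its target's route list."""
    route_order: dict[str, list[str]] = dict(ro_raw.get("routes", {}))
    for alias, target in ro_raw.get("aliases", {}).items():
        if target in route_order:
            route_order[alias] = route_order[target]
    return route_order


# Made-up sample data for illustration only.
ro_raw = {
    "routes": {"firered-leafgreen": ["Route 1", "Viridian Forest"]},
    "aliases": {"red-blue": "firered-leafgreen", "orphan": "no-such-group"},
}
resolved = resolve_route_order(ro_raw)
print(sorted(resolved))
```

Two behaviors are worth noting: an alias whose target is missing is dropped without a warning, and an alias shares the same list object as its target, so mutating one in place would change the other.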
--- a/tools/import-pokedb/import_pokedb/models.py
+++ b/tools/import-pokedb/import_pokedb/models.py
@@ -0,0 +1,81 @@
+"""Output data models matching the existing seed JSON format."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+
+@dataclass
+class Encounter:
+    pokeapi_id: int
+    pokemon_name: str
+    method: str
+    encounter_rate: int
+    min_level: int
+    max_level: int
+
+    def to_dict(self) -> dict:
+        return {
+            "pokeapi_id": self.pokeapi_id,
+            "pokemon_name": self.pokemon_name,
+            "method": self.method,
+            "encounter_rate": self.encounter_rate,
+            "min_level": self.min_level,
+            "max_level": self.max_level,
+        }
+
+
+@dataclass
+class Route:
+    name: str
+    order: int
+    encounters: list[Encounter] = field(default_factory=list)
+    children: list[Route] = field(default_factory=list)
+
+    def to_dict(self) -> dict:
+        d: dict = {
+            "name": self.name,
+            "order": self.order,
+            "encounters": [e.to_dict() for e in self.encounters],
+        }
+        if self.children:
+            d["children"] = [c.to_dict() for c in self.children]
+        return d
+
+
+@dataclass
+class Game:
+    name: str
+    slug: str
+    generation: int
+    region: str
+    release_year: int
+    color: str | None = None
+
+    def to_dict(self) -> dict:
+        return {
+            "name": self.name,
+            "slug": self.slug,
+            "generation": self.generation,
+            "region": self.region,
+            "release_year": self.release_year,
+            "color": self.color,
+        }
+
+
+@dataclass
+class Pokemon:
+    pokeapi_id: int
+    national_dex: int
+    name: str
+    types: list[str]
+    sprite_url: str
+
+    def to_dict(self) -> dict:
+        return {
+            "pokeapi_id": self.pokeapi_id,
+            "national_dex": self.national_dex,
+            "name": self.name,
+            "types": self.types,
+            "sprite_url": self.sprite_url,
+        }
--- a/tools/import-pokedb/pyproject.toml
+++ b/tools/import-pokedb/pyproject.toml
@@ -0,0 +1,9 @@
+[project]
+name = "import-pokedb"
+version = "0.1.0"
+description = "Convert PokeDB.org JSON data exports into nuzlocke-tracker seed format"
+requires-python = ">=3.12"
+dependencies = []
+
+[project.scripts]
+import-pokedb = "import_pokedb.__main__:main"