---
title: Data Collection Workflow
---

# Specification: Data Collection Workflow

## Goal

The collector aggregates information from GitLab (groups, projects, README files, issues) into a unified, enriched dataset that downstream renderers consume. It orchestrates calls to the [GitLab API Client](./spec_api_client.md) and produces structured *overview rows* (see [Model Mapping](./spec_model_mapping.md)).

---

## 1. Happy-Path Flow

1. Retrieve the list of groups visible to the configured API token(s).
2. For every group, retrieve its projects.
3. For each project:
   * Determine the branch used for README lookup: prefer `default_branch` from the project metadata; fall back to `main`.
   * Request the raw content of `README.md`; on 404, treat the README as *missing*.
   * Request the issues list. Some projects may not have issue tracking, yielding `issues=None`.
4. Transform the raw JSON/text responses into domain objects via model mapping (see related spec) and compose an *overview row* containing:
   * Group object.
   * Project object.
   * Optional README object.
   * Optional list of Issue objects (may be an empty list).
   * Extra metadata extracted from README front-matter (author, priority, etc.).
5. Return the list of rows to the caller, preserving the original discovery order.

---

## 2. Failure Handling

| Failure point | Result |
|---------------|--------|
| Group listing request returns error | Raise *Collector Error* and abort collection. |
| Project listing for a single group fails | Propagate the error and abort collection (no partial results). |
| README fetch returns 404 | Record `readme = None`; continue processing other artefacts. |
| README fetch returns non-404 error | Propagate as *Collector Error*. |
| Issues request fails | Record `issues = None`; continue processing other artefacts. |

Errors are *not* silently ignored, except in the two explicitly graceful cases above: a missing README and a project without an issue tracker.

---

## 3. Concurrency & Rate Limits

* Implementations may perform project-level fetches in parallel, but **must honour** the rate-limit handling strategy defined in the API client.
* Parallelism must not reorder the final output; order is defined by the input discovery sequence (§1, step 5).

---

## 4. Output Contract

* Returns an **ordered, in-memory collection** of overview rows.
* `None` always means that the feature is absent or the resource was not found (for example, no issue tracker or no README). Empty values (such as `""` or `[]`) indicate that the resource exists in the API but has no content.
* No persistence or caching is performed at this layer.
* Consumers apply sorting and grouping according to their own needs (see [Table Sorting](./spec_table_sorting.md)).

---

## 5. Non-Goals

* Command-line parsing, configuration merging, or environment handling (see [Settings](./spec_settings.md)).
* Rendering concerns of any kind; these are addressed by higher-level specs.
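
To illustrate how the happy-path flow (§1), the failure table (§2), and the ordering rule (§3) fit together, here is a minimal Python sketch. It is not the reference implementation. Every name in it (`ApiError`, `OverviewRow`, `collect_project`, `collect`, `FakeClient`, and the client methods `list_groups`, `list_projects`, `get_readme`, `list_issues`) is hypothetical and not taken from the API client spec. Model mapping and front-matter extraction are left out, rows hold plain dictionaries and strings instead of domain objects, and rate-limit handling is assumed to live inside the client.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Optional


class CollectorError(Exception):
    """Raised when collection must abort (see the failure table in §2)."""


class ApiError(Exception):
    """Hypothetical client error carrying the HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"API request failed with status {status}")
        self.status = status


@dataclass
class OverviewRow:
    group: dict[str, Any]
    project: dict[str, Any]
    readme: Optional[str]        # None: README not found (404)
    issues: Optional[list[Any]]  # None: issues unavailable; []: tracker present, no issues


def collect_project(client: Any, group: dict[str, Any], project: dict[str, Any]) -> OverviewRow:
    # Prefer the project's default branch; fall back to "main".
    branch = project.get("default_branch") or "main"
    try:
        readme = client.get_readme(project["id"], branch)
    except ApiError as exc:
        if exc.status != 404:
            raise CollectorError(f"README fetch failed for project {project['id']}") from exc
        readme = None  # graceful case: README is missing
    try:
        issues = client.list_issues(project["id"])
    except ApiError:
        issues = None  # graceful case: record None and continue
    return OverviewRow(group, project, readme, issues)


def collect(client: Any, max_workers: int = 4) -> list[OverviewRow]:
    try:
        pairs = [
            (group, project)
            for group in client.list_groups()
            for project in client.list_projects(group["id"])
        ]
    except ApiError as exc:
        # Any listing failure aborts the run; no partial results are returned.
        raise CollectorError("group or project listing failed") from exc
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so discovery order is kept.
        return list(pool.map(lambda pair: collect_project(client, *pair), pairs))


class FakeClient:
    """In-memory stand-in for the API client, for illustration only."""

    def list_groups(self) -> list[dict[str, Any]]:
        return [{"id": 1}]

    def list_projects(self, group_id: int) -> list[dict[str, Any]]:
        return [{"id": 10, "default_branch": "dev"}, {"id": 11}]

    def get_readme(self, project_id: int, ref: str) -> str:
        if project_id == 11:
            raise ApiError(404)
        return f"README of {project_id} @ {ref}"

    def list_issues(self, project_id: int) -> list[Any]:
        if project_id == 11:
            raise ApiError(403)
        return []


rows = collect(FakeClient())
```

In this sketch, project 10 ends up with a README read from its `dev` branch and an empty issue list, while project 11 ends up with `readme=None` and `issues=None`, matching the distinction between `None` and empty values in §4. A thread pool is only one possible way to parallelise project-level fetches; a sequential loop satisfies the same contract.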