Specification: ReadmeExtract Model

Purpose

Defines the ReadmeExtract model class, which holds information interpreted from README frontmatter rather than copied verbatim from the original Markdown.

1. Class Definition

1.1 ReadmeExtract Model

from typing import Any

from pydantic import BaseModel, Field


class ReadmeExtract(BaseModel):
    """Extracted and interpreted information from README frontmatter."""

    # Frontmatter-derived fields
    authors: list[str] = Field(default_factory=list)
    supervisors: list[str] = Field(default_factory=list)

    # Raw frontmatter for reference
    raw_frontmatter: dict[str, Any] = Field(default_factory=dict)

2. Field Definitions

2.1 Frontmatter-Derived Fields

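| Field           | Type           | Description                        | Source                                |
|-----------------|----------------|------------------------------------|---------------------------------------|
| authors         | list[str]      | All author names                   | authors/author keys, processed per §3 |
| supervisors     | list[str]      | Author names with Supervision role | authors/author keys, processed per §3 |
| raw_frontmatter | dict[str, Any] | Complete frontmatter for reference | All YAML frontmatter                  |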

3. Authors/Supervisors Processing

3.1 Input Formats

The authors/author field in frontmatter can be:

  • String: Single author name

  • List of strings: Multiple author names

  • List of dicts: Author objects with name and roles fields

  • Dict: Single author object with name and roles fields
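The four input shapes can be reduced to one uniform representation before any further processing. The following is a minimal sketch under stated assumptions, not the reference implementation: the function name normalize_authors and the (name, roles) pair representation are hypothetical, and tolerating a single role given as a plain string is an assumption this spec does not make.

```python
from typing import Any


def normalize_authors(value: Any) -> list[tuple[str, list[str]]]:
    """Flatten any supported authors shape into (name, roles) pairs.

    Hypothetical helper: the name and the pair representation are
    illustrative assumptions, not part of the spec.
    """
    if isinstance(value, str):
        # Plain string: a single author without roles.
        return [(value, [])]
    if isinstance(value, dict):
        # Author object: skip it when the name is missing or not a string.
        name = value.get("name")
        if not isinstance(name, str) or not name:
            return []
        roles = value.get("roles")
        if isinstance(roles, str):
            roles = [roles]  # assumption: accept one role given as a string
        elif not isinstance(roles, list):
            roles = []
        return [(name, [r for r in roles if isinstance(r, str)])]
    if isinstance(value, list):
        # List: process each item recursively and concatenate the results.
        pairs: list[tuple[str, list[str]]] = []
        for item in value:
            pairs.extend(normalize_authors(item))
        return pairs
    # Anything else (None, numbers, ...) yields no authors.
    return []
```

With this shape, a mixed list such as ["Alice Example", {"name": "Bob Example", "roles": ["Supervision"]}] becomes two pairs, the first with an empty role list.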

3.2 Processing Rules

  1. All authors: Extract all author names regardless of roles

  2. Supervisors: Extract author names whose roles list contains “Supervision” (case-insensitive)

  3. Name extraction:

    • String: Use as-is

    • Dict: Extract name field, skip if missing

    • List: Process each item recursively

  4. Deduplication: Remove duplicate author names (case-insensitive)
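Rules 2 and 4 above can be sketched as two small functions operating on (name, roles) pairs. This is an illustrative sketch only: the function names are hypothetical, and keeping the first spelling of a duplicated name and trimming surrounding whitespace are assumptions, since the rules do not say which duplicate survives.

```python
def dedupe_names(names: list[str]) -> list[str]:
    """Drop case-insensitive duplicates.

    Assumption: the first spelling seen is kept, and surrounding
    whitespace is trimmed before comparing.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for name in names:
        cleaned = name.strip()
        key = cleaned.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def split_authors(
    pairs: list[tuple[str, list[str]]],
) -> tuple[list[str], list[str]]:
    """Return (authors, supervisors) from (name, roles) pairs."""
    # Rule 1: every name counts as an author, regardless of roles.
    authors = dedupe_names([name for name, _ in pairs])
    # Rule 2: supervisors carry a "Supervision" role, matched case-insensitively.
    supervisors = dedupe_names(
        [
            name
            for name, roles in pairs
            if any(role.casefold() == "supervision" for role in roles)
        ]
    )
    return authors, supervisors
```

Comparing with casefold() rather than lower() is a design choice that also handles non-ASCII names consistently.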

3.3 Examples

# String format
authors: "Alice Example"

# List of strings
authors: ["Alice Example", "Bob Example"]

# List of dicts
authors:
  - name: "Alice Example"
    roles: ["Supervision", "Conceptualization"]
  - name: "Bob Example"
    roles: ["Validation"]

# Mixed format
authors:
  - "Alice Example"
  - name: "Bob Example"
    roles: ["Supervision"]

4. Construction Process

  1. Parse frontmatter: Extract YAML between --- markers

  2. Process frontmatter fields: Apply the rules in §3 to derive authors and supervisors

  3. Construct object: Create ReadmeExtract with all extracted data
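Step 1 can be sketched with the standard library alone, leaving the actual YAML parsing to whichever parser the project uses. The function name split_frontmatter is hypothetical, and two behaviours are assumptions rather than requirements of this spec: the opening --- must be the very first line, and an unclosed block is treated as no frontmatter.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (frontmatter_yaml, body); frontmatter is "" when absent."""
    lines = text.splitlines(keepends=True)
    # Assumption: frontmatter must start on the very first line.
    if not lines or lines[0].strip() != "---":
        return "", text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            # Everything between the markers is YAML, the rest is the body.
            return "".join(lines[1:index]), "".join(lines[index + 1:])
    # Assumption: no closing marker means the whole text is body.
    return "", text
```

The returned YAML string is what would then be handed to the parser in order to populate raw_frontmatter.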

5. Integration with Readme Model

The Readme model should include:

class Readme(BaseModel):
    # Readme fields (see spec_model_mapping.md or models/readme.py)
    # [...]
    # Extract defined by this spec
    extra: ReadmeExtract  # All interpreted/extracted information from frontmatter

6. Error Handling

  • Invalid YAML: Log error, continue with empty frontmatter

  • Missing fields: Fall back to None, "" or an empty list, whichever the field's type allows

  • Malformed author data: Log warning, skip invalid entries, continue with valid ones
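The "log and continue" behaviour for invalid YAML could look like the sketch below. It is a hedged illustration: load_frontmatter_safely is a hypothetical name, the parse argument stands in for whatever YAML loader the project uses, and treating non-mapping frontmatter as empty is an assumption not stated in this spec.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def load_frontmatter_safely(
    raw: str, parse: Callable[[str], Any]
) -> dict[str, Any]:
    """Parse frontmatter text, falling back to {} on any failure.

    `parse` is a placeholder for the project's YAML loader (assumption).
    """
    try:
        data = parse(raw)
    except Exception as exc:  # parser error types vary, so catch broadly
        logger.error("Invalid frontmatter YAML: %s", exc)
        return {}
    if not isinstance(data, dict):
        # Assumption: empty or non-mapping frontmatter counts as none.
        return {}
    return data
```

Passing the parser in as an argument keeps the sketch free of third-party imports and makes the fallback path easy to test with a stub.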

7. Examples

7.1 Simple README (No Frontmatter)

# My Project

This is the first paragraph of my project.

Extract:

  • All fields: empty defaults ("" or an empty list depending on type, or None if the type allows it)

  • raw_frontmatter: {}

7.2 README with Frontmatter

---
type: ML
priority: 5
authors:
  - name: Alice Example
    roles: [Supervision]
  - Bob Example
---

# Alpha One

## About

This is the main README for Alpha One.

Extract:

  • authors: ["Alice Example", "Bob Example"]

  • supervisors: ["Alice Example"]

  • raw_frontmatter: {"type": "ML", "priority": 5, "authors": [...]}

8. Non-Goals

  • Network I/O

  • File system operations

  • Rendering or formatting

  • Validation beyond basic type checking

  • Content extraction (first paragraph, TODO sections) - see spec_model_mapping.md