20 commits
bc9236f
style: black and flaky8
Anket11 Mar 25, 2026
fc28643
feat(data_access): add base and concrete FAISS index managers for Kno…
Anket11 Mar 31, 2026
d3735f5
feat(config): add per-type similarity thresholds and improve KST prom…
Anket11 Mar 31, 2026
3c7e3ca
feat(services): generalize AlignmentService and wire KST indexes with…
Anket11 Mar 31, 2026
20303b7
test: add ENABLES edge derivation test for KST extraction
Anket11 Mar 31, 2026
9cebbcf
feat(scripts): add pipeline build scripts for Knowledge and Task indexes
Anket11 Mar 31, 2026
4489b19
feat(data): add prebuilt Knowledge and Task taxonomy indexes
Anket11 Mar 31, 2026
50941c4
Merge pull request #407 from LAiSER-Software/development
Anket11 Mar 31, 2026
cef27db
fix: wire source_url into top_k zip and result dataframe
Anket11 Mar 31, 2026
fcb79d2
v0.4.1
Anket11 Mar 31, 2026
1b803e6
chore: Index update and version bump
Anket11 Mar 31, 2026
53a92e1
Add task statements workbook
Anket11 May 6, 2026
1655073
feat(gemini): update Gemini integration
Anket11 May 6, 2026
52044af
feat(kst): add concept extraction API
Anket11 May 6, 2026
e821ad5
feat(knowledge): rebuild knowledge taxonomy index
Anket11 May 6, 2026
0974726
feat(task): rebuild task taxonomy index
Anket11 May 6, 2026
1709ed3
test(kst): update extraction coverage
Anket11 May 6, 2026
bb604b9
version bump: v0.5
Anket11 May 7, 2026
48a6a9a
style: fix line formatting
Anket11 May 7, 2026
9a6d317
Merge remote-tracking branch 'origin/main' into feat/v0.5-kst-extraction
Anket11 May 7, 2026
4 changes: 3 additions & 1 deletion .pre-commit-config.yaml
@@ -14,4 +14,6 @@ repos:
     rev: 7.1.1
     hooks:
       - id: flake8
-        args: [--max-line-length=120, --extend-ignore=E203,W503,E266,W291,E501,E402]
+        args:
+          - "--max-line-length=120"
+          - "--extend-ignore=E203,W503,E266,W291,E501,E402"
13 changes: 12 additions & 1 deletion MANIFEST.in
@@ -1 +1,12 @@
-recursive-include laiser/public *
+include laiser/public/faiss_skills.csv
+include laiser/public/knowledge_df.json
+include laiser/public/knowledge_taxonomy.csv
+include laiser/public/knowledge_v05.index
+include laiser/public/skills_df.json
+include laiser/public/skills_v04.index
+include laiser/public/task_taxonomy.csv
+include laiser/public/tasks_df.json
+include laiser/public/tasks_v05.index
+exclude laiser/public/Task Statements.xlsx
+exclude laiser/public/combined.csv
+exclude laiser/public/skill_embeddings.npy
270 changes: 270 additions & 0 deletions docs/v0.5_architecture.md
@@ -0,0 +1,270 @@
---

title: LAiSER v0.5 — Knowledge, Tasks & Graph Architecture

tags: [laiser, architecture, v0.5, graph, knowledge, tasks]

created: 2026-03-25

status: draft

---
# LAiSER v0.5 — Knowledge, Tasks & Graph Architecture

> [!quote] Our Definition
> **Knowledge + Tasks = Skill**
>
> A skill is not standalone. It is the product of knowing something and being able to act on it.

---

## What Completes the Extract Module

LAiSER's extract module currently pulls only **Skills** from raw text. v0.5 extends it to extract all three pillars: Skills, Knowledge, and Tasks.

```mermaid

flowchart TD

A[Raw Text\njob posting · CV · course description] --> B[extract-module]

B --> C[Skills]

B --> D[Knowledge]

B --> E[Tasks]

```

---

## Top Sources — Knowledge

> [!info] Why Academia for Knowledge?
> Knowledge is fundamentally academic. When a job posting says "requires knowledge of machine learning", it maps to a scientific concept, not an HR category. O*NET anchors it to the workforce. OpenAlex and Wikipedia give it depth.


| Rank | Source | Instances | Description Context | HR / Corporate Relevance | URL | Free |
| ---- | ------------------------------- | --------- | -------------------------------------------------------------- | ----------------------------------------------------------------- | ------------------------------------------- | ---- |
| 1 | O*NET Knowledge + Level Anchors | ~3,300 | Name + definition + behavioral anchors at 7 proficiency levels | Very High — directly maps to job design and competency frameworks | onetcenter.org/database.html | Yes |
| 2 | ESCO Knowledge | ~3,000 | Name + definition + alt labels + broader/narrower concepts | High — EU multinational job description design | esco.ec.europa.eu/en/use-esco/download | Yes |
| 3 | OpenAlex Topics | ~65,000 | Name + Wikipedia description + academic field + subfield | Very Low — academic research, not HR | openalex.org | Yes |
| 4 | Wikipedia via Hugging Face | ~6.7M | Full article text, deepest descriptions of any source | Very Low — academic, filterable by category | huggingface.co/datasets/wikimedia/wikipedia | Yes |
| 5 | ERIC Thesaurus | ~11,000 | Name + scope note + used-for + broader terms | Very Low — education research only | eric.ed.gov | Yes |
| 6 | MeSH | ~30,000 | Name + scope note + entry terms + tree hierarchy | Very Low — biomedical only | nlm.nih.gov/mesh/meshhome.html | Yes |

> [!note] Knowledge is ranked by academic depth and breadth, not HR relevance — that is intentional. Knowledge maps to academia. HR relevance is the role of Skills and Tasks.

---

## Top Sources — Task Abilities

| Rank | Source | Instances | Description Context | HR / Corporate Relevance | URL | Free |
| ---- | ------------------------------------ | --------- | --------------------------------------------------------------- | ---------------------------------------------------------- | -------------------------------------- | ---- |
| 1 | O*NET Tasks | ~19,000 | Task statement + occupation name + occupation description | Very High — core US job description and benchmarking tool | onetcenter.org/database.html | Yes |
| 2 | ESCO Occupation Tasks | ~15,000 | Task statement + occupation title + sector | High — EU multinational job design | esco.ec.europa.eu/en/use-esco/download | Yes |
| 3 | SOC Task Statements | ~10,000 | Task statement + occupation title + SOC description | Very High — US labor market standard | bls.gov/soc | Yes |
| 4 | O*NET Detailed Work Activities | ~2,500 | Action-verb task statement + parent work activity category | Very High — precise job descriptions and competency models | onetcenter.org/database.html | Yes |
| 5 | O*NET Abilities + Behavioral Anchors | ~52 | Name + definition + behavioral examples at 7 proficiency levels | High — structured hiring assessments | onetcenter.org/database.html | Yes |

> [!success] Combined Task Coverage
> ~46,500 task entries across all sources (the sum of the instance counts in the table above). Estimated **85–95% coverage** of real-world job market tasks.

---

## The Graph

```mermaid

flowchart LR

K[Knowledge Node] --> S[Skill Node]

T[Task Node] --> S

S --> O[Occupation Node]

T --> O

K --> K2[Knowledge Node]

S --> S2[Skill Node]

```

### Node Types

| Node | Key Properties |
| ---------- | --------------------------------------------------------- |
| Skill | name · description · type · source · embedding |
| Knowledge | name · description · field · subfield · level · embedding |
| Task | name · description · action_verb · complexity · embedding |
| Occupation | name · SOC code · ISCO code · sector · embedding |

### Edge Types

| Edge | Direction | Meaning |
| ----------------- | --------------------- | --------------------------------- |
| `REQUIRED_FOR` | Knowledge → Skill | This knowledge feeds this skill |
| `REQUIRED_FOR` | Task → Skill | This task demonstrates this skill |
| `REQUIRED_BY` | Skill → Occupation | This skill is needed for this job |
| `PERFORMED_IN` | Task → Occupation | This task is done in this job |
| `PREREQUISITE_OF` | Knowledge → Knowledge | Learn this before that |
| `RELATED_TO` | Skill → Skill | These skills co-occur |

### Edge Properties

| Property | Type | Description |
| ------------------- | ------------------------------- | ------------------------------- |
| `weight` | 0.0 – 1.0 | Strength of relationship |
| `proficiency_level` | 1 – 7 | O*NET scale |
| `source` | string | Which dataset derived this edge |
| `confidence` | low · medium · high · very high | Trust signal for builders |
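
The node and edge tables above can be read as a small schema. The following is a minimal sketch of an edge record with range checks, not the package's actual data model: the `Edge` class, its field names `source_node` and `target_node`, and the validation behavior are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Values taken from the Edge Types and Edge Properties tables.
EDGE_TYPES = {
    "REQUIRED_FOR",
    "REQUIRED_BY",
    "PERFORMED_IN",
    "PREREQUISITE_OF",
    "RELATED_TO",
}
CONFIDENCE_LEVELS = ("low", "medium", "high", "very high")


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record mirroring the tables above."""

    edge_type: str
    source_node: str        # assumed field name for the edge's start node
    target_node: str        # assumed field name for the edge's end node
    weight: float           # strength of relationship, 0.0 to 1.0
    proficiency_level: int  # 1 to 7
    source: str             # dataset the edge was derived from
    confidence: str

    def __post_init__(self):
        # Reject values outside the ranges listed in the tables.
        if self.edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.edge_type}")
        if not 0.0 <= self.weight <= 1.0:
            raise ValueError("weight must be between 0.0 and 1.0")
        if not 1 <= self.proficiency_level <= 7:
            raise ValueError("proficiency_level must be between 1 and 7")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence: {self.confidence}")
```

Validating at construction time keeps malformed edges out of the graph regardless of which source pipeline produced them.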

---

## Edge Confidence — Backed by Data

> [!note] How Edges Are Derived
> O*NET and ESCO do not directly say "Knowledge A feeds Skill B". Edges are derived through **co-occurrence within an occupation**:
>
> If `Knowledge(K)` scores high importance for `Occupation(O)`
> AND `Skill(S)` scores high importance for `Occupation(O)`
>
> → Edge: `K ──REQUIRED_FOR──► S` with weight = average of both importance scores, rescaled to 0.0 – 1.0
>
> The rule is run across 1,000 occupations. The more occupations that confirm the same pair, the higher the confidence.
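
The co-occurrence rule above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's pipeline: the function name, the `4.0` cutoff for "high importance", dividing by `5.0` to bring the weight into the 0.0 to 1.0 range, averaging the weight across occupations, and the sample data are all hypothetical.

```python
from collections import defaultdict
from itertools import product

MAX_IMPORTANCE = 5.0   # assumed top of the importance scale
HIGH_IMPORTANCE = 4.0  # hypothetical cutoff for "scores high importance"


def derive_required_for_edges(knowledge_scores, skill_scores, threshold=HIGH_IMPORTANCE):
    """Derive Knowledge -> Skill edges from co-occurrence within occupations.

    Both arguments map occupation -> {name: importance score}.
    Returns {(knowledge, skill): {"weight": float, "occupations": int}}.
    """
    pair_weights = defaultdict(list)
    for occupation, k_scores in knowledge_scores.items():
        s_scores = skill_scores.get(occupation, {})
        important_k = {k: v for k, v in k_scores.items() if v >= threshold}
        important_s = {s: v for s, v in s_scores.items() if v >= threshold}
        for (k, k_imp), (s, s_imp) in product(important_k.items(), important_s.items()):
            # Average the two importance scores, then rescale to 0.0-1.0.
            pair_weights[(k, s)].append((k_imp + s_imp) / 2 / MAX_IMPORTANCE)
    # One edge per pair: mean weight plus the number of confirming occupations.
    return {
        pair: {"weight": round(sum(ws) / len(ws), 2), "occupations": len(ws)}
        for pair, ws in pair_weights.items()
    }


# Hypothetical importance scores for two occupations.
knowledge = {
    "Data Analyst": {"Statistics": 4.5, "History": 2.0},
    "Statistician": {"Statistics": 5.0},
}
skills = {
    "Data Analyst": {"Data Analysis": 4.5},
    "Statistician": {"Data Analysis": 4.0},
}
edges = derive_required_for_edges(knowledge, skills)
```

Here `History` falls below the cutoff and produces no edge, while `Statistics` and `Data Analysis` co-occur in both occupations, so the pair is kept with an occupation count of 2.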


| Edge | Confidence | Derived Instances | Reason |
| ---------------------- | ------------- | --------------------- | ----------------------------------------------------------------------------------------------------- |
| Skill → Occupation | **Very High** | ~100,000+ edges | O*NET Skills.txt + ESCO both directly map this. Two independent sources. No inference needed. |
| Task → Occupation | **Very High** | ~34,000 edges | O*NET Tasks.txt + ESCO both directly map this. Direct data, not derived. |
| Knowledge → Occupation | **Very High** | ~20,000–25,000 edges | O*NET Knowledge.txt directly maps 33 knowledge areas across 1,000 occupations with importance scores. |
| Task → Skill | **High** | ~50,000–100,000 edges | Co-occurrence across 1,000 occupations. Confirmed by both O*NET and ESCO independently. |
| Knowledge → Skill | **High** | ~5,000–10,000 edges | Co-occurrence across 1,000 occupations. Fewer unique knowledge areas (33) limits total edges. |
| Skill → Skill | **High** | ~15,000–50,000 edges | ESCO has explicit broader/narrower relationships. Lightcast has large-scale skill co-occurrence data. |
| Knowledge → Knowledge | **Medium** | ~20,000–50,000 edges | OpenAlex citation networks + Wikipedia category hierarchy. Inferential, not directly sourced. |
| Task → Task | **Medium** | ~2,500–5,000 edges | O*NET DWAs have parent-child structure across ~300 parent activities. Limited data beyond that. |

### Multi-Source Confirmation Rule

> [!tip] Trust Signal for Builders
> - Edge confirmed by **1 source** → confidence: Medium
> - Edge confirmed by **2 sources** → confidence: High
> - Edge confirmed by **3+ sources** → confidence: Very High
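
As a small sketch, the confirmation rule can be written as a lookup on the number of distinct sources. The function name is hypothetical, and raising an error for an edge with no sources is an assumption, since the rule does not cover that case.

```python
def confidence_from_sources(sources):
    """Map the number of distinct confirming sources to a confidence label."""
    count = len(set(sources))  # duplicates of the same source count once
    if count == 0:
        # Assumption: the rule above does not define a label for zero sources.
        raise ValueError("an edge needs at least one confirming source")
    if count == 1:
        return "medium"
    if count == 2:
        return "high"
    return "very high"
```

For example, `confidence_from_sources(["O*NET", "ESCO"])` returns `"high"`.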

### Real Example

```mermaid

flowchart LR

CS["Computer Science\nweight: 0.97"] -->|"REQUIRED_FOR | HIGH"| P[Programming]

M["Mathematics\nweight: 0.91"] -->|"REQUIRED_FOR | HIGH"| P

W["Write and test code\nweight: 0.96"] -->|"REQUIRED_FOR | HIGH"| P

P -->|REQUIRED_BY| SD[Software Developer]

```

> O*NET importance scores for Software Developer:
> - Knowledge: Computer Science → 4.8 / 5.0
> - Knowledge: Mathematics → 4.2 / 5.0
> - Skill: Programming → 4.9 / 5.0
> - Task: Write and test code → 4.7 / 5.0

> ESCO confirms: Programming listed as **essential** skill under Software Developer ✓

---

## Full System Flow

```mermaid

flowchart TD

I["User Input\n5 years Python · building and deploying ML pipelines"] --> E[extract-module]



E --> SK["Skills\nPython · ML Engineering"]

E --> KN["Knowledge\nML Theory · Statistics"]

E --> TA["Tasks\nDeploy ML pipelines · Build ML models"]



SK --> G[Graph Lookup]

KN --> G

TA --> G



G --> M[Matched to taxonomy nodes]

G --> P[Edges reveal implied knowledge]

G --> GA[Gaps surfaced]

G --> OC[Occupations matched]

G --> LP[Learning paths generated]

```

---

## Job Market Coverage

| Type | Sources Used | Estimated Coverage |
| -------------- | ----------------------------------- | ------------------ |
| Skills | Lightcast + ESCO + O*NET | 97 – 99% |
| Knowledge | O*NET + ESCO + OpenAlex + Wikipedia | 70 – 80% |
| Task Abilities | O*NET + ESCO + SOC | 85 – 95% |

---
## What Builders Get

> [!abstract] Builder Capabilities
> Any product built on LAiSER gets access to:

- **Gap Analysis** — given what someone knows and can do, what skills do they have and what are they missing

- **Learning Paths** — to reach Skill X, acquire Knowledge Y+Z and practice Tasks P+Q

- **Job Matching** — trace occupation requirements back to knowledge and task prerequisites

- **Skill Inference** — never seen a skill before? embed it, find nearest neighbors in the graph

- **Workforce Planning** — given a team's knowledge and tasks, map collective skill coverage
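
To make the Learning Paths and Gap Analysis capabilities concrete, here is a minimal sketch that walks `REQUIRED_FOR` edges backwards from a skill. The edge list, node names, and both function names are hypothetical; a real implementation would query the graph store instead of an in-memory list.

```python
# Hypothetical REQUIRED_FOR edges: (source node, source node type, target skill).
REQUIRED_FOR = [
    ("Statistics", "knowledge", "Data Analysis"),
    ("Linear Algebra", "knowledge", "Data Analysis"),
    ("Clean and validate datasets", "task", "Data Analysis"),
]


def learning_path(skill, edges=REQUIRED_FOR):
    """Return the knowledge and tasks with a REQUIRED_FOR edge into `skill`."""
    path = {"knowledge": [], "task": []}
    for source, node_type, target in edges:
        if target == skill:
            path[node_type].append(source)
    return path


def gap_analysis(skill, known, edges=REQUIRED_FOR):
    """Return the prerequisites of `skill` that are missing from `known`."""
    path = learning_path(skill, edges)
    return {
        kind: [name for name in names if name not in known]
        for kind, names in path.items()
    }
```

Calling `gap_analysis("Data Analysis", {"Statistics"})` would list `Linear Algebra` as missing knowledge and `Clean and validate datasets` as a task still to practice.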

---

## Next Steps



- [ ] Finalize source selection for Knowledge index

- [ ] Finalize source selection for Task Abilities index

- [ ] Design data pipeline — download, clean, embed all sources

- [ ] Build graph database schema

- [ ] Derive edges from O*NET and ESCO co-occurrence

- [ ] Validate edge confidence with multi-source confirmation
2 changes: 1 addition & 1 deletion laiser/__init__.py
@@ -4,7 +4,7 @@
 A Python package for extracting and aligning skills from text using AI models.
 """
 
-__version__ = "0.4.0.1"
+__version__ = "0.5"
 
 # Import main classes for easy access
 try: