Graph services

In Renku, the dependencies between research artifacts are recorded in a knowledge graph. Each project’s local knowledge graph is stored in its repository; the graph services build the global knowledge graph. When a project’s repository is pushed to the server, a webhook is triggered, and the changes represented by the commits, together with all of the captured dependencies, are rendered as RDF triples and pushed to the triple store.

The graph services are made up of four micro-services: the webhook-service, triples-generator, token-repository and knowledge-graph. The knowledge graph data is stored in the triple store (currently Apache Jena). The basic architecture is illustrated below.

    strict digraph architecture {
      compound=true;
      newrank=true;

      graph [fontname="Raleway", nodesep="0.8"];
      node [shape="rect", style="filled,rounded", fontname="Raleway"];
      edge [fontname="Raleway"]

      GitLab [fillcolor="lightblue"]
      UI [fillcolor="#f4d142"]
      CLI [fillcolor="#f4d142"]
      WHS [label="Webhook Service" fillcolor="#f4d142"]
      TG [label="Triples Generator" fillcolor="#f4d142"]
      KG [label="Knowledge Graph" fillcolor="#f4d142"]
      Gateway [fillcolor="#f4d142"]
      Jena [label="Apache Jena" fillcolor="lightblue"]
      Log [label="Event Log" fillcolor="#f4d142", shape="parallelogram", width=2.0]
      LogDB [label="Event Log DB" fillcolor="lightblue", shape="parallelogram", width=2.0]

      subgraph cluster_clients {
        label="Clients"
        UI
        CLI
        {rank=same; UI, CLI};
      }

      CLI -> GitLab [label=" git push"]
      WHS -> GitLab [label=" registers webhooks"]
      GitLab -> WHS [label=" sends Push Event\nwith information about pushed commits"]
      WHS -> Log [label=" writes Commit Events"]
      Log -> LogDB [label=" stores Commit Events"]
      TG -> Log [label=" subscribes for Events"]
      Log -> TG [label=" pushes Commit Events"]
      TG -> Jena [label=" generates RDF triples"]
      KG -> Jena [label=" SPARQL query"]
      UI -> Gateway [label=" interacts with Graph Services"]
      Gateway -> WHS [label=" asks to register webhooks,\nchecks Events processing status"]
      Gateway -> KG [label=" queries for metadata"]
    }

Sequence diagrams of the Graph Services APIs and processes

POST <knowledge-graph>/knowledge-graph/graphql

An endpoint that allows performing GraphQL queries on the Knowledge Graph data.

    @startuml
    hide footbox
    skinparam shadowing false

    actor Client
    participant "Knowledge\nGraph" as KnowledgeGraph
    database "RDF Store" as Jena

    Client->KnowledgeGraph: POST /knowledge-graph/graphql
    activate KnowledgeGraph
    KnowledgeGraph->Jena: execute SPARQL query
    activate Jena
    Jena->KnowledgeGraph: query results
    deactivate Jena
    KnowledgeGraph->Client: OK (200) with the requested data
    deactivate KnowledgeGraph

    @enduml
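
As an illustration of calling this endpoint, the sketch below builds (but does not send) a POST request. It assumes the endpoint accepts the conventional GraphQL-over-HTTP JSON body with `query` and `variables` keys; the base URL and the helper name `build_graphql_request` are hypothetical, and the query is the generic GraphQL introspection query rather than anything specific to the Knowledge Graph schema.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical base URL of a knowledge-graph deployment (assumption).
KG_BASE_URL = "https://renku.example.org"


def build_graphql_request(query: str, variables: Optional[dict] = None) -> urllib.request.Request:
    """Build, without sending, a POST request for the GraphQL endpoint."""
    # Assumed body shape: the usual {"query": ..., "variables": ...} JSON object.
    payload = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{KG_BASE_URL}/knowledge-graph/graphql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Generic introspection query defined by the GraphQL specification.
request = build_graphql_request("{ __schema { queryType { name } } }")
```

Passing `request` to `urllib.request.urlopen` would send it to a real deployment; on success the service is described above as answering OK (200) with the requested data.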

POST <webhook-service>/projects/:id/webhooks

An endpoint to create a Graph Services webhook for a project in GitLab.

    @startuml
    hide footbox
    skinparam shadowing false

    actor Client
    participant "Webhook\nService" as Webhook
    participant GitLab
    participant "Tokens\nRepository" as Tokens
    participant "Event\nLog" as Log

    == Webhook creation - hook already exists ==
    Client->Webhook: POST /projects/:id/webhooks
    activate Webhook
    ref over Webhook, Log: Hook Validation as described for\n**POST <webhook-service>/projects/:id/webhooks/validation**\nreturning OK
    Webhook->Client: OK (200)
    deactivate Webhook

    == Webhook creation - hook does not exist ==
    Client->Webhook: POST /projects/:id/webhooks
    activate Webhook
    ref over Webhook, Log: Hook Validation as described for\n**POST <webhook-service>/projects/:id/webhooks/validation**\nreturning NOT_FOUND
    Webhook->GitLab: GET /api/v4/projects/:id
    activate GitLab
    GitLab->Webhook: OK (200) with project info
    deactivate GitLab
    Webhook->GitLab: POST /api/v4/projects/:id/hooks
    activate GitLab
    GitLab->Webhook: OK (200)
    deactivate GitLab
    Webhook->Tokens: PUT /projects/:id/tokens
    activate Tokens
    Tokens->Webhook: NO_CONTENT (204)
    deactivate Tokens
    group Commits history loading
    Webhook->GitLab: GET /api/v4/projects/:id/repository/commits
    activate GitLab
    GitLab->Webhook: OK (200) with the latest Commit
    deactivate GitLab
    ref over Webhook, Log: Latest Commit to Event Log Events as described for\n**POST <webhook-service>/webhooks/events**
    end
    Webhook->Client: CREATED (201)
    deactivate Webhook

    @enduml
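
The two branches of the diagram can be summarised as a small decision function. This is only a sketch of the control flow: the function and parameter names are hypothetical, and the collaborators (GitLab, Tokens Repository, Event Log) are passed in as plain callables instead of real HTTP clients.

```python
from typing import Callable, List


def create_webhook(
    project_id: int,
    validate_hook: Callable[[int], int],
    create_gitlab_hook: Callable[[int], None],
    store_token: Callable[[int], None],
    load_latest_commit: Callable[[int], None],
) -> int:
    """Return the HTTP status the Webhook Service would send to the client."""
    if validate_hook(project_id) == 200:
        return 200  # hook already exists, nothing to create
    create_gitlab_hook(project_id)   # GET project info, then POST /api/v4/projects/:id/hooks
    store_token(project_id)          # PUT /projects/:id/tokens
    load_latest_commit(project_id)   # commits history loading
    return 201


# Usage: record which steps run when validation reports NOT_FOUND.
calls: List[str] = []
status = create_webhook(
    42,
    validate_hook=lambda pid: 404,
    create_gitlab_hook=lambda pid: calls.append("hook"),
    store_token=lambda pid: calls.append("token"),
    load_latest_commit=lambda pid: calls.append("commits"),
)
```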

POST <webhook-service>/projects/:id/webhooks/validation

An endpoint to validate a project’s webhook. It checks whether a relevant Graph Services webhook exists on the repository in GitLab and whether Graph Services have an Access Token associated with the project, which they use to look up project-specific information in GitLab.

    @startuml
    hide footbox
    skinparam shadowing false

    actor Client
    participant "Webhook\nService" as Webhook
    participant GitLab
    participant "Tokens\nRepository" as Tokens

    == Webhook validation - valid Access Token ==
    Client->Webhook: POST /projects/:id/webhooks/validation
    activate Webhook
    Webhook->GitLab: GET /api/v4/projects/:id
    activate GitLab
    GitLab->Webhook: OK (200) with project visibility
    deactivate GitLab
    Webhook->GitLab: GET /api/v4/projects/:id/hooks
    activate GitLab
    GitLab->Webhook: OK (200) with project hooks
    deactivate GitLab
    alt A relevant Graph Services hook exists
    Webhook->Tokens: PUT /projects/:id/tokens
    activate Tokens
    Tokens->Webhook: NO_CONTENT (204)
    deactivate Tokens
    end
    Webhook->Client: OK (200) if hook exists, NOT_FOUND (404) if it doesn't
    deactivate Webhook

    == Webhook validation - invalid Access Token ==
    Client->Webhook: POST /projects/:id/webhooks/validation
    activate Webhook
    Webhook->GitLab: GET /api/v4/projects/:id
    activate GitLab
    GitLab->Webhook: UNAUTHORIZED (401)
    deactivate GitLab
    Webhook->Tokens: GET /projects/:id/tokens
    activate Tokens
    Tokens->Webhook: OK (200) with project token or NOT_FOUND (404)
    deactivate Tokens
    Webhook->GitLab: GET /api/v4/projects/:id
    activate GitLab
    GitLab->Webhook: OK (200) with project visibility
    deactivate GitLab
    Webhook->GitLab: GET /api/v4/projects/:id/hooks
    activate GitLab
    GitLab->Webhook: OK (200) with project hooks
    deactivate GitLab
    alt A relevant Graph Services hook exists
    Webhook->Tokens: PUT /projects/:id/tokens
    activate Tokens
    Tokens->Webhook: NO_CONTENT (204)
    deactivate Tokens
    else A relevant Graph Services hook does not exist
    Webhook->Tokens: DELETE /projects/:id/tokens
    activate Tokens
    Tokens->Webhook: NO_CONTENT (204)
    deactivate Tokens
    end
    Webhook->Client: OK (200) if hook exists, NOT_FOUND (404) if it doesn't
    deactivate Webhook

    @enduml
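
The outcome of validation depends on two facts: whether the caller's Access Token is accepted by GitLab and whether a relevant hook exists. The sketch below models only the paths drawn in the diagram and assumes a stored token is available when the fallback is needed; the function name and the returned list of Tokens Repository calls are illustrative, not part of the service.

```python
from typing import List, Tuple


def validate_webhook(client_token_valid: bool, hook_exists: bool) -> Tuple[int, List[str]]:
    """Return (status for the client, Tokens Repository calls made)."""
    token_calls: List[str] = []
    if not client_token_valid:
        # GitLab answered UNAUTHORIZED (401): fall back to the stored project token.
        token_calls.append("GET")
    if hook_exists:
        token_calls.append("PUT")      # keep the token associated with the project
        return 200, token_calls
    if not client_token_valid:
        token_calls.append("DELETE")   # no hook, so the stored token is removed
    return 404, token_calls
```

For example, `validate_webhook(False, False)` yields `(404, ["GET", "DELETE"])`, matching the last branch of the diagram.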

POST <webhook-service>/webhooks/events

An endpoint to send Push Events containing information about commits pushed to GitLab.

    @startuml
    hide footbox
    skinparam shadowing false

    actor Client
    participant "Webhook\nService" as Webhook
    participant GitLab
    participant "Tokens\nRepository" as Tokens
    participant "Event\nLog" as Log

    == Push Event ==
    Client->Webhook: POST /webhooks/events
    activate Webhook
    Webhook->Tokens: GET /projects/:id/tokens
    activate Tokens
    Tokens->Webhook: OK (200) with Access Token or NOT_FOUND (404)
    deactivate Tokens
    group Finding unprocessed commits
    Webhook->GitLab: GET /api/v4/projects/:id/repository/commits/:id
    activate GitLab
    GitLab->Webhook: OK (200) with commit info
    deactivate GitLab
    Webhook->Log: POST /events to store Commit Event
    activate Log
    Log->Webhook: CREATED (201) or OK (200)
    deactivate Log
    Webhook->Webhook: Repeat the process on CREATED\nor terminate on OK
    end
    Webhook->Client: OK (200)
    deactivate Webhook

    @enduml
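
The loop in the diagram stops as soon as the Event Log reports a commit it already knows. A minimal sketch of that idea follows, under the assumption that "repeat the process" means continuing with the parent commits; the diagram does not state this explicitly. The Event Log is modelled as an in-memory set and the commit graph as a dictionary, both hypothetical stand-ins.

```python
from collections import deque
from typing import Dict, List, Set


def store_unprocessed_commits(
    head: str, parents_of: Dict[str, List[str]], event_log: Set[str]
) -> List[str]:
    """Walk from the pushed commit towards its ancestors, storing unknown commits."""
    created: List[str] = []
    queue = deque([head])
    while queue:
        commit_id = queue.popleft()
        if commit_id in event_log:
            continue                     # Event Log would answer OK (200): stop here
        event_log.add(commit_id)         # Event Log would answer CREATED (201)
        created.append(commit_id)
        queue.extend(parents_of.get(commit_id, []))  # assumed: continue with parents
    return created


# Usage: c1 is already known, so only c3 and c2 become new Commit Events.
parents = {"c3": ["c2"], "c2": ["c1"], "c1": []}
known = {"c1"}
new_events = store_unprocessed_commits("c3", parents, known)
```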

GET <webhook-service>/projects/:id/events/status

An endpoint that returns information about the processing progress of events for a specific project.

    @startuml
    hide footbox
    skinparam shadowing false

    actor Client
    participant "Webhook\nService" as Webhook
    participant GitLab
    participant "Event\nLog" as Log

    == Events processing status - events being processed now ==
    Client->Webhook: GET /projects/:id/events/status
    activate Webhook
    ref over Webhook, Log: Hook Validation as described for\n**POST <webhook-service>/projects/:id/webhooks/validation**\nreturning OK
    Webhook->Log: GET /processing-status?project-id=:id
    activate Log
    Log->Webhook: OK (200) with done, total and progress\nof events in the last events batch
    deactivate Log
    Webhook->Client: OK (200) with processing progress info
    deactivate Webhook

    == Events processing status - all events from the last events batch processed ==
    Client->Webhook: GET /projects/:id/events/status
    activate Webhook
    ref over Webhook, Log: Hook Validation as described for\n**POST <webhook-service>/projects/:id/webhooks/validation**\nreturning OK
    Webhook->Log: GET /processing-status?project-id=:id
    activate Log
    Log->Webhook: OK (200) with done = total (all events from the last events batch)
    deactivate Log
    Webhook->Client: OK (200) with progress = 100%
    deactivate Webhook

    == Events processing status - no events for a given project ==
    Client->Webhook: GET /projects/:id/events/status
    activate Webhook
    ref over Webhook, Log: Hook Validation as described for\n**POST <webhook-service>/projects/:id/webhooks/validation**\nreturning OK
    Webhook->Log: GET /processing-status?project-id=:id
    activate Log
    Log->Webhook: NOT_FOUND (404)
    deactivate Log
    Webhook->Client: OK (200) with done = total = 0
    deactivate Webhook

    == Events processing status - no Graph Services webhook for a given project ==
    Client->Webhook: GET /projects/:id/events/status
    activate Webhook
    ref over Webhook, Log: Hook Validation as described for\n**POST <webhook-service>/projects/:id/webhooks/validation**\nreturning NOT_FOUND
    Webhook->Client: NOT_FOUND (404)
    deactivate Webhook

    @enduml
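
The four cases above reduce to a mapping from the validation outcome and the Event Log response to the response sent to the client. In the sketch below the dictionary keys and the percentage formula are assumptions for illustration; the diagram only says that the response carries done, total and progress.

```python
from typing import Optional, Tuple


def events_status(
    hook_status: int, log_response: Optional[Tuple[int, int]]
) -> Tuple[int, Optional[dict]]:
    """Map validation and Event Log results to (status, body) for the client.

    log_response is None when the Event Log answers NOT_FOUND (404),
    otherwise a (done, total) pair with total > 0.
    """
    if hook_status == 404:
        return 404, None                          # no Graph Services webhook
    if log_response is None:
        return 200, {"done": 0, "total": 0}       # no events for the project
    done, total = log_response
    # Assumed formula: progress as a percentage of processed events.
    return 200, {"done": done, "total": total, "progress": 100.0 * done / total}
```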

Subscription to unprocessed Commit Events

A process initiated and maintained by Triples Generator instances so that the Event Log can send them Events requiring generation of triples.

    @startuml
    hide footbox
    skinparam shadowing false

    participant "Triples\nGenerator" as TG
    participant "Event\nLog" as EL

    == Subscription for Commit Events with status NEW or RECOVERABLE_FAILURE in the Event Log ==
    TG->EL: POST /subscriptions
    activate EL
    EL->TG: ACCEPTED (202)
    deactivate EL

    @enduml

Commit Events to RDF Triples

A process responsible for translating unprocessed Commit Events from the Event Log into RDF Triples in the RDF Store. It runs continuously: the Event Log takes unprocessed Commit Events from its database and sends them to the Triples Generator.

    @startuml
    hide footbox
    skinparam shadowing false

    database "Event\nLog DB" as ELDB
    participant "Event\nLog" as EL
    participant "Triples\nGenerator" as TG
    participant "Tokens\nRepository" as TR
    database "RDF Store" as Jena

    == Commit Event to RDF Triples ==
    EL->ELDB: pops a Commit Event having\nstatus NEW or RECOVERABLE_FAILURE\nand marks it as PROCESSING
    activate EL
    EL->TG: POST /events
    activate TG
    TG->EL: ACCEPTED (202)\nor TOO_MANY_REQUESTS (429)
    deactivate EL
    TG->TR: GET /projects/:id/tokens
    activate TR
    TR->TG: OK (200) with the Access Token\nor NOT_FOUND (404)
    deactivate TR
    TG->TG: Run '//renku log//' to create RDF Triples
    TG->TG: Parse RDF Triples
    TG->TG: Curate RDF Triples
    TG->Jena: Store RDF Triples
    TG->EL: PATCH /events/:event-id/:project-id\nto change Event's status to TRIPLES_STORE,\nRECOVERABLE_FAILURE or NON_RECOVERABLE_FAILURE
    deactivate TG

    @enduml
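
The status changes named in the diagram form a small state machine. The sketch below encodes only the transitions shown here; the status names come from the diagram, while the `transition` helper and the table of allowed moves are illustrative and not the Event Log's actual implementation.

```python
from enum import Enum


class EventStatus(Enum):
    NEW = "NEW"
    PROCESSING = "PROCESSING"
    TRIPLES_STORE = "TRIPLES_STORE"
    RECOVERABLE_FAILURE = "RECOVERABLE_FAILURE"
    NON_RECOVERABLE_FAILURE = "NON_RECOVERABLE_FAILURE"


# Transitions drawn in the diagram: pick-up by the Event Log, then the PATCH result.
ALLOWED = {
    EventStatus.NEW: {EventStatus.PROCESSING},
    EventStatus.RECOVERABLE_FAILURE: {EventStatus.PROCESSING},
    EventStatus.PROCESSING: {
        EventStatus.TRIPLES_STORE,
        EventStatus.RECOVERABLE_FAILURE,
        EventStatus.NON_RECOVERABLE_FAILURE,
    },
}


def transition(current: EventStatus, target: EventStatus) -> EventStatus:
    """Return the new status, or raise if the diagram does not show this move."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"{current.value} -> {target.value} is not shown in the diagram")
    return target
```

An event that ends in RECOVERABLE_FAILURE can be picked up again, which is why that status appears both as a result and as a starting point.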

Missed commits synchronization job

A scheduled job which synchronizes state between the Event Log and GitLab and generates Commit Events missing from the Event Log. It runs periodically with a configured interval.

    @startuml
    hide footbox
    skinparam shadowing false

    participant "Webhook\nService" as Webhook
    participant GitLab
    participant "Tokens\nRepository" as Tokens
    participant "Event\nLog" as Log
    participant "Triples\nGenerator" as Triples
    database "RDF Store" as Jena

    == Missed Events Synchronisation Job ==
    Webhook->Webhook: Trigger Events Synchronisation process
    activate Webhook
    Webhook->Log: GET /events?latest_per_project=true\nto find the latest Events of all the projects
    activate Log
    Log->Webhook: OK (200) with a list of the latest Events of all the projects
    deactivate Log
    group Repeat for all the found projects
    Webhook->Tokens: GET /projects/:id/tokens
    activate Tokens
    Tokens->Webhook: OK (200) with Access Token or NOT_FOUND (404)
    deactivate Tokens
    Webhook->GitLab: GET /api/v4/projects/:id/repository/commits
    activate GitLab
    GitLab->Webhook: OK (200) with the latest Commit
    deactivate GitLab
    Webhook->GitLab: GET /api/v4/projects/:id
    activate GitLab
    GitLab->Webhook: OK (200) with the Project Info
    deactivate GitLab
    ref over Webhook, Log: Create missing Commit Events and store them in the Event Log as in the\n**POST <webhook-service>/webhooks/events**
    end
    deactivate Webhook
    ref over Log, Jena: Commit Event to RDF Triples

    @enduml
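
One way to picture the job is as a comparison, per project, between the latest commit known to the Event Log and the latest commit in GitLab. The sketch below assumes that a project needs synchronisation exactly when those two commit ids differ; this rule and all names are assumptions, since the diagram only shows which requests are made.

```python
from typing import Dict, List


def projects_needing_sync(
    latest_events: Dict[int, str], latest_gitlab_commits: Dict[int, str]
) -> List[int]:
    """Return ids of projects whose latest GitLab commit is not in the Event Log.

    Both arguments map a project id to a commit id.
    """
    out_of_sync = []
    for project_id, event_commit in latest_events.items():
        gitlab_commit = latest_gitlab_commits.get(project_id)
        # Skip projects for which no commit could be fetched from GitLab.
        if gitlab_commit is not None and gitlab_commit != event_commit:
            out_of_sync.append(project_id)
    return sorted(out_of_sync)


# Usage: only project 2 has a commit in GitLab that the Event Log has not seen.
event_log_state = {1: "a1", 2: "b7", 3: "c2"}
gitlab_state = {1: "a1", 2: "b9", 3: "c2"}
to_sync = projects_needing_sync(event_log_state, gitlab_state)
```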

Knowledge Graph re-provisioning process

A process executed on Triples Generator start-up that checks if triples in the RDF Store were generated with the version of renku-python currently set in the Triples Generator.

    @startuml
    hide footbox
    skinparam shadowing false

    participant "Triples\nGenerator" as TG
    participant "Event\nLog" as EL
    database "RDF Store" as Jena

    == Knowledge Graph re-provisioning process - triples generated with the current version of renku-python ==
    TG->TG: trigger the re-provisioning process
    activate TG
    TG->Jena: queries for version of renku-python used to generate triples
    TG->TG: versions match so nothing to be done
    deactivate TG

    == Knowledge Graph re-provisioning process - triples generated with an older version of renku-python ==
    TG->TG: trigger the re-provisioning process
    activate TG
    TG->Jena: queries for version of renku-python used to generate triples
    TG->TG: versions do not match
    TG->Jena: remove all the triples from the RDF store
    TG->EL: PATCH /events to trigger process of scheduling events for re-processing
    activate EL
    EL->TG: ACCEPTED (202)
    deactivate EL
    deactivate TG

    @enduml
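
The start-up check amounts to a version comparison followed by two side effects. A minimal sketch, with hypothetical function and parameter names and with the RDF Store and Event Log calls replaced by plain callables:

```python
from typing import Callable, List


def reprovision_if_needed(
    triples_version: str,
    configured_version: str,
    clear_store: Callable[[], None],
    reschedule_events: Callable[[], None],
) -> bool:
    """Return True if re-provisioning was triggered."""
    if triples_version == configured_version:
        return False            # versions match, nothing to be done
    clear_store()               # remove all the triples from the RDF store
    reschedule_events()         # PATCH /events on the Event Log
    return True


# Usage: version strings are made-up examples.
steps: List[str] = []
triggered = reprovision_if_needed(
    "0.9.0",
    "0.10.0",
    clear_store=lambda: steps.append("clear"),
    reschedule_events=lambda: steps.append("reschedule"),
)
```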