Architecture
WebhookDB's codebase is separated into a multilayered architecture. This document describes the layers from the ground up.
Database Models
Like all web applications, WebhookDB stores information in a database. The information is organized into conceptual models, most of which are pulled directly from GitHub. These models are defined using the SQLAlchemy ORM, and they are located in the models directory of the project.
Many of these models inherit from ReplicationTimestampMixin, which automatically adds two database columns: last_replicated_via_webhook_at and last_replicated_via_api_at. These columns allow future database queries to determine how stale the data is. There is also a virtual property, last_replicated_at, which returns the more recent of the two columns.
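A minimal sketch of the behavior the mixin provides. In the real project these are SQLAlchemy DateTime columns on the model; here the same attribute names are shown as plain Python fields, with only the last_replicated_at logic spelled out:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ReplicationTimestampMixin:
    """Plain-Python sketch; the real mixin defines these two fields as
    SQLAlchemy DateTime columns."""
    last_replicated_via_webhook_at: Optional[datetime] = None
    last_replicated_via_api_at: Optional[datetime] = None

    @property
    def last_replicated_at(self) -> Optional[datetime]:
        """Return the more recent of the two replication timestamps."""
        times = [t for t in (self.last_replicated_via_webhook_at,
                             self.last_replicated_via_api_at)
                 if t is not None]
        return max(times) if times else None
```

Either column may be empty (a model might only ever have been replicated one way), so the property has to tolerate missing values rather than blindly comparing the two.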
Data Processing
The next layer is the data processing layer, which is stored in the process directory of the project. This layer consists of functions that accept the parsed JSON output of a GitHub API response and update the database to reflect the information provided in that parsed JSON. Each data model has its own data processing function: the User model has a corresponding process_user() function, for example, and the PullRequest model has a corresponding process_pull_request() function.
API responses often include nested data: for example, if you request information about a pull request from GitHub's pull request API, the response will include detailed user information about the author of the pull request, even though that information belongs in the User model, not the PullRequest model. Each data processing function will only process the data for the model that it is named for, but it will delegate nested data to the data processing function for that nested data type. This means that process_pull_request() calls process_user(), for example.
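The delegation pattern can be sketched as follows. This is illustrative only: the real functions write to the database via SQLAlchemy, whereas these stand-ins just build dictionaries, and the field handling is reduced to a couple of keys:

```python
def process_user(user_data):
    """Process parsed GitHub JSON for a single user (sketch: the real
    function updates a User row in the database)."""
    return {"id": user_data["id"], "login": user_data["login"]}

def process_pull_request(pr_data):
    """Process parsed GitHub JSON for a single pull request, delegating
    the nested "user" object to process_user()."""
    author = process_user(pr_data["user"])  # delegate nested data
    return {
        "number": pr_data["number"],
        "title": pr_data["title"],
        "author_id": author["id"],
    }
```

The key point is that process_pull_request() never duplicates the user-handling logic; nested data always flows through the function that owns that model.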
It’s important to note that functions in the data processing layer do not know where the data came from, and for the most part, they don’t care. The data might come from an API response, or from a webhook notification. It might be top-level, or it might be some nested data that a different data processing function passed to it. These functions never seek out data on their own, but instead they are called by functions that retrieve the data. This means that functions in the data processing layer never make HTTP requests, although they can and do make database queries.
Celery Tasks
The next layer is the Celery tasks, which are stored in the tasks directory. This layer makes HTTP requests to GitHub's API and passes the results of those requests on to the data processing layer. HTTP requests can be slow, and they can fail for any number of reasons (networking problems, problems on GitHub's end, rate limiting, etc.), so we use the Celery task queue to make these tasks more robust against failure.
Fetching data for an individual model, such as a single user or a single pull request, is relatively straightforward, and is handled by the "sync" task for the data model. For example, webhookdb.tasks.user.sync_user() will fetch data for an individual user, and webhookdb.tasks.pull_request.sync_pull_request() will fetch data for an individual pull request.
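The shape of a per-model sync task is simple enough to sketch in plain Python. The helper names here are hypothetical; in the real project this is a Celery task (so a failed request can be retried), but the structure is the same: one HTTP request, then hand the parsed JSON to the data processing layer:

```python
def process_user(user_data):
    """Stand-in for the data processing layer."""
    return {"id": user_data["id"], "login": user_data["login"]}

def sync_user(fetch_user_json, login):
    """Sketch of a per-model "sync" task.  fetch_user_json is a
    hypothetical stand-in for the single HTTP request to GitHub's
    user API; the result goes straight to the processing layer."""
    data = fetch_user_json(login)   # the one HTTP request
    return process_user(data)       # delegate to the processing layer
```

This division of labor is why the processing layer never makes HTTP requests: fetching lives in the task, and the task does nothing with the data except pass it along.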
Fetching data for a group of models, such as all pull requests in a repository, is much more complicated. GitHub's API responses are paginated, so it's natural to work on a per-page basis. For each data model, there is a "spawn page tasks" task, which makes a single API call to determine how many pages there are in the response. Based on that information, it calls the "sync page" task as many times as necessary: that task makes a single HTTP request to retrieve the indicated page of the API response, and calls the data processing functions for each item in the page. (All of the "sync page" tasks can be processed in parallel with each other.) Once all of the "sync page" tasks have completed, a "scanned" task is called, which handles any cleanup work necessary to indicate that the group of models is done being scanned. For example, to fetch data for all pull requests in a repository, the relevant tasks are webhookdb.tasks.pull_request.spawn_page_tasks_for_pull_requests(), webhookdb.tasks.pull_request.sync_page_of_pull_requests(), and webhookdb.tasks.pull_request.pull_requests_scanned().
Note that this uses Celery’s chord workflow, and it is subject to all of the performance issues of that workflow.
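The control flow of that chord can be sketched in plain Python. In the real project the page tasks form the chord header and run in parallel as Celery tasks, with the "scanned" task as the chord callback; here everything runs sequentially, and the page size and fetch_page helper are assumptions for illustration:

```python
import math

PER_PAGE = 100  # assumed page size, for illustration only

def spawn_page_tasks_for_pull_requests(fetch_page, total_count):
    """One API call determines the page count (represented here by
    total_count); then every "sync page" step runs, and the "scanned"
    callback runs once they have all finished."""
    num_pages = math.ceil(total_count / PER_PAGE)
    page_results = [sync_page_of_pull_requests(fetch_page, page)
                    for page in range(1, num_pages + 1)]
    return pull_requests_scanned(page_results)

def sync_page_of_pull_requests(fetch_page, page):
    """One HTTP request for one page; process each item on the page."""
    return [item["number"] for item in fetch_page(page)]

def pull_requests_scanned(page_results):
    """Cleanup once every page task has completed."""
    return sorted(n for page in page_results for n in page)
```

Because each page task touches a disjoint slice of the data, the parallel Celery version produces the same end state as this sequential sketch.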
Replication HTTP endpoints
The replication layer is stored in the replication directory, and it consists of a Flask blueprint designed to be used by the webhook system on GitHub. Once your repository on GitHub has its replication webhooks set up properly, GitHub will make an HTTP request to this endpoint every time an event happens on GitHub. The replication endpoint will pass the data in that request to the data processing layer, and will queue Celery tasks to update other information if necessary. (For example, when a pull request is updated, the pull request's files must be rescanned, so the replication endpoint will queue the webhookdb.tasks.pull_request_file.spawn_page_tasks_for_pull_request_files() task.) This layer also handles the ping event that GitHub sends to all webhook endpoints as a test.
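The endpoint's dispatch logic can be sketched as a plain function. The names here are hypothetical: in Flask, event_type would come from the X-GitHub-Event request header and payload from the parsed JSON body, while queue_task stands in for enqueuing a Celery task:

```python
def handle_github_webhook(event_type, payload, queue_task):
    """Sketch of the replication endpoint's dispatch (illustrative
    names only)."""
    if event_type == "ping":
        return "pong"  # GitHub sends "ping" to test every new webhook
    if event_type == "pull_request":
        number = payload["pull_request"]["number"]
        # Hand the payload to the data processing layer (elided here),
        # then queue follow-up work, e.g. rescanning the PR's files:
        queue_task("spawn_page_tasks_for_pull_request_files", number)
        return "replicated pull_request #%d" % number
    return "ignored"
```

The handler itself stays fast: anything that requires further HTTP requests to GitHub is queued rather than done inline, so GitHub's webhook delivery never waits on a slow API call.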
Load HTTP endpoints
Sometimes, users want to tell WebhookDB that it should load data from GitHub directly, rather than waiting for that data to replicate to WebhookDB via webhooks. The load layer is stored in the load directory, and it consists of a Flask blueprint that is designed to mirror the GitHub API fairly closely. When a user sends a POST request to one of these endpoints, WebhookDB will queue a Celery task to load the requested data from the GitHub API.
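A handler in this blueprint might look like the following sketch (the function and route names are hypothetical; queue_task again stands in for enqueuing a Celery task):

```python
def load_pull_requests(owner, repo, queue_task):
    """Sketch of a load endpoint handler: a POST only queues the Celery
    task that will fetch the data, so the client gets an immediate
    202 Accepted instead of waiting for GitHub."""
    task_id = queue_task("spawn_page_tasks_for_pull_requests", owner, repo)
    return {"status": 202, "task_id": task_id}
```

Returning 202 rather than blocking keeps the endpoint responsive; the actual fetching happens asynchronously in the Celery layer described above.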
User Interface
The user interface is stored in the ui directory, and it consists of a Flask blueprint whose pages return HTML, rather than a JSON API.