GitHub OSS Governance File Dataset


# Description

Open-source software (OSS) has become a valuable resource in both industry and academia over the last few decades. Despite the innovative structures they develop to support themselves, OSS projects and their communities have complex needs and face risks such as abandonment. To manage internal social dynamics and community evolution, OSS developer communities have started relying on written governance documents that assign roles and responsibilities to different community actors.
To facilitate the study of the impact and effectiveness of formal governance documents on OSS projects and communities, we present a longitudinal dataset of 710 GitHub-hosted OSS projects that contain a GOVERNANCE.MD file. For each project, the dataset includes all commits made to the repository, all issues and comments created on GitHub, and all revisions made to the governance file. We hope its availability will foster more research on how OSS communities govern their projects and on the impact of governance files on those communities.

# Findings

We present the GitHub Open-Source Software governance documentation dataset. It includes governance files, commit histories, and issues for 710 OSS projects hosted on GitHub. To facilitate longitudinal studies, we provide separate tables that capture the commit history of each GOVERNANCE.MD file and the detailed changes made to it in each commit, at line-level granularity. To the best of our knowledge, this is the first governance-documentation-oriented dataset of GitHub-hosted OSS projects presented to the empirical software engineering community. In the paper, we present the details of our dataset, scraping methodology, and storage, followed by two preliminary examples of research studies that can benefit from this data. Our dataset, along with the scripts we used to scrape it, is available at Zenodo: https://doi.org/10.5281/zenodo.7530768.
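
To give a feel for what line-level changes between two revisions of a governance file look like, here is a minimal sketch using Python's standard `difflib`. It is an illustration only and not the scraping scripts published on Zenodo; the function name `line_level_changes`, the tuple layout, and the sample texts are assumptions made for this example and do not reflect the dataset's actual table schema.

```python
import difflib


def line_level_changes(old_text, new_text):
    """Compare two revisions of a file and list the lines removed or added.

    Hypothetical helper: returns (operation, line_number, text) tuples, where
    line_number is 1-based and refers to the old revision for "removed"
    entries and to the new revision for "added" entries.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)

    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" block is reported as removals followed by additions.
        if tag in ("delete", "replace"):
            for k in range(i1, i2):
                changes.append(("removed", k + 1, old_lines[k]))
        if tag in ("insert", "replace"):
            for k in range(j1, j2):
                changes.append(("added", k + 1, new_lines[k]))
    return changes


if __name__ == "__main__":
    # Made-up revisions of a governance file, for illustration only.
    before = "# Governance\nMaintainers merge PRs.\n"
    after = "# Governance\nMaintainers review and merge PRs.\nContributors open issues.\n"
    for change in line_level_changes(before, after):
        print(change)
```

Applying such a comparison to each pair of consecutive revisions yields a per-commit record of changed lines, which is the kind of granularity the line-level table is meant to support.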

# Paper

The paper can be found here.