Implementation:Recommenders team Recommenders MicrosoftAcademicGraph
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Data Loading, Academic Datasets |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Provides the MicrosoftAcademicGraph class for reading Microsoft Academic Graph (MAG) data streams into pandas DataFrames.
Description
The MicrosoftAcademicGraph class defines a schema dictionary that maps more than 20 MAG entity types (Papers, Authors, Affiliations, Journals, ConferenceSeries, ConferenceInstances, FieldsOfStudy, PaperReferences, PaperRecommendations, and more) to their column names and data types. Each stream definition is a list of "FieldName:Type" strings. Supported types are int, uint, long, ulong, float, string, and DateTime, and an optional ? marker denotes a nullable field.
The class provides methods to resolve a stream's file path from a root directory, split the schema strings into column names and pandas-compatible dtypes, parse dates with a custom format (%m/%d/%Y %H:%M:%S %p), and load tab-separated MAG files into typed pandas DataFrames via pd.read_csv. A datatypedict maps MAG types to pandas/numpy equivalents, including pd.Int32Dtype(), pd.Int64Dtype(), np.float32, and np.string_.
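The schema handling described above can be illustrated with a minimal sketch. It is not the class's own code: the helper name parse_schema, the exact MAG-to-pandas mapping, and the sample header entries are assumptions made for illustration, and str stands in for np.string_.

```python
import numpy as np
import pandas as pd

# Assumed mapping from MAG type names to pandas/numpy dtypes.
# Which MAG type maps to which dtype is a guess, not taken from the source file.
# DateTime columns are read as text and parsed in a separate step.
MAG_TO_PANDAS = {
    "int": pd.Int32Dtype(),
    "uint": pd.Int64Dtype(),
    "long": pd.Int64Dtype(),
    "ulong": pd.Int64Dtype(),
    "float": np.float32,
    "string": str,
    "DateTime": str,
}


def parse_schema(header):
    """Split "FieldName:Type" entries into names, a dtype map, and date columns."""
    names, dtypes, date_columns = [], {}, []
    for entry in header:
        name, mag_type = entry.split(":")
        mag_type = mag_type.rstrip("?")  # drop the nullable marker
        names.append(name)
        if mag_type == "DateTime":
            date_columns.append(name)
        dtypes[name] = MAG_TO_PANDAS[mag_type]
    return names, dtypes, date_columns


# Hypothetical excerpt of a stream definition, for illustration only.
header = ["PaperId:long", "Rank:uint", "Doi:string", "Year:int?", "Date:DateTime?"]
names, dtypes, date_columns = parse_schema(header)
print(names)
print(date_columns)
```

Keeping the date columns in a separate list mirrors the (column_type_dict, date_column_names) pair that get_type is documented to return.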
Usage
Use this class within the KDD 2020 tutorial workflow when you need to load and parse Microsoft Academic Graph data files. Instantiate with the root directory containing MAG data files, then call get_data_frame with a stream name to load the corresponding entity data.
Code Reference
Source Location
- Repository: Recommenders
- File: examples/07_tutorials/KDD2020-tutorial/utils/PandasMagClass.py
- Lines: 1-249
Signature
class MicrosoftAcademicGraph:
    def __init__(self, root)
    def get_full_path(self, stream_name) -> str
    def get_header(self, stream_name) -> list
    def get_type(self, stream_name) -> tuple[dict, list]
    def get_name(self, stream_name) -> list
    def get_data_frame(self, stream_name) -> pd.DataFrame
Import
from utils.PandasMagClass import MicrosoftAcademicGraph
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| root | str | Yes | Root directory path containing the MAG data files (tab-separated .txt files). |
| stream_name | str | Yes | Name of the MAG entity stream to load (e.g., "Papers", "Authors", "Affiliations", "Journals", "FieldsOfStudy", "PaperReferences", "PaperRecommendations"). |
Outputs
| Name | Type | Description |
|---|---|---|
| return (get_data_frame) | pd.DataFrame | DataFrame with typed columns matching the MAG schema for the requested stream, including parsed DateTime columns. |
| return (get_full_path) | str | Full file path to the stream's .txt file. |
| return (get_header) | list | List of "FieldName:Type" schema strings for the stream. |
| return (get_type) | tuple[dict, list] | Tuple of (column_type_dict, date_column_names) for the stream. |
| return (get_name) | list | List of column names for the stream. |
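The DateTime parsing mentioned for get_data_frame can be sketched with the standard library. The helper parse_mag_date and the sample timestamp are hypothetical. Only the format string comes from the description. One detail worth knowing: in Python's strptime, %p only affects the parsed hour when the hour is read with %I, so with %H the AM/PM suffix is consumed but ignored.

```python
from datetime import datetime

# Format string quoted in the class description.
MAG_DATE_FORMAT = "%m/%d/%Y %H:%M:%S %p"


def parse_mag_date(text):
    """Parse one timestamp string (hypothetical helper, not the class's method)."""
    return datetime.strptime(text, MAG_DATE_FORMAT)


sample = "06/24/2016 12:00:00 AM"  # made-up value for illustration
parsed = parse_mag_date(sample)
print(parsed)  # 2016-06-24 12:00:00 (the AM suffix does not shift the hour)

# For comparison, a 12-hour format lets %p take effect.
twelve_hour = datetime.strptime(sample, "%m/%d/%Y %I:%M:%S %p")
print(twelve_hour)  # 2016-06-24 00:00:00
```

If only the date part of each timestamp matters downstream, the difference between the two formats is harmless. It matters only when the time of day is used.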
Usage Examples
Basic Usage
from utils.PandasMagClass import MicrosoftAcademicGraph
# Initialize with root directory containing MAG files
mag = MicrosoftAcademicGraph("/data/mag/")
# Load the Papers stream into a DataFrame
papers_df = mag.get_data_frame("Papers")
print(papers_df.columns.tolist())
# ['PaperId', 'Rank', 'Doi', 'DocType', 'PaperTitle', ...]
# Load Authors stream
authors_df = mag.get_data_frame("Authors")
# Get the schema for a stream
schema, date_cols = mag.get_type("Papers")
print(date_cols) # ['Date', 'OnlineDate', 'CreatedDate']
# Get file path for a stream
path = mag.get_full_path("Journals")
# Returns: "/data/mag/Journals.txt"
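To try the loading step without MAG files on disk, the following sketch reads a small in-memory table with pd.read_csv and explicit dtypes. The rows are made up, and the headerless tab-separated layout and the column subset are assumptions. It approximates what get_data_frame is described as doing rather than reproducing it.

```python
import io

import pandas as pd

# Two made-up rows, tab-separated, with no header line (assumed file layout).
# The second row leaves Doi and Year empty to show nullable handling.
raw = "1\t20000\t10.1000/x\t2016\n2\t19500\t\t\n"

# Illustrative subset of columns with pandas-compatible dtypes.
names = ["PaperId", "Rank", "Doi", "Year"]
dtypes = {
    "PaperId": pd.Int64Dtype(),
    "Rank": pd.Int64Dtype(),
    "Doi": str,
    "Year": pd.Int32Dtype(),
}

df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=names, dtype=dtypes)
print(df.dtypes)
print(df)
```

The nullable integer dtypes (pd.Int32Dtype, pd.Int64Dtype) let an empty field become a missing value while the column stays integer-typed, which a plain NumPy integer dtype cannot do.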
Dependencies
- pandas - DataFrame construction and CSV parsing
- numpy - dtype definitions (float32, string_). Note that np.string_ was removed in NumPy 2.0, so the module as described likely needs NumPy 1.x.