Implementation:Recommenders team Recommenders KDD2020 Task Helper
| Knowledge Sources | |
|---|---|
| Domains | Data Preprocessing, Knowledge Graphs, Academic Recommendation |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
The KDD2020 Task Helper module provides the complete data processing pipeline for the KDD 2020 tutorial, transforming raw Microsoft Academic Graph (MAG) data into model-ready training files for DKN and LightGCN recommendation experiments.
Description
This module contains over 20 functions that collectively orchestrate the data preparation workflow for academic paper recommendation. The pipeline spans several key stages:
Paper content generation is handled by gen_paper_content, which reads tab-separated paper title and abstract sentences, converts words and entities to integer indices using dictionaries (word2idx and entity2idx), and writes fixed-length feature vectors to output files. The convert2id and parse_entities functions support this by mapping sentence tokens and field-of-study entity annotations to their respective indices.
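The index-and-pad step described above can be sketched as follows. This is a minimal illustration, not the module's code: the helper name `encode_fixed_length`, the use of 0 as the padding id, and starting word ids at 1 are all assumptions made for the example.

```python
def encode_fixed_length(tokens, word2idx, doc_len=10):
    """Map tokens to integer ids, then truncate or zero-pad to doc_len.

    Hypothetical helper: id 0 is reserved for padding (an assumption),
    so newly seen words receive ids starting at 1.
    """
    ids = []
    for token in tokens[:doc_len]:  # truncate long inputs
        if token not in word2idx:
            word2idx[token] = len(word2idx) + 1  # grow the dictionary in place
        ids.append(word2idx[token])
    ids.extend([0] * (doc_len - len(ids)))  # pad short inputs
    return ids


word2idx = {}
title = "deep knowledge aware network".split()
print(encode_fixed_length(title, word2idx, doc_len=6))  # [1, 2, 3, 4, 0, 0]
```

Because the dictionary is updated in place, calling the helper on later sentences reuses the ids assigned to earlier ones, which mirrors the mutable word2idx and entity2idx arguments described in the I/O contract.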
Knowledge graph construction is performed by gen_knowledge_relations, which processes related field-of-study files to create entity-relation triples in the TransE training format (head, tail, relation), outputting train2id.txt, entity2id.txt, and relation2id.txt files.
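As a rough sketch of the triple indexing described above, the example below assigns integer ids to entities and relations and emits rows in (head, tail, relation) order. The function name `index_triples` and the in-memory input of (head, relation, tail) string tuples are assumptions; the layout of the actual input file is not specified here.

```python
def index_triples(triples, entity2idx, relation2idx):
    """Convert (head, relation, tail) strings into (head_id, tail_id, relation_id) rows.

    Hypothetical helper: ids are assigned in first-seen order, starting at 0.
    """
    rows = []
    for head, relation, tail in triples:
        for entity in (head, tail):
            entity2idx.setdefault(entity, len(entity2idx))
        relation2idx.setdefault(relation, len(relation2idx))
        rows.append((entity2idx[head], entity2idx[tail], relation2idx[relation]))
    return rows


entity2idx, relation2idx = {}, {}
triples = [
    ("machine learning", "related_to", "artificial intelligence"),
    ("artificial intelligence", "related_to", "natural language processing"),
]
for row in index_triples(triples, entity2idx, relation2idx):
    print("\t".join(map(str, row)))
```

The two dictionaries end up holding exactly the mappings that an entity2id.txt and relation2id.txt file would need to record.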
User behavior extraction is handled by get_author_reference_list, which builds author citation histories by tracing which papers each author cited and when, producing temporally ordered reference lists. The gen_experiment_splits function then creates train/valid/test splits with configurable negative sampling, which can be distributed across process_num worker processes.
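The two ideas in this stage, a date-ordered citation history per author and negative sampling of papers the author never cited, can be illustrated with the sketch below. Both function names are hypothetical, the history tuples are simplified to (cited_paper, cited_date), and uniform sampling with a fixed seed is an assumption rather than the module's documented strategy.

```python
import random


def build_reference_history(author2paper_list, paper2reference_list, paper2date):
    """Return author -> [(cited_paper, cited_date), ...] sorted by date."""
    history = {}
    for author, papers in author2paper_list.items():
        events = []
        for paper in papers:
            # A citation is dated by the publication date of the citing paper.
            cited_date = paper2date[paper]
            for ref in paper2reference_list.get(paper, []):
                events.append((ref, cited_date))
        events.sort(key=lambda event: event[1])  # ISO dates sort chronologically
        history[author] = events
    return history


def sample_negatives(positives, all_items, k, seed=0):
    """Draw up to k items the author never cited (uniform sampling is an assumption)."""
    candidates = sorted(set(all_items) - set(positives))
    return random.Random(seed).sample(candidates, min(k, len(candidates)))


author2paper_list = {"a1": ["p2", "p1"]}
paper2reference_list = {"p1": ["r1"], "p2": ["r2", "r3"]}
paper2date = {"p1": "2015-03-01", "p2": "2018-07-15"}

history = build_reference_history(author2paper_list, paper2reference_list, paper2date)
print(history["a1"])

cited = [ref for ref, _ in history["a1"]]
negatives = sample_negatives(cited, ["r1", "r2", "r3", "r4", "r5"], k=2)
print(negatives)
```

Keeping the history in chronological order makes it straightforward to hold out the most recent citations for validation and testing, although the exact split rule used by gen_experiment_splits is not stated here.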
Paper similarity computation is implemented in gen_paper_cocitation, which calculates both co-citation and co-reference counts between papers and applies optional normalization based on citation list lengths, with configurable minimum thresholds.
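To make the two similarity notions concrete, the sketch below counts co-citations (two papers appearing in the same reference list) and co-references (two papers sharing a cited paper). The function name `cocitation_and_coreference` is hypothetical, and the cosine-style normalization by the square root of the product of list lengths is an assumption; the module's actual normalization formula and thresholds are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cocitation_and_coreference(paper2refs, norm=True):
    """Return (cocitation, coreference) dicts keyed by sorted paper pairs."""
    # Invert the reference lists: cited paper -> set of papers citing it.
    cited_by = defaultdict(set)
    for paper, refs in paper2refs.items():
        for ref in refs:
            cited_by[ref].add(paper)

    # Co-citation: both papers appear in the same reference list.
    cocite = defaultdict(float)
    for refs in paper2refs.values():
        for a, b in combinations(sorted(set(refs)), 2):
            cocite[(a, b)] += 1

    # Co-reference: both papers cite the same paper.
    coref = defaultdict(float)
    for citers in cited_by.values():
        for a, b in combinations(sorted(citers), 2):
            coref[(a, b)] += 1

    if norm:
        # Assumed normalization: divide by the geometric mean of list lengths.
        for a, b in cocite:
            cocite[(a, b)] /= sqrt(len(cited_by[a]) * len(cited_by[b]))
        for a, b in coref:
            coref[(a, b)] /= sqrt(len(set(paper2refs[a])) * len(set(paper2refs[b])))
    return dict(cocite), dict(coref)


paper2refs = {"p1": ["x", "y"], "p2": ["x", "y"], "p3": ["x"]}
cocite, coref = cocitation_and_coreference(paper2refs, norm=False)
print(cocite)
print(coref)
```

With normalization switched off the values are raw counts, which is usually the easier form to sanity-check on a small sample before applying any threshold.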
Embedding preparation functions (format_word_embeddings, format_knowledge_embeddings, gen_context_embedding) convert text-based embedding files into NumPy arrays and build context embeddings by averaging the vectors of each entity's knowledge graph neighbors.
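The neighbor-averaging idea behind context embeddings can be sketched as below. The helper name `context_embeddings` is hypothetical, edges are treated as undirected, and entities without neighbors receive a zero vector; all three are assumptions. Plain Python lists are used to keep the sketch dependency-free.

```python
def context_embeddings(entity_vecs, edges):
    """Average the vectors of each entity's graph neighbors.

    entity_vecs: dict mapping entity id -> list of floats (all the same length)
    edges: iterable of (head, tail) pairs, treated as undirected (an assumption)
    """
    dim = len(next(iter(entity_vecs.values())))
    neighbors = {entity: set() for entity in entity_vecs}
    for head, tail in edges:
        if head in neighbors and tail in neighbors:
            neighbors[head].add(tail)
            neighbors[tail].add(head)

    context = {}
    for entity, nbrs in neighbors.items():
        if not nbrs:
            context[entity] = [0.0] * dim  # isolated entity: zero vector (assumed)
            continue
        context[entity] = [
            sum(entity_vecs[n][i] for n in nbrs) / len(nbrs) for i in range(dim)
        ]
    return context


entity_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [5.0, 5.0]}
edges = [(0, 1), (0, 2)]
print(context_embeddings(entity_vecs, edges))
```

Entity 0 has neighbors 1 and 2, so its context vector is the mean of their embeddings, while entity 3 has no edges and falls back to zeros.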
LightGCN data preparation is performed by prepare_dataset and its helpers, which load instance files, expand user behavior histories, and write formatted training and validation files for the LightGCN model.
Usage
Use this module when following the KDD 2020 tutorial notebooks (steps 1 through 5) for building academic paper recommendation systems. It is designed for processing Microsoft Academic Graph data and preparing it for both knowledge-aware (DKN) and graph-based (LightGCN) recommendation models. The functions are typically called from Jupyter notebooks in sequence to build the complete data pipeline.
Code Reference
Source Location
- Repository: Recommenders
- File: examples/07_tutorials/KDD2020-tutorial/utils/task_helper.py
- Lines: 1-926
Signature
def gen_paper_content(
    InFile_PaperTitleAbs_bySentence,
    OutFileName,
    word2idx,
    entity2idx,
    field=["Title"],
    doc_len=10,
): ...

def parse_entities(fieldOfStudy, entity2idx, cnt): ...

def convert2id(sentence, fieldOfStudy, word2idx, entity2idx): ...

def gen_knowledge_relations(
    InFile_RelatedFieldOfStudy, OutFile_dirname, entity2idx, relation2idx
): ...

def gen_indexed_sentence_collection(
    InFile_PaperTitleAbs_bySentence, OutFileName, word2idx
): ...

def gen_sentence_collection(
    InFile_PaperTitleAbs_bySentence, OutFileName, word2idx
): ...

def get_author_reference_list(
    author2paper_list, paper2reference_list, paper2date
): ...

def gen_experiment_splits(
    file_Author2ReferencePapers,
    OutFile_dir,
    InFile_paper_feature,
    tag,
    item_ratio=1.0,
    process_num=1,
): ...

def gen_paper_cocitation(InFile_PaperReference, norm=True): ...

def format_word_embeddings(word_vecfile, word2id_file, np_file): ...

def format_knowledge_embeddings(transE_vecfile, np_file): ...

def gen_context_embedding(entity_file, context_file, kg_file, dim): ...

def prepare_dataset(output_folder, input_folder, tag): ...

def group_labels(labels, preds, group_keys): ...
Import
from utils.task_helper import (
    gen_paper_content,
    gen_knowledge_relations,
    gen_experiment_splits,
    gen_paper_cocitation,
    format_word_embeddings,
    format_knowledge_embeddings,
    gen_context_embedding,
    prepare_dataset,
    group_labels,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| InFile_PaperTitleAbs_bySentence | str | Yes | Path to tab-separated file containing paper ID, category, position, sentence text, and field-of-study annotations |
| OutFileName | str | Yes | Output file path for generated content features or sentence collections |
| word2idx | dict | Yes | Mutable dictionary mapping word strings to integer indices, updated in place during processing |
| entity2idx | dict | Yes | Mutable dictionary mapping entity IDs to integer indices, updated in place during processing |
| field | list of str | No | Which fields to include, e.g. ["Title"] or ["Title", "Abstract"] (default: ["Title"]) |
| doc_len | int | No | Fixed document length for padding or truncating feature vectors (default: 10) |
| InFile_RelatedFieldOfStudy | str | Yes | Path to file containing related field-of-study triples for knowledge graph construction |
| relation2idx | dict | Yes | Mutable dictionary mapping relation names to integer indices |
| author2paper_list | dict | Yes | Dictionary mapping author IDs to their published paper lists |
| paper2reference_list | dict | Yes | Dictionary mapping paper IDs to their reference (cited paper) lists |
| paper2date | dict | Yes | Dictionary mapping paper IDs to publication dates |
| file_Author2ReferencePapers | str | Yes | Path to author-to-reference-papers file for generating experiment splits |
| InFile_paper_feature | str | Yes | Path to paper feature file used to filter items that have features |
| tag | str | Yes | Tag string used for naming output files (e.g., "citeulike") |
| item_ratio | float | No | Fraction of items to sample for experiments (default: 1.0) |
| process_num | int | No | Number of parallel processes for negative sampling (default: 1) |
| InFile_PaperReference | str | Yes | Path to paper reference file for co-citation computation |
| norm | bool | No | Whether to apply normalization to co-citation and co-reference scores (default: True) |
Outputs
| Name | Type | Description |
|---|---|---|
| gen_paper_content return | tuple(dict, dict) | Updated word2idx and entity2idx dictionaries after processing all papers |
| get_author_reference_list return | dict | Dictionary mapping author IDs to temporally sorted lists of (paper_id, publish_date, cited_date) tuples |
| gen_paper_cocitation return | tuple(dict, dict) | Pair of dictionaries: co-citation counts and co-reference counts between paper pairs, optionally normalized |
| group_labels return | tuple(list, list) | Lists of grouped labels and predictions, divided by group keys for evaluation |
| gen_experiment_splits side effects | files | Writes train, valid, test, user_history, and item2freq files to OutFile_dir |
| prepare_dataset side effects | files | Writes lightgcn_train and lightgcn_valid files to output_folder |
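The grouping behavior described for group_labels can be illustrated with the sketch below, which splits parallel lists of labels and predictions by a group key so that ranking metrics can be computed per group. The name `group_by_key` is hypothetical and the first-seen ordering of groups is an assumption; this is not the module's implementation.

```python
from collections import defaultdict


def group_by_key(labels, preds, group_keys):
    """Split labels and predictions into per-group lists.

    Hypothetical helper: groups are returned in first-seen key order.
    """
    grouped_labels = defaultdict(list)
    grouped_preds = defaultdict(list)
    for label, pred, key in zip(labels, preds, group_keys):
        grouped_labels[key].append(label)
        grouped_preds[key].append(pred)
    keys = list(grouped_labels)  # dicts preserve insertion order
    return [grouped_labels[k] for k in keys], [grouped_preds[k] for k in keys]


labels = [1, 0, 0, 1, 0]
preds = [0.9, 0.2, 0.4, 0.7, 0.1]
users = ["u1", "u1", "u2", "u2", "u2"]
all_labels, all_preds = group_by_key(labels, preds, users)
print(all_labels)  # [[1, 0], [0, 1, 0]]
print(all_preds)   # [[0.9, 0.2], [0.4, 0.7, 0.1]]
```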
Usage Examples
Paper Content Generation
from utils.task_helper import gen_paper_content

word2idx = {}
entity2idx = {}

# Generate paper content features from title and abstract sentences
word2idx, entity2idx = gen_paper_content(
    InFile_PaperTitleAbs_bySentence="data/PaperTitleAbs_bySentence.txt",
    OutFileName="data/paper_content.txt",
    word2idx=word2idx,
    entity2idx=entity2idx,
    field=["Title", "Abstract"],
    doc_len=20,
)
Knowledge Graph Construction
from utils.task_helper import gen_knowledge_relations

entity2idx = {}
relation2idx = {}

gen_knowledge_relations(
    InFile_RelatedFieldOfStudy="data/RelatedFieldOfStudy.txt",
    OutFile_dirname="data/knowledge/",
    entity2idx=entity2idx,
    relation2idx=relation2idx,
)
# Produces: train2id.txt, entity2id.txt, relation2id.txt in data/knowledge/
Experiment Splits with Negative Sampling
from utils.task_helper import gen_experiment_splits

gen_experiment_splits(
    file_Author2ReferencePapers="data/Author2ReferencePapers.txt",
    OutFile_dir="data/splits/",
    InFile_paper_feature="data/paper_content.txt",
    tag="citeulike",
    item_ratio=1.0,
    process_num=4,
)
# Produces: train_citeulike.txt, valid_citeulike.txt, test_citeulike.txt
Embedding Preparation
from utils.task_helper import format_word_embeddings, gen_context_embedding

# Convert word embeddings to numpy format
format_word_embeddings(
    word_vecfile="data/word2vec.txt",
    word2id_file="data/word2idx.pkl",
    np_file="data/word_embeddings.npy",
)

# Generate context embeddings from knowledge graph neighbors
gen_context_embedding(
    entity_file="data/entity_embeddings.tsv",
    context_file="data/context_embeddings.tsv",
    kg_file="data/knowledge/train2id.txt",
    dim=100,
)
LightGCN Data Preparation
from utils.task_helper import prepare_dataset

prepare_dataset(
    output_folder="data/lightgcn/",
    input_folder="data/splits/",
    tag="citeulike",
)
# Produces: lightgcn_train_citeulike.txt, lightgcn_valid_citeulike.txt