Implementation:Recommenders team Recommenders KDD2020 Task Helper

Knowledge Sources	Recommenders
Domains	Data Preprocessing, Knowledge Graphs, Academic Recommendation
Last Updated	2026-02-10 00:00 GMT

Overview

The KDD2020 Task Helper module provides the complete data processing pipeline for the KDD 2020 tutorial, transforming raw Microsoft Academic Graph (MAG) data into model-ready training files for DKN and LightGCN recommendation experiments.

Description

This module contains over 20 functions that collectively orchestrate the data preparation workflow for academic paper recommendation. The pipeline spans several key stages:

Paper content generation is handled by gen_paper_content, which reads tab-separated paper title and abstract sentences, converts words and entities to integer indices using dictionaries (word2idx and entity2idx), and writes fixed-length feature vectors to output files. The convert2id and parse_entities functions support this by mapping sentence tokens and field-of-study entity annotations to their respective indices.
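
The indexing and fixed-length step described above can be sketched as follows. This is a minimal illustration, not the module's code: the helper name tokens_to_fixed_ids, the use of id 0 for padding, and the choice to start real ids at 1 are all assumptions made for the example.

```python
def tokens_to_fixed_ids(tokens, vocab, doc_len, pad_id=0):
    """Map tokens to integer ids, growing `vocab` in place, then pad or truncate.

    Hypothetical helper: id 0 is assumed to be reserved for padding.
    """
    ids = []
    for token in tokens:
        if token not in vocab:
            # First real token gets id 1 because 0 is the assumed padding id.
            vocab[token] = len(vocab) + 1
        ids.append(vocab[token])
    ids = ids[:doc_len]                      # truncate inputs longer than doc_len
    ids += [pad_id] * (doc_len - len(ids))   # pad inputs shorter than doc_len
    return ids


word2idx = {}
title = "graph neural networks for recommendation".split()
print(tokens_to_fixed_ids(title, word2idx, doc_len=8))
# [1, 2, 3, 4, 5, 0, 0, 0]
```

Because the dictionary is updated in place, calling the helper on a second sentence reuses ids for words already seen, which mirrors the "mutable dictionary" contract listed for word2idx and entity2idx.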

Knowledge graph construction is performed by gen_knowledge_relations, which processes related field-of-study files to create entity-relation triples in the TransE training format (head, tail, relation), outputting train2id.txt, entity2id.txt, and relation2id.txt files.
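
As a rough sketch of the triple indexing described above, the snippet below assigns integer ids on first sight and emits each triple in (head, tail, relation) order. The function name index_triples, the tab separator, and the input tuple layout are assumptions for illustration; the exact on-disk layout of train2id.txt is not specified on this page.

```python
def index_triples(raw_triples, entity2idx, relation2idx):
    """Return 'head<TAB>tail<TAB>relation' lines, assigning new ids on first sight.

    Hypothetical helper: `raw_triples` holds (head, relation, tail) names.
    """
    def get_id(key, table):
        if key not in table:
            table[key] = len(table)  # next unused integer id
        return table[key]

    lines = []
    for head, relation, tail in raw_triples:
        h = get_id(head, entity2idx)
        t = get_id(tail, entity2idx)
        r = get_id(relation, relation2idx)
        # Relation id goes last, matching the (head, tail, relation) order.
        lines.append(f"{h}\t{t}\t{r}")
    return lines


entity2idx, relation2idx = {}, {}
raw = [
    ("machine_learning", "parent_of", "deep_learning"),
    ("machine_learning", "related_to", "statistics"),
]
lines = index_triples(raw, entity2idx, relation2idx)
```

After the call, entity2idx and relation2idx hold the mappings that would be written to entity2id.txt and relation2id.txt.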

User behavior extraction is handled by get_author_reference_list, which builds each author's citation history by tracing which papers the author cited and when, producing temporally ordered reference lists. The gen_experiment_splits function then creates train/valid/test splits with negative sampling; item_ratio controls the fraction of items sampled, and process_num sets how many parallel processes perform the sampling.
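
One way to picture the citation-history step is the sketch below. It is an illustration under assumptions, not the module's implementation: the function name build_author_history is hypothetical, dates are assumed to be ISO-formatted strings (so they sort chronologically as text), and the citation date is assumed to be the publication date of the citing paper.

```python
def build_author_history(author2paper_list, paper2reference_list, paper2date):
    """Map each author to (cited_paper, publish_date, cited_date) tuples, oldest first."""
    history = {}
    for author, papers in author2paper_list.items():
        events = []
        for paper in papers:
            cited_date = paper2date.get(paper)
            if cited_date is None:
                continue  # assumption: skip citing papers with no known date
            for ref in paper2reference_list.get(paper, []):
                events.append((ref, paper2date.get(ref), cited_date))
        # Order by the date on which the author made the citation.
        events.sort(key=lambda event: event[2])
        history[author] = events
    return history


author2paper = {"a1": ["p3", "p2"]}
paper2refs = {"p2": ["p0"], "p3": ["p0", "p1"]}
paper2date = {
    "p0": "2015-01-01",
    "p1": "2016-06-01",
    "p2": "2017-03-01",
    "p3": "2018-09-01",
}
history = build_author_history(author2paper, paper2refs, paper2date)
```

Here author a1 cites p0 in 2017 (through p2) and then p0 and p1 in 2018 (through p3), so the resulting list is ordered by those citation dates.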

Paper similarity computation is implemented in gen_paper_cocitation, which calculates both co-citation and co-reference counts between papers and applies optional normalization based on citation list lengths, with configurable minimum thresholds.
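
The two similarity notions above can be sketched as follows. Two papers are co-cited when a third paper cites both, and two papers are co-referencing when they share a cited paper. The function names, the min_count parameter, and the normalization formula (dividing by the square root of the product of the two list lengths) are assumptions chosen for illustration; the module's exact normalization is not documented here.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pair_counts(groups):
    """Count how many groups each unordered pair of members appears in together."""
    counts = defaultdict(int)
    for members in groups:
        for a, b in combinations(sorted(set(members)), 2):
            counts[(a, b)] += 1
    return counts


def cocitation_and_coreference(paper2refs, norm=True, min_count=1):
    # Co-citation: every citing paper adds one count to each pair it cites.
    cocite = pair_counts(paper2refs.values())
    # Invert the mapping: cited paper -> set of papers citing it.
    cited_by = defaultdict(set)
    for paper, refs in paper2refs.items():
        for ref in refs:
            cited_by[ref].add(paper)
    # Co-reference: every cited paper adds one count to each pair citing it.
    coref = pair_counts(cited_by.values())
    # Drop pairs below the (hypothetical) minimum raw count.
    cocite = {k: v for k, v in cocite.items() if v >= min_count}
    coref = {k: v for k, v in coref.items() if v >= min_count}
    if norm:
        # Assumed normalization: geometric mean of the two list lengths.
        cocite = {(a, b): c / sqrt(len(cited_by[a]) * len(cited_by[b]))
                  for (a, b), c in cocite.items()}
        coref = {(a, b): c / sqrt(len(set(paper2refs[a])) * len(set(paper2refs[b])))
                 for (a, b), c in coref.items()}
    return cocite, coref


refs = {"p1": ["x", "y"], "p2": ["x", "y", "z"], "p3": ["z"]}
raw_cocite, raw_coref = cocitation_and_coreference(refs, norm=False)
```

In this toy input, x and y are cited together by both p1 and p2, so their raw co-citation count is 2, and p1 and p2 share two references, so their raw co-reference count is 2.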

Embedding preparation is handled by three functions: format_word_embeddings and format_knowledge_embeddings convert text-based embedding files into NumPy arrays, and gen_context_embedding builds a context embedding for each entity by averaging the vectors of its knowledge graph neighbors.
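
The neighbor-averaging idea behind context embeddings can be sketched as below. This is a simplified illustration using plain Python lists so that it stays self-contained; the function name context_embeddings, the treatment of edges as undirected, and the zero vector for entities without neighbors are assumptions, not documented behavior of gen_context_embedding.

```python
def context_embeddings(entity_vecs, triples, dim):
    """Average the vectors of each entity's knowledge graph neighbors.

    entity_vecs: dict of entity id -> list of `dim` floats
    triples: iterable of (head, tail, relation) ids
    """
    neighbors = {entity: set() for entity in entity_vecs}
    for head, tail, _relation in triples:
        if head in entity_vecs and tail in entity_vecs:
            # Assumption: an edge makes both endpoints neighbors of each other.
            neighbors[head].add(tail)
            neighbors[tail].add(head)

    context = {}
    for entity, nbrs in neighbors.items():
        if not nbrs:
            context[entity] = [0.0] * dim  # assumption: isolated entity -> zeros
            continue
        sums = [0.0] * dim
        for nbr in nbrs:
            for i, value in enumerate(entity_vecs[nbr]):
                sums[i] += value
        context[entity] = [s / len(nbrs) for s in sums]
    return context


vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0], 3: [5.0, 5.0]}
kg = [(0, 1, 0), (0, 2, 1)]
ctx = context_embeddings(vecs, kg, dim=2)
```

Entity 0 has neighbors 1 and 2, so its context vector is the mean of [0.0, 1.0] and [2.0, 2.0]; entity 3 has no edges and falls back to zeros.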

LightGCN data preparation is performed by prepare_dataset and its helpers, which load instance files, expand user behavior histories, and write formatted training and validation files for the LightGCN model.

Usage

Use this module when following the KDD 2020 tutorial notebooks (steps 1 through 5) for building academic paper recommendation systems. It is designed for processing Microsoft Academic Graph data and preparing it for both knowledge-aware (DKN) and graph-based (LightGCN) recommendation models. The functions are typically called from Jupyter notebooks in sequence to build the complete data pipeline.

Code Reference

Source Location

Repository: Recommenders
File: examples/07_tutorials/KDD2020-tutorial/utils/task_helper.py
Lines: 1-926

Signature

def gen_paper_content(
    InFile_PaperTitleAbs_bySentence,
    OutFileName,
    word2idx,
    entity2idx,
    field=["Title"],
    doc_len=10,
): ...

def parse_entities(fieldOfStudy, entity2idx, cnt): ...

def convert2id(sentence, fieldOfStudy, word2idx, entity2idx): ...

def gen_knowledge_relations(
    InFile_RelatedFieldOfStudy, OutFile_dirname, entity2idx, relation2idx
): ...

def gen_indexed_sentence_collection(
    InFile_PaperTitleAbs_bySentence, OutFileName, word2idx
): ...

def gen_sentence_collection(
    InFile_PaperTitleAbs_bySentence, OutFileName, word2idx
): ...

def get_author_reference_list(
    author2paper_list, paper2reference_list, paper2date
): ...

def gen_experiment_splits(
    file_Author2ReferencePapers,
    OutFile_dir,
    InFile_paper_feature,
    tag,
    item_ratio=1.0,
    process_num=1,
): ...

def gen_paper_cocitation(InFile_PaperReference, norm=True): ...

def format_word_embeddings(word_vecfile, word2id_file, np_file): ...

def format_knowledge_embeddings(transE_vecfile, np_file): ...

def gen_context_embedding(entity_file, context_file, kg_file, dim): ...

def prepare_dataset(output_folder, input_folder, tag): ...

def group_labels(labels, preds, group_keys): ...

Import

from utils.task_helper import (
    gen_paper_content,
    gen_knowledge_relations,
    gen_experiment_splits,
    gen_paper_cocitation,
    format_word_embeddings,
    format_knowledge_embeddings,
    gen_context_embedding,
    prepare_dataset,
    group_labels,
)

I/O Contract

Inputs

Name	Type	Required	Description
InFile_PaperTitleAbs_bySentence	str	Yes	Path to tab-separated file containing paper ID, category, position, sentence text, and field-of-study annotations
OutFileName	str	Yes	Output file path for generated content features or sentence collections
word2idx	dict	Yes	Mutable dictionary mapping word strings to integer indices, updated in place during processing
entity2idx	dict	Yes	Mutable dictionary mapping entity IDs to integer indices, updated in place during processing
field	list of str	No	Which fields to include, e.g. ["Title"] or ["Title", "Abstract"] (default: ["Title"])
doc_len	int	No	Fixed document length for padding or truncating feature vectors (default: 10)
InFile_RelatedFieldOfStudy	str	Yes	Path to file containing related field-of-study triples for knowledge graph construction
relation2idx	dict	Yes	Mutable dictionary mapping relation names to integer indices
author2paper_list	dict	Yes	Dictionary mapping author IDs to their published paper lists
paper2reference_list	dict	Yes	Dictionary mapping paper IDs to their reference (cited paper) lists
paper2date	dict	Yes	Dictionary mapping paper IDs to publication dates
file_Author2ReferencePapers	str	Yes	Path to author-to-reference-papers file for generating experiment splits
InFile_paper_feature	str	Yes	Path to paper feature file used to filter items that have features
tag	str	Yes	Tag string used for naming output files (e.g., "citeulike")
item_ratio	float	No	Fraction of items to sample for experiments (default: 1.0)
process_num	int	No	Number of parallel processes for negative sampling (default: 1)
InFile_PaperReference	str	Yes	Path to paper reference file for co-citation computation
norm	bool	No	Whether to apply normalization to co-citation and co-reference scores (default: True)

Outputs

Name	Type	Description
gen_paper_content return	tuple(dict, dict)	Updated word2idx and entity2idx dictionaries after processing all papers
get_author_reference_list return	dict	Dictionary mapping author IDs to temporally sorted lists of (paper_id, publish_date, cited_date) tuples
gen_paper_cocitation return	tuple(dict, dict)	Pair of dictionaries: co-citation counts and co-reference counts between paper pairs, optionally normalized
group_labels return	tuple(list, list)	Lists of grouped labels and predictions, divided by group keys for evaluation
gen_experiment_splits side effects	files	Writes train, valid, test, user_history, and item2freq files to OutFile_dir
prepare_dataset side effects	files	Writes lightgcn_train and lightgcn_valid files to output_folder
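
To make the group_labels return shape concrete, here is a hedged sketch of grouping labels and predictions by key, as is commonly done before computing per-user ranking metrics. The name group_by_key is hypothetical and the group ordering (first appearance of each key) is an assumption; the module's own ordering may differ.

```python
def group_by_key(labels, preds, group_keys):
    """Split parallel label/prediction lists into one sub-list per group key."""
    grouped = {}
    for label, pred, key in zip(labels, preds, group_keys):
        bucket = grouped.setdefault(key, ([], []))
        bucket[0].append(label)
        bucket[1].append(pred)
    # dicts keep insertion order, so groups follow first appearance of each key.
    all_labels = [bucket[0] for bucket in grouped.values()]
    all_preds = [bucket[1] for bucket in grouped.values()]
    return all_labels, all_preds


labels = [1, 0, 0, 1]
preds = [0.9, 0.2, 0.4, 0.7]
keys = ["u1", "u1", "u2", "u2"]
grouped_labels, grouped_preds = group_by_key(labels, preds, keys)
# grouped_labels == [[1, 0], [0, 1]]
# grouped_preds  == [[0.9, 0.2], [0.4, 0.7]]
```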

Usage Examples

Paper Content Generation

from utils.task_helper import gen_paper_content

word2idx = {}
entity2idx = {}

# Generate paper content features from title and abstract sentences
word2idx, entity2idx = gen_paper_content(
    InFile_PaperTitleAbs_bySentence="data/PaperTitleAbs_bySentence.txt",
    OutFileName="data/paper_content.txt",
    word2idx=word2idx,
    entity2idx=entity2idx,
    field=["Title", "Abstract"],
    doc_len=20,
)

Knowledge Graph Construction

from utils.task_helper import gen_knowledge_relations

entity2idx = {}
relation2idx = {}

gen_knowledge_relations(
    InFile_RelatedFieldOfStudy="data/RelatedFieldOfStudy.txt",
    OutFile_dirname="data/knowledge/",
    entity2idx=entity2idx,
    relation2idx=relation2idx,
)
# Produces: train2id.txt, entity2id.txt, relation2id.txt in data/knowledge/

Experiment Splits with Negative Sampling

from utils.task_helper import gen_experiment_splits

gen_experiment_splits(
    file_Author2ReferencePapers="data/Author2ReferencePapers.txt",
    OutFile_dir="data/splits/",
    InFile_paper_feature="data/paper_content.txt",
    tag="citeulike",
    item_ratio=1.0,
    process_num=4,
)
# Produces: train_citeulike.txt, valid_citeulike.txt, test_citeulike.txt
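
For readers unfamiliar with negative sampling, the single-process sketch below shows the general idea: each observed (user, item) pair is kept as a positive and paired with a few items the user never interacted with. The function names, the num_neg parameter, and the (label, user, item) row layout are assumptions for illustration and do not describe the files gen_experiment_splits writes.

```python
import random


def sample_negatives(positives, all_items, num_neg, rng):
    """Pick up to `num_neg` items that are not in the user's positive set."""
    candidates = [item for item in all_items if item not in positives]
    return rng.sample(candidates, min(num_neg, len(candidates)))


def build_instances(user2items, all_items, num_neg=4, seed=0):
    """Return (label, user, item) rows: 1 for observed pairs, 0 for sampled negatives."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    rows = []
    for user, items in user2items.items():
        positives = set(items)
        for item in items:
            rows.append((1, user, item))
            for negative in sample_negatives(positives, all_items, num_neg, rng):
                rows.append((0, user, negative))
    return rows


catalog = ["p1", "p2", "p3", "p4", "p5", "p6"]
rows = build_instances({"a1": ["p1", "p2"]}, catalog, num_neg=2)
```

With two positives and num_neg=2, the result holds six rows: two labeled 1 and four labeled 0, none of which pair a1 with p1 or p2 as a negative. A multi-process variant would split the users across workers and concatenate the rows.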

Embedding Preparation

from utils.task_helper import format_word_embeddings, gen_context_embedding

# Convert word embeddings to numpy format
format_word_embeddings(
    word_vecfile="data/word2vec.txt",
    word2id_file="data/word2idx.pkl",
    np_file="data/word_embeddings.npy",
)

# Generate context embeddings from knowledge graph neighbors
gen_context_embedding(
    entity_file="data/entity_embeddings.tsv",
    context_file="data/context_embeddings.tsv",
    kg_file="data/knowledge/train2id.txt",
    dim=100,
)

LightGCN Data Preparation

from utils.task_helper import prepare_dataset

prepare_dataset(
    output_folder="data/lightgcn/",
    input_folder="data/splits/",
    tag="citeulike",
)
# Produces: lightgcn_train_citeulike.txt, lightgcn_valid_citeulike.txt
