Implementation:Recommenders team Recommenders KDD2020 Task Helper

From Leeroopedia


Knowledge Sources
Domains: Data Preprocessing, Knowledge Graphs, Academic Recommendation
Last Updated: 2026-02-10 00:00 GMT

Overview

The KDD2020 Task Helper module provides the complete data processing pipeline for the KDD 2020 tutorial, transforming raw Microsoft Academic Graph (MAG) data into model-ready training files for DKN and LightGCN recommendation experiments.

Description

This module contains over 20 functions that collectively orchestrate the data preparation workflow for academic paper recommendation. The pipeline spans several key stages:

Paper content generation is handled by gen_paper_content, which reads tab-separated paper title and abstract sentences, converts words and entities to integer indices using dictionaries (word2idx and entity2idx), and writes fixed-length feature vectors to output files. The convert2id and parse_entities functions support this by mapping sentence tokens and field-of-study entity annotations to their respective indices.
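
As a rough illustration of the indexing step described above, the sketch below maps tokens to integer ids, grows the dictionary in place, and pads or truncates the result to a fixed length. The helper name tokens_to_ids and the choice of index 0 as the padding id are assumptions made for this example; they are not taken from the module's source.

```python
def tokens_to_ids(tokens, vocab, doc_len, pad_id=0):
    """Map tokens to integer ids, growing `vocab` in place, then pad or truncate."""
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # assign the next free index
        ids.append(vocab[tok])
    ids = ids[:doc_len]                      # truncate long inputs
    ids += [pad_id] * (doc_len - len(ids))   # pad short inputs
    return ids


# Assumption: index 0 is reserved for a padding token.
word2idx = {"NULL": 0}
short_ids = tokens_to_ids("graph neural networks".split(), word2idx, doc_len=5)
long_ids = tokens_to_ids("a survey of graph neural networks".split(), word2idx, doc_len=4)
print(short_ids)
print(long_ids)
```

Because the dictionary is updated in place, a second call reuses the ids assigned by the first, which is why the same dictionaries can be passed through several processing steps.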

Knowledge graph construction is performed by gen_knowledge_relations, which processes related field-of-study files to create entity-relation triples in the TransE training format (head, tail, relation), outputting train2id.txt, entity2id.txt, and relation2id.txt files.
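
The triple encoding described above can be sketched as follows. The helper names get_or_add and encode_triples, the tab separator, and the sample relation name are assumptions for illustration; the exact on-disk layout of train2id.txt (for example, whether it starts with a count line) is not specified on this page and is not reproduced here.

```python
def get_or_add(mapping, key):
    """Return the index for `key`, assigning the next free index if it is new."""
    if key not in mapping:
        mapping[key] = len(mapping)
    return mapping[key]


def encode_triples(triples, entity2idx, relation2idx):
    """Encode (head, relation, tail) name triples as 'head tail relation' id rows."""
    rows = []
    for head, relation, tail in triples:
        h = get_or_add(entity2idx, head)
        t = get_or_add(entity2idx, tail)
        r = get_or_add(relation2idx, relation)
        rows.append(f"{h}\t{t}\t{r}")  # column order: head, tail, relation
    return rows


# Hypothetical field-of-study names, used only for this example.
triples = [
    ("machine_learning", "related_to", "deep_learning"),
    ("deep_learning", "related_to", "neural_network"),
]
entity2idx, relation2idx = {}, {}
rows = encode_triples(triples, entity2idx, relation2idx)
print(rows)
```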

User behavior extraction is handled by get_author_reference_list, which builds author citation histories by tracing which papers each author cited and when, producing temporally ordered reference lists. The gen_experiment_splits function then creates train/valid/test splits with configurable negative sampling using multiprocess parallelism for efficiency.
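
A minimal sketch of the citation-history idea, assuming dates are ISO-formatted strings (so they sort chronologically as text) and that papers with unknown dates are skipped. The function name sketch_author_reference_list and both of those choices are assumptions; the module's own handling may differ.

```python
def sketch_author_reference_list(author2paper_list, paper2reference_list, paper2date):
    """Build, per author, a list of (cited_paper, publish_date, cited_date) tuples
    ordered by the date on which the citation was made."""
    author2refs = {}
    for author, papers in author2paper_list.items():
        history = []
        for paper in papers:
            cited_date = paper2date.get(paper)
            if cited_date is None:
                continue  # assumption: skip citing papers with unknown dates
            for ref in paper2reference_list.get(paper, []):
                if ref in paper2date:
                    history.append((ref, paper2date[ref], cited_date))
        history.sort(key=lambda item: item[2])  # order by citing date
        author2refs[author] = history
    return author2refs


author2paper_list = {"a1": ["p3", "p2"]}
paper2reference_list = {"p2": ["p0"], "p3": ["p1", "p0"]}
paper2date = {
    "p0": "2010-01-01",
    "p1": "2012-05-01",
    "p2": "2015-03-01",
    "p3": "2018-07-01",
}
history = sketch_author_reference_list(author2paper_list, paper2reference_list, paper2date)
print(history["a1"])
```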

Paper similarity computation is implemented in gen_paper_cocitation, which calculates both co-citation and co-reference counts between papers and applies optional normalization based on citation list lengths, with configurable minimum thresholds.
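
To make the two similarity notions concrete, the sketch below computes raw counts only: co-citation counts pairs of papers that appear together in some reference list, and co-reference counts the references shared by two citing papers. The function names and the min_count parameter are assumptions for this example, and the normalization step is omitted because its formula is not stated on this page.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(paper2refs, min_count=1):
    """Count how often two papers are cited together by the same citing paper."""
    counts = Counter()
    for refs in paper2refs.values():
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


def coreference_counts(paper2refs, min_count=1):
    """Count how many references two citing papers have in common."""
    ref_sets = {paper: set(refs) for paper, refs in paper2refs.items()}
    counts = {}
    for a, b in combinations(sorted(ref_sets), 2):
        shared = len(ref_sets[a] & ref_sets[b])
        if shared >= min_count:
            counts[(a, b)] = shared
    return counts


paper2refs = {"p1": ["r1", "r2", "r3"], "p2": ["r1", "r2"], "p3": ["r3"]}
cocited = cocitation_counts(paper2refs, min_count=2)
coreferenced = coreference_counts(paper2refs)
print(cocited)
print(coreferenced)
```

The pairwise loop in coreference_counts is quadratic in the number of papers, which is fine for a toy example but would need an inverted index on realistic data.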

Embedding preparation functions (format_word_embeddings, format_knowledge_embeddings, gen_context_embedding) convert text-based embedding files into NumPy arrays and build a context embedding for each entity by averaging the embedding vectors of its knowledge graph neighbors.
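
The neighbor-averaging idea can be sketched in plain Python as follows. Treating edges as undirected, excluding the entity's own vector, and using a zero vector for entities without neighbors are all assumptions of this example, as is the function name context_embeddings; the module itself works on files rather than in-memory dictionaries.

```python
from collections import defaultdict


def context_embeddings(entity_vecs, edges, dim):
    """Average the vectors of each entity's neighbors in the knowledge graph."""
    neighbors = defaultdict(set)
    for head, tail in edges:
        neighbors[head].add(tail)  # assumption: edges are treated as undirected
        neighbors[tail].add(head)

    context = {}
    for entity in entity_vecs:
        vecs = [entity_vecs[n] for n in neighbors.get(entity, ()) if n in entity_vecs]
        if vecs:
            # Column-wise mean over the neighbor vectors.
            context[entity] = [sum(col) / len(vecs) for col in zip(*vecs)]
        else:
            context[entity] = [0.0] * dim  # assumption: isolated entities get zeros
    return context


entity_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [5.0, 5.0]}
edges = [(0, 1), (0, 2)]
context = context_embeddings(entity_vecs, edges, dim=2)
print(context)
```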

LightGCN data preparation is performed by prepare_dataset and its helpers, which load instance files, expand user behavior histories, and write formatted training and validation files for the LightGCN model.

Usage

Use this module when following the KDD 2020 tutorial notebooks (steps 1 through 5) for building academic paper recommendation systems. It is designed for processing Microsoft Academic Graph data and preparing it for both knowledge-aware (DKN) and graph-based (LightGCN) recommendation models. The functions are typically called from Jupyter notebooks in sequence to build the complete data pipeline.

Code Reference

Source Location

Signature

def gen_paper_content(
    InFile_PaperTitleAbs_bySentence,
    OutFileName,
    word2idx,
    entity2idx,
    field=["Title"],
    doc_len=10,
): ...

def parse_entities(fieldOfStudy, entity2idx, cnt): ...

def convert2id(sentence, fieldOfStudy, word2idx, entity2idx): ...

def gen_knowledge_relations(
    InFile_RelatedFieldOfStudy, OutFile_dirname, entity2idx, relation2idx
): ...

def gen_indexed_sentence_collection(
    InFile_PaperTitleAbs_bySentence, OutFileName, word2idx
): ...

def gen_sentence_collection(
    InFile_PaperTitleAbs_bySentence, OutFileName, word2idx
): ...

def get_author_reference_list(
    author2paper_list, paper2reference_list, paper2date
): ...

def gen_experiment_splits(
    file_Author2ReferencePapers,
    OutFile_dir,
    InFile_paper_feature,
    tag,
    item_ratio=1.0,
    process_num=1,
): ...

def gen_paper_cocitation(InFile_PaperReference, norm=True): ...

def format_word_embeddings(word_vecfile, word2id_file, np_file): ...

def format_knowledge_embeddings(transE_vecfile, np_file): ...

def gen_context_embedding(entity_file, context_file, kg_file, dim): ...

def prepare_dataset(output_folder, input_folder, tag): ...

def group_labels(labels, preds, group_keys): ...

Import

from utils.task_helper import (
    gen_paper_content,
    gen_knowledge_relations,
    gen_experiment_splits,
    gen_paper_cocitation,
    format_word_embeddings,
    format_knowledge_embeddings,
    gen_context_embedding,
    prepare_dataset,
    group_labels,
)

I/O Contract

Inputs

Name | Type | Required | Description
InFile_PaperTitleAbs_bySentence | str | Yes | Path to tab-separated file containing paper ID, category, position, sentence text, and field-of-study annotations
OutFileName | str | Yes | Output file path for generated content features or sentence collections
word2idx | dict | Yes | Mutable dictionary mapping word strings to integer indices, updated in place during processing
entity2idx | dict | Yes | Mutable dictionary mapping entity IDs to integer indices, updated in place during processing
field | list of str | No | Which fields to include, e.g. ["Title"] or ["Title", "Abstract"] (default: ["Title"])
doc_len | int | No | Fixed document length for padding or truncating feature vectors (default: 10)
InFile_RelatedFieldOfStudy | str | Yes | Path to file containing related field-of-study triples for knowledge graph construction
relation2idx | dict | Yes | Mutable dictionary mapping relation names to integer indices
author2paper_list | dict | Yes | Dictionary mapping author IDs to their published paper lists
paper2reference_list | dict | Yes | Dictionary mapping paper IDs to their reference (cited paper) lists
paper2date | dict | Yes | Dictionary mapping paper IDs to publication dates
file_Author2ReferencePapers | str | Yes | Path to author-to-reference-papers file for generating experiment splits
InFile_paper_feature | str | Yes | Path to paper feature file used to filter items that have features
tag | str | Yes | Tag string used for naming output files (e.g., "citeulike")
item_ratio | float | No | Fraction of items to sample for experiments (default: 1.0)
process_num | int | No | Number of parallel processes for negative sampling (default: 1)
InFile_PaperReference | str | Yes | Path to paper reference file for co-citation computation
norm | bool | No | Whether to apply normalization to co-citation and co-reference scores (default: True)

Outputs

Name | Type | Description
gen_paper_content return | tuple(dict, dict) | Updated word2idx and entity2idx dictionaries after processing all papers
get_author_reference_list return | dict | Dictionary mapping author IDs to temporally sorted lists of (paper_id, publish_date, cited_date) tuples
gen_paper_cocitation return | tuple(dict, dict) | Pair of dictionaries: co-citation counts and co-reference counts between paper pairs, optionally normalized
group_labels return | tuple(list, list) | Lists of grouped labels and predictions, divided by group keys for evaluation
gen_experiment_splits side effects | files | Writes train, valid, test, user_history, and item2freq files to OutFile_dir
prepare_dataset side effects | files | Writes lightgcn_train and lightgcn_valid files to output_folder
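
The grouping behind the group_labels output can be illustrated with a short sketch. The name group_by_key and the first-seen ordering of groups are assumptions of this example and may not match the module's own ordering.

```python
def group_by_key(labels, preds, group_keys):
    """Split parallel label and prediction lists into per-key sublists."""
    grouped = {}
    for label, pred, key in zip(labels, preds, group_keys):
        bucket = grouped.setdefault(key, ([], []))
        bucket[0].append(label)
        bucket[1].append(pred)
    # Assumption: groups are returned in first-seen key order.
    all_labels = [bucket[0] for bucket in grouped.values()]
    all_preds = [bucket[1] for bucket in grouped.values()]
    return all_labels, all_preds


labels = [1, 0, 0, 1, 0]
preds = [0.9, 0.2, 0.4, 0.7, 0.1]
users = ["u1", "u1", "u2", "u2", "u2"]
grouped_labels, grouped_preds = group_by_key(labels, preds, users)
print(grouped_labels)
print(grouped_preds)
```

Grouping by user in this way is what allows ranking metrics to be computed per user and then averaged.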

Usage Examples

Paper Content Generation

from utils.task_helper import gen_paper_content

word2idx = {}
entity2idx = {}

# Generate paper content features from title and abstract sentences
word2idx, entity2idx = gen_paper_content(
    InFile_PaperTitleAbs_bySentence="data/PaperTitleAbs_bySentence.txt",
    OutFileName="data/paper_content.txt",
    word2idx=word2idx,
    entity2idx=entity2idx,
    field=["Title", "Abstract"],
    doc_len=20,
)

Knowledge Graph Construction

from utils.task_helper import gen_knowledge_relations

entity2idx = {}
relation2idx = {}

gen_knowledge_relations(
    InFile_RelatedFieldOfStudy="data/RelatedFieldOfStudy.txt",
    OutFile_dirname="data/knowledge/",
    entity2idx=entity2idx,
    relation2idx=relation2idx,
)
# Produces: train2id.txt, entity2id.txt, relation2id.txt in data/knowledge/

Experiment Splits with Negative Sampling

from utils.task_helper import gen_experiment_splits

gen_experiment_splits(
    file_Author2ReferencePapers="data/Author2ReferencePapers.txt",
    OutFile_dir="data/splits/",
    InFile_paper_feature="data/paper_content.txt",
    tag="citeulike",
    item_ratio=1.0,
    process_num=4,
)
# Produces: train_citeulike.txt, valid_citeulike.txt, test_citeulike.txt
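
For readers unfamiliar with negative sampling, the sketch below shows the general idea: for each positive (cited) item, draw items the user has not interacted with to serve as negatives. Uniform sampling, the fixed seed, and the helper name sample_negatives are assumptions of this example; the sampling strategy used by gen_experiment_splits is not described on this page.

```python
import random


def sample_negatives(positives, all_items, num_negatives, rng):
    """Draw up to `num_negatives` distinct items that are not in `positives`."""
    positive_set = set(positives)
    candidates = [item for item in all_items if item not in positive_set]
    return rng.sample(candidates, min(num_negatives, len(candidates)))


rng = random.Random(42)  # fixed seed so the sketch is repeatable
all_items = [f"p{i}" for i in range(10)]
cited = ["p1", "p4", "p7"]
negatives = sample_negatives(cited, all_items, num_negatives=4, rng=rng)
print(negatives)
```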

Embedding Preparation

from utils.task_helper import format_word_embeddings, gen_context_embedding

# Convert word embeddings to numpy format
format_word_embeddings(
    word_vecfile="data/word2vec.txt",
    word2id_file="data/word2idx.pkl",
    np_file="data/word_embeddings.npy",
)

# Generate context embeddings from knowledge graph neighbors
gen_context_embedding(
    entity_file="data/entity_embeddings.tsv",
    context_file="data/context_embeddings.tsv",
    kg_file="data/knowledge/train2id.txt",
    dim=100,
)

LightGCN Data Preparation

from utils.task_helper import prepare_dataset

prepare_dataset(
    output_folder="data/lightgcn/",
    input_folder="data/splits/",
    tag="citeulike",
)
# Produces: lightgcn_train_citeulike.txt, lightgcn_valid_citeulike.txt
