Task and Datasets

Dataset

The dataset includes user sessions extracted from Yandex logs, with queries, URL rankings and clicks. Unlike previous click datasets, it also includes relevance judgments for the ranked URLs, for the purpose of training relevance prediction models. To allay privacy concerns, the user data is fully anonymized: only meaningless numeric IDs of queries, sessions, and URLs are released. The queries are grouped only by sessions and no user IDs are provided. The dataset consists of several parts.

Noteworthy characteristics of the dataset:

  • Unique queries: 30,717,251
  • Unique URLs: 117,093,258
  • Sessions: 43,977,859
  • Total records in the log: 340,796,067
  • Assessed query-region-URL triples for the total query set (training + test): 71,930
  • Query-region pairs with assessed URLs (training + test): 8,410
  • The logs are about two years old and do not contain queries with commercial intent, as detected by a Yandex proprietary classifier.

    User log

    The log is a stream of user actions, with each line representing either a query or a click:

    Query action:

    SessionID TimePassed TypeOfAction QueryID RegionID ListOfURLs

    Click action:

    SessionID TimePassed TypeOfAction URLID

    SessionID is the unique identifier of a query session.

    TimePassed is the time elapsed since the start of the session identified by SessionID, measured in abstract units of time. We do not disclose how many milliseconds are in one unit of time.

    TypeOfAction is the type of the action. It’s either a query (Q) or a click (C).

    QueryID is the unique identifier of a query.

    RegionID is the unique identifier of the country the user is querying from. We included this identifier, because rankings and relevance labels might depend on the country of the user (e.g. for the query “Ministry of Foreign Affairs”). There are 4 possible identifiers (integers from 0 to 3).

    URLID is the unique identifier of a URL.

    ListOfURLs is the list of URLIDs, written from left to right in the order in which they were shown to the user from top to bottom.

    Example:

    10989856    0   Q   10364965 2  671723  21839763  3840421  180513  45660210  514963  41484044  3153206  1439919  4991367
    10989856    103 C   21839763
    10989856    955 Q   1009161  2  197515  197539  11  179526  5859272  1624306  1587784  1624296  5859294  2186374
    10989856    960 C   197515
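
    Below is a minimal parsing sketch in Python for streaming the log in this format. The helper name read_log and the whitespace delimiter are assumptions based on the example above, not part of any official dataset tooling.

    # Sketch: stream the user log and split it into query and click actions.
    # Field names follow the format description above; the whitespace
    # delimiter is an assumption based on the example.
    def read_log(path="Clicklog.txt"):
        """Yield ('Q', session_id, time, query_id, region_id, urls) for queries
        and ('C', session_id, time, url_id) for clicks."""
        with open(path) as f:
            for line in f:
                fields = line.split()
                session_id, time_passed, action = fields[0], int(fields[1]), fields[2]
                if action == "Q":
                    query_id, region_id = fields[3], fields[4]
                    urls = fields[5:]          # ranked from top to bottom
                    yield "Q", session_id, time_passed, query_id, region_id, urls
                elif action == "C":
                    yield "C", session_id, time_passed, fields[3]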
    

    Relevance labels

    Labels were assigned by Yandex judges to a subset of URLs appearing in the logs. Labels are binary: Relevant (1) and Irrelevant (0). Labels were assigned during the course of a year after the logs had been collected. URLs were judged based not only on the text of the query but also, where necessary, on the region of the user; this was not done in every case. So the presence of a RegionID does not necessarily mean that the relevance is region-specific; it is up to the participants to decide this based on the user logs.

    Each line in the shared text file containing labels has the following format:

    QueryID RegionID URLID RelevanceLabel

    QueryID is the unique identifier of a query.

    RegionID is the unique identifier of the country from which the user is assumed to be querying.

    URLID is the unique identifier of a URL.

    RelevanceLabel is the relevance label (0 or 1).

    Example:

    1209161 2 5839294  1
    1209161 2 1912415  1
    1209161 2 1621201  1
    1209161 2 1111     0
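
    A short loading sketch in Python, keyed by (QueryID, RegionID); the helper name read_labels is an assumption, and the whitespace delimiter is assumed from the example above.

    # Sketch: load relevance labels into a dict keyed by (QueryID, RegionID).
    # Trainq.txt is the label file shipped with the dataset archive (see below).
    from collections import defaultdict

    def read_labels(path="Trainq.txt"):
        labels = defaultdict(dict)     # (query_id, region_id) -> {url_id: 0 or 1}
        with open(path) as f:
            for line in f:
                query_id, region_id, url_id, label = line.split()
                labels[(query_id, region_id)][url_id] = int(label)
        return labels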
    

    Task

    The task is to predict labels of documents for the given test set of queries, using the shared dataset containing the search log and the queries with labeled URLs. The search log is intended to be used both for training the prediction models and for predicting the labels for the test queries.
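
    As an illustration only, and not a method prescribed by the organizers, one simple baseline is to rank, for every query-region pair, the URLs that were shown for it in the log by how often they were clicked after that query. The sketch below assumes the read_log helper from the user-log section; attributing each click to the most recent query in the same session is also an assumption.

    # Illustrative baseline (an assumption, not part of the challenge rules):
    # rank the URLs shown for each (QueryID, RegionID) pair by click count.
    from collections import defaultdict

    def click_count_baseline(log_path="Clicklog.txt"):
        shown = defaultdict(set)    # (query_id, region_id) -> URLs ever shown
        clicks = defaultdict(int)   # (query_id, region_id, url_id) -> clicks
        last_query = {}             # session_id -> most recent (query_id, region_id)
        for record in read_log(log_path):
            if record[0] == "Q":
                _, session_id, _, query_id, region_id, urls = record
                last_query[session_id] = (query_id, region_id)
                shown[(query_id, region_id)].update(urls)
            else:
                _, session_id, _, url_id = record
                if session_id in last_query:
                    q, r = last_query[session_id]
                    clicks[(q, r, url_id)] += 1
        return {key: sorted(urls, key=lambda u: clicks[(key[0], key[1], u)], reverse=True)
                for key, urls in shown.items()}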

    Measure

    Submissions will be evaluated using the AUC [1] (Area Under the Curve) measure, which will be calculated from the ranking of URLIDs provided by participants for each query and then averaged over queries. Only judged documents will be considered (all unjudged documents will be ignored during the evaluation). AUC is a popular measure for evaluating predictors and represents the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one.
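
    The sketch below illustrates the per-query AUC on a ranked list of judged URLs; it is not the official scorer. It counts the fraction of (relevant, irrelevant) pairs that are ranked in the correct order and then averages over queries.

    # Illustrative AUC computation (not the official evaluation code).
    # ranking: URLIDs best-first; labels: {url_id: 0 or 1} for judged URLs.
    def query_auc(ranking, labels):
        judged = [labels[u] for u in ranking if u in labels]
        n_pos = sum(judged)
        n_neg = len(judged) - n_pos
        if n_pos == 0 or n_neg == 0:
            return None                 # undefined when only one class is judged
        correct, negatives_below = 0, 0
        for label in reversed(judged):  # walk from the worst rank to the best
            if label == 0:
                negatives_below += 1
            else:                       # every negative seen so far ranks lower
                correct += negatives_below
        return correct / (n_pos * n_neg)

    def mean_auc(rankings, labels_by_pair):
        scores = [query_auc(rankings[k], labels_by_pair[k])
                  for k in rankings if k in labels_by_pair]
        scores = [s for s in scores if s is not None]
        return sum(scores) / len(scores)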

    Submissions

    Participants must provide a description (from 150 to 700 characters) of the automated method used to prepare every submission; the description has to be uploaded together with the submission. Organizers reserve the right to cancel the registration of participants who do not provide descriptions for their submissions or provide clearly meaningless ones.

    The test set is split into two subsets. The first subset is used to rank participants on the Leaderboard and the second subset is used to determine the winners of the contest. That also means that the team rankings on the Leaderboard might differ from the final results calculated on the second subset. The split between subsets (including their sizes) is not disclosed. The winner candidates will be announced soon after the end of the Challenge based on the evaluation on the second subset.

    Since only the ranking of the predictions matters for calculating AUC, we ask participants to submit just a list of URLIDs ranked from left to right by their probability of being relevant, in descending order. Submissions should be ASCII text files containing the following tab-delimited line for each query from the test set:

    QueryID RegionID URLID URLID URLID URLID URLID URLID URLID URLID
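
    A writing sketch in Python. The rankings argument is assumed to map (QueryID, RegionID) pairs to URLIDs ranked best-first (for example, the output of the baseline sketch above), and test_pairs is assumed to be the list of (QueryID, RegionID) pairs read from Testq.txt, whose exact format is not described here.

    # Sketch: write one tab-delimited submission line per test (QueryID, RegionID) pair.
    def write_submission(rankings, test_pairs, path="submission.txt"):
        with open(path, "w") as out:
            for query_id, region_id in test_pairs:
                urls = rankings.get((query_id, region_id), [])
                out.write("\t".join([query_id, region_id] + list(urls)) + "\n")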
    

    Any URLID that appeared in the log at least once in response to a specific QueryID-RegionID pair can be included in the ranked list for that pair. All other URLIDs, which have never been returned by the search engine for this pair, will be ignored (they are not labeled anyway). Labeled URLIDs that are not included in the ranked list for a specific QueryID-RegionID pair will be automatically appended to the end of the provided ranking in the worst possible order (first irrelevant URLs, then relevant URLs).

    Participants are allowed to upload as many submissions as they want over the course of the Challenge, but no more than one submission every 2 hours. Only the last submission will be considered for evaluation; the live public rating is also based on it.


    Please note that the training set also contains queries with URL labels of only one kind (relevant or irrelevant) and queries with more than 300 unique URLs. However, we deliberately removed such queries from the test set.


    The archive with the dataset contains three files:

    Clicklog.txt – the click log,

    Trainq.txt – training queries with labels for judged URLs,

    Testq.txt – the list of test queries. Each submission should provide predictions (rankings) for every query from this list, and only for queries from this list.

    Prizes and Winners

    At the end of the competition, the last entries from all participants will be ranked in decreasing order of their respective scores (defined above). The top three competitors will receive the following cash prizes:

    1st place: $5,000

    2nd place: $3,000

    3rd place: $1,000

    In the case where two or more submissions achieve the same score, the result received first will prevail.

    As a condition for receiving any prize, the prospective winning teams are required to submit, by 20 January 2012, a manuscript describing the winning algorithm and the methods used to generate the output. The manuscript should adhere to all guidelines stated in the contest rules. In addition, it should be written as a high-quality ACM workshop paper in the 8-page, double-column ACM SIG proceedings format (Tighter Alternate Style). If more space is needed, authors can use appendices.

    We also expect the winners to present a brief talk describing their winning method at the upcoming WSCD workshop on 12 February 2012 in Seattle, USA. Other leading teams might also be invited to present at the workshop, based on the description of their results; in that case, they will be notified shortly after the end of the Challenge. All workshop participants will be responsible for their own transportation, lodging and workshop registration, as these costs are not included in the prize packages.



    [1] Charles X. Ling, Jin Huang, and Harry Zhang. AUC: a Statistically Consistent and more Discriminating Measure than Accuracy. 2001.