Task and Datasets
The dataset includes user sessions extracted from Yandex logs, with queries, URL rankings and clicks. Unlike previous click datasets, it also includes relevance judgments for the ranked URLs, for the purpose of training relevance prediction models. To allay privacy concerns, the user data is fully anonymized: only meaningless numeric IDs of queries, sessions, and URLs are released. The queries are grouped only by session, and no user IDs are provided. The dataset consists of several parts.
Noteworthy characteristics of the dataset:
The logs are about two years old and do not contain queries with commercial intent, as detected by a Yandex proprietary classifier.
The log represents a stream of user actions, with each line representing either a query or a click:
SessionID TimePassed TypeOfAction QueryID RegionID ListOfURLs
SessionID TimePassed TypeOfAction URLID
SessionID is the unique identifier of a query session.
TimePassed is the time elapsed since the start of the session identified by SessionID, in abstract units of time. We do not disclose how many milliseconds are in one unit.
TypeOfAction is the type of the action. It’s either a query (Q) or a click (C).
QueryID is the unique identifier of a query.
RegionID is the unique identifier of the country the user is querying from. We included this identifier, because rankings and relevance labels might depend on the country of the user (e.g. for the query “Ministry of Foreign Affairs”). There are 4 possible identifiers (integers from 0 to 3).
URLID is the unique identifier of a URL.
ListOfURLs is the list of URLIDs ordered from left to right as they were shown to the user from the top to the bottom.
10989856 0 Q 10364965 2 671723 21839763 3840421 180513 45660210 514963 41484044 3153206 1439919 4991367
10989856 103 C 21839763
10989856 955 Q 1009161 2 197515 197539 11 179526 5859272 1624306 1587784 1624296 5859294 2186374
10989856 960 C 197515
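The two record layouts above can be parsed with a short helper. This is an illustrative sketch of our own (the field names follow the format descriptions, but the function itself is not part of the challenge tools):

```python
def parse_log_line(line):
    """Parse one log line into a dict keyed by the fields described above."""
    parts = line.split()
    if parts[2] == "Q":  # query action: SessionID TimePassed Q QueryID RegionID URLs...
        return {
            "session_id": int(parts[0]),
            "time_passed": int(parts[1]),
            "action": "Q",
            "query_id": int(parts[3]),
            "region_id": int(parts[4]),
            "urls": [int(u) for u in parts[5:]],  # top-to-bottom ranking
        }
    # click action: SessionID TimePassed C URLID
    return {
        "session_id": int(parts[0]),
        "time_passed": int(parts[1]),
        "action": "C",
        "url_id": int(parts[3]),
    }
```

Streaming the log file through this parser, line by line, reconstructs each session as an ordered sequence of queries and clicks.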
Labels were assigned by Yandex judges to a subset of the URLs appearing in the logs. Labels are binary: Relevant (1) or Irrelevant (0). Labels were assigned during the course of the year after the logs had been collected. URLs were judged based on the text of the query and, where necessary, on the region of the user; however, the region was not taken into account in every case. So, the presence of a RegionID does not necessarily mean that the relevance is region-specific; it is up to participants to decide this based on the user logs.
Each line of the shared text file containing labels has the following format:
QueryID RegionID URLID RelevanceLabel
QueryID is the unique identifier of a query.
RegionID is the unique identifier of the country the user is assumed to be querying from.
URLID is the unique identifier of a URL.
RelevanceLabel is the relevance label (0 or 1).
1209161 2 5839294 1
1209161 2 1912415 1
1209161 2 1621201 1
1209161 2 1111 0
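Label lines can be loaded in the same spirit as the log lines. A minimal sketch (the function name and the key/value return shape are our own assumptions, chosen so that labels can be looked up by the QueryID-RegionID-URLID triple):

```python
def parse_label_line(line):
    """Return ((QueryID, RegionID, URLID), RelevanceLabel) for one label line."""
    query_id, region_id, url_id, label = (int(x) for x in line.split())
    return (query_id, region_id, url_id), label
```

Feeding every line of the label file through this function and collecting the pairs into a dict gives a lookup table from judged (query, region, URL) triples to their binary labels.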
The task is to predict labels of documents for the given test set of queries, using the shared dataset containing the search log and queries with labeled URLs. The search log is supposed to be used both for training the prediction models and for prediction of the labels for the test set of queries.
Submissions will be evaluated using the AUC (Area Under the ROC Curve) measure, which will be calculated from the ranking of URLIDs provided by participants for each query and then averaged over queries. Only judged documents will be considered (all unjudged documents will be ignored during evaluation). AUC is a popular measure for evaluating predictors and represents the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one.
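For a fixed ranked list of judged documents, per-query AUC reduces to counting the fraction of (positive, negative) pairs in which the positive document is ranked above the negative one. A minimal sketch under that reading of the metric (the function names are ours, not the official evaluator):

```python
def ranking_auc(labels):
    """AUC of a ranked list of binary labels (best-ranked document first).

    Equals the fraction of (positive, negative) pairs where the positive
    precedes the negative. Returns None if only one class is present,
    since AUC is undefined in that case.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        return None
    concordant = 0
    pos_seen = 0
    for lab in labels:
        if lab == 1:
            pos_seen += 1
        else:
            # every positive ranked before this negative is a concordant pair
            concordant += pos_seen
    return concordant / (pos * neg)


def mean_auc(per_query_labels):
    """Average AUC over queries, skipping queries where AUC is undefined."""
    scores = [ranking_auc(ls) for ls in per_query_labels]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)
```

For example, the ranking [1, 1, 0, 0] places both relevant documents above both irrelevant ones and scores 1.0, while [1, 0, 1, 0] scores 0.75 (three of four pairs concordant).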
It is important that participants provide the description (from 150 to 700 characters) of the automated method they used for preparing every submission, which has to be uploaded with the submission. Organizers reserve the right to cancel the registration of the participants who do not provide or provide clearly meaningless descriptions for their submissions.
The test set is split into two subsets. The first subset is used to rank participants on the Leaderboard and the second subset is used to determine the winners of the contest. That also means that the team rankings on the Leaderboard might differ from the final results calculated on the second subset. The split between subsets (including their sizes) is not disclosed. The winner candidates will be announced soon after the end of the Challenge based on the evaluation on the second subset.
Since only the ranking of predictions matters for calculating AUC, we ask participants to submit just a list of URLIDs ranked in descending order of their probability of being relevant. Submissions should be ASCII text files containing one tab-delimited line of the following form for each query from the test set.
QueryID RegionID URLID URLID URLID URLID URLID URLID URLID URLID
Any URLID that appeared in the log at least once in response to a specific QueryID-RegionID pair can be included in the ranked list for that pair. All other URLIDs, which have never been returned by the search engine for this pair, will be ignored (they are not labeled anyway). Labeled URLIDs that are not included in the ranked list for a specific QueryID-RegionID pair will be automatically appended to the end of the provided ranking in the worst possible order (irrelevant URLs first, then relevant URLs).
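One way to produce a submission line, assuming per-URL relevance probabilities are already available for a QueryID-RegionID pair (the helper name and the scored-pairs input format are our own assumptions):

```python
def submission_line(query_id, region_id, scored_urls):
    """Format one tab-delimited submission line.

    scored_urls is a list of (url_id, probability_relevant) pairs; URLs are
    ranked in descending order of probability, as the rules require.
    """
    ranked = sorted(scored_urls, key=lambda pair: pair[1], reverse=True)
    fields = [str(query_id), str(region_id)] + [str(url) for url, _ in ranked]
    return "\t".join(fields)
```

Writing one such line per test query (and only for test queries) yields a file in the required submission format.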
Participants are allowed to upload as many submissions as they want over the course of the Challenge, but not more than one submission every 2 hours. Only the last submission will be considered for evaluation; the live public rating is also based on it.
Please note that the training set also contains queries whose URL labels are all of one kind (all relevant or all irrelevant), as well as queries with more than 300 unique URLs. We deliberately removed such queries from the test set.
The archive with the dataset contains three files:
Clicklog.txt – the click log,
Trainq.txt – training queries with labels for judged URLs,
Testq.txt – the list of test queries. Each submission should provide predictions (rankings) for each query from this list, and only for queries from this list.
Prizes and Winners
At the end of the competition, the last entries from all participants will be ranked in decreasing order of their respective scores (defined above). The top three competitors will receive the following cash prizes:
1st place: $5,000
2nd place: $3,000
3rd place: $1,000
In the case where two or more submissions achieve the same score, the result received first will prevail.
As a condition for receiving any prize, the prospective winning teams are required to submit, by 20 January 2012, a manuscript describing the winning algorithm and the methods used to generate the output. The manuscript should adhere to all guidelines stated in the contest rules. In addition, it should be written as a high-quality ACM workshop paper, within an 8-page double-column ACM SIG proceedings format (Tighter Alternate Style). If more space is needed, authors may use appendices.
We also expect winners to present a brief talk describing their winning method at the upcoming WSCD workshop on 12 February 2012 in Seattle, USA. Other leading teams might also be invited to present at the workshop, based on the description of their results; in that case, they will be notified shortly after the end of the Challenge. All workshop participants will be responsible for their own transportation, lodging and workshop registration, as these costs are not included in the prize packages.