Module:Citation/CS1/Identifiers/sandbox: Difference between revisions

Revision as of 16:51, 29 July 2021 by Trappist the monk (no edit summary)
Latest revision as of 18:23, 1 August 2025 by Trappist the monk (no edit summary)
(53 intermediate revisions by 8 users not shown)
Line 1:
--[[
History of changes since last sync: ~~2021-04-10~~2025-04-12

~~2021-05-21: add support for |ssrn-access=; see Help_talk:Citation_Style_1#ssrn~~
2025-06-07: maint cat for post 2007 arxiv format without |class=; see Help_talk:Citation_Style_1#Category%3ACS1_maint%3A_missing_class_%3F
~~2021-07-28: reworked error messaging; see Help_talk:Citation_Style_1#error_messaging~~
2025-07-13: tweak wikidata identifier article name fetch; see Help_talk:Citation_Style_1#identifier_label_links
2025-08-01: fix 8-digit only medRxiv test; see Help_talk:Citation_Style_1#medRxiv_error_detection_flaw
]]

Line 34 ⟶ 36:
as an aid to internationalizing identifier-label wikilinks, gets identifier article names from Wikidata.

returns w:<lang code>:<article title> when <q> has an <article title> for <lang code>; nil else. 'w:<lang code>' ensures that a sister project (like wiktionary) will link to the <lang code>.wikipedia article.

for identifiers that do not have <q>, returns nil

for wikis that do not have mw.wikibase installed, returns nil

Line 53 ⟶ 56:
	if wd_article then
		wd_article = table.concat ({'w:', this_wiki_code, ':', wd_article});	-- interwiki-style link without brackets if taken from WD; leading ~~colon~~'w:' required
	end

Line 60 ⟶ 63:
--[[--------------------------< ~~L I N K _ L A B E L~~ L A B E L _ L I N K _ M A K E >------------------------------------------------

common function to create ~~identifier~~a link for an identifier label from handler table or from Wikidata

returns the first available of:
	1. redirect from local wiki's handler table (if enabled)
	2. Wikidata sitelink to the local language wikipedia article (if there is a Wikidata entry for this identifier in the local ~~wiki's~~ language)
	3. ~~label~~link to wikipedia article specified in the local wiki's handler table

]]

local function ~~link_label_make~~label_link_make (handler)
	local wd_article;

Line 108 ⟶ 111:
	return table.concat ({
		make_wikilink (~~link_label_make~~label_link_make (options), options.label),	-- redirect, Wikidata link, or locally specified link (in that order)
		options.separator or ' ',
		ext_link

Line 130 ⟶ 133:
	return table.concat (
		{
		make_wikilink (~~link_label_make~~label_link_make (options), options.label),	-- wiki-link the identifier label
		options.separator or ' ',	-- add the separator
		make_wikilink (

Line 173 ⟶ 176:
--[=[-------------------------< I S _ V A L I D _ ~~B I O~~ R X I V _ D A T E >------------------------------------------

for biorxiv, returns true if: 2019-12-11T00:00Z <= biorxiv_date < today + 2 days
for medrxiv, returns true if: 2020-01-01T00:00Z <= medrxiv_date < today + 2 days

The dated form of biorxiv identifier has a start date of 2019-12-11. The Unix timestamp for that date is {{#time:U|2019-12-11}} = 1576022400
The medrxiv identifier has a start date of 2020-01-01. The Unix timestamp for that date is {{#time:U|2020-01-01}} = 1577836800

~~biorxiv_date~~<rxiv_date> is the date provided in those |biorxiv= parameter values that are dated and in |medrxiv= parameter values at time 00:00:00 UTC
<today> is the current date at time 00:00:00 UTC plus 48 hours
	if today's date is ~~2015~~2023-01-01T00:00:00 then
		adding 24 hours gives ~~2015~~2023-01-02T00:00:00 – one second more than today
		adding 48 hours gives ~~2015~~2023-01-03T00:00:00 – one second more than tomorrow

~~This function does not work if it is fed month names for languages other than English. Wikimedia #time: parser~~
~~apparently doesn't understand non-English date month names. This function will always return false when the date~~
~~contains a non-English month name because good1 is false after the call to lang.formatDate(). To get around that~~
~~call this function with YYYY-MM-DD format dates.~~
inputs:
	<y>, <m>, <d> – year, month, day parts of the date from the biorxiv or medrxiv identifier
	<select> 'b' for biorxiv, 'm' for medrxiv; defaults to 'b'

]=]

local function ~~is_valid_biorxiv_date~~is_valid_rxiv_date (~~biorxiv_date~~y, m, d, select)
	if 0 == tonumber (m) or 12 < tonumber (m) then	-- <m> must be a number 1–12
		return false;
	end
	if 0 == tonumber (d) or 31 < tonumber (d) then	-- <d> must be a number 1–31; TODO: account for month length and leap year?
		return false;
	end

	local rxiv_date = table.concat ({y, m, d}, '-');	-- make ymd date string
	local good1, good2;
	local ~~biorxiv_ts~~rxiv_ts, tomorrow_ts;	-- to hold Unix timestamps representing the dates
	local lang_object = mw.getContentLanguage();

	good1, ~~biorxiv_ts~~rxiv_ts = pcall (lang_object.formatDate, lang_object, 'U', ~~biorxiv_date~~rxiv_date);	-- convert ~~biorxiv_date~~rxiv_date value to Unix timestamp
	good2, tomorrow_ts = pcall (lang_object.formatDate, lang_object, 'U', 'today + 2 days' );	-- today midnight + 2 days is one second more than all day tomorrow

	if good1 and good2 then	-- lang.formatDate() returns a timestamp in the local script which tonumber() may not understand
		~~biorxiv_ts~~rxiv_ts = tonumber (~~biorxiv_ts~~rxiv_ts) or lang_object:parseFormattedNumber (~~biorxiv_ts~~rxiv_ts);	-- convert to numbers for the comparison;
		tomorrow_ts = tonumber (tomorrow_ts) or lang_object:parseFormattedNumber (tomorrow_ts);
	else

Line 208 ⟶ 221:
	end

	local limit_ts = ((select and ('m' == select)) and 1577836800) or 1576022400;	-- choose the appropriate limit timestamp

	~~return ((1576022400 <= biorxiv_ts) and (biorxiv_ts < tomorrow_ts))	-- 2012-12-11T00:00Z <= biorxiv_date < tomorrow's date~~
	return ((limit_ts <= rxiv_ts) and (rxiv_ts < tomorrow_ts))	-- limit_ts <= rxiv_date < tomorrow's date
end

Line 258 ⟶ 273:
--[[--------------------------< N O R M A L I Z E _ L C C N >--------------------------------------------------

LCCN normalization (~~http~~https://www.loc.gov/marc/lccn-namespace.html#normalization)
1. Remove all blanks.
2. If there is a forward slash (/) in the string, remove it, and remove all characters to the right of the forward slash.

Line 268 ⟶ 283:
Returns a normalized LCCN for lccn() to validate. There is no error checking (step 3.b.1) performed in this function.

]]

Line 287 ⟶ 303:
	return lccn;
end

Line 294 ⟶ 310:
--[[--------------------------< A R X I V >--------------------------------------------------------------------

See: ~~http~~https://arxiv.org/help/arxiv_identifier

format and error check arXiv identifier. There are three valid forms of the identifier:

Line 319 ⟶ 335:
	<date code> and <version> are as defined for 0704-1412
	<number> is a five-digit number

]]

Line 373 ⟶ 390:
	if is_set (class) then
		if id:match ('^%d+') then
			text = table.concat ({text, ' [[https://arxiv.org/archive/', class, ' ', class, ']]'});	-- external link within square brackets, not wikilink
		else
			set_message ('err_class_ignored');
		end
	else	-- class not set
		if id:match ('^%d+') and options.CitationClass == 'arxiv' then	-- new (post 2007) format; {{cite arxiv}} only
			set_message ('maint_missing_class');	-- add maint cat
		end
	end

Line 387 ⟶ 408:
Validates (sort of) and formats a bibcode ID.

Format for bibcodes is specified here: ~~http~~https://adsabs.harvard.edu/abs_doc/help_pages/data.html#bibcodes

But, this: 2015arXiv151206696F is apparently valid so apparently, the only things that really matter are length, 19 characters

Line 405 ⟶ 426:
	local access = options.access;
	local handler = options.handler;
	local ignore_invalid = options.accept;
	local err_type;
	local err_msg = '';

Line 427 ⟶ 449:
			if id:find('&%.') then
				err_type = cfg.err_msg_supl.journal;	-- journal abbreviation must not have '&.' (if it does it's missing a letter)
			end
			if id:match ('.........%.tmp%.') then	-- temporary bibcodes when positions 10–14 are '.tmp.'
				set_message ('maint_bibcode');
			end
		end
	end

	if is_set (err_type) and not ignore_invalid then	-- if there was an error detected and accept-as-written markup not used
		set_message ('err_bad_bibcode', {err_type});
		options.coins_list_t['BIBCODE'] = nil;	-- when error, unset so not included in COinS
	end

Line 462 ⟶ 486:
	local patterns = {
		'^10%.1101/%d%d%d%d%d%d$',	-- simple 6-digit identifier (before 2019-12-11)
		'^10%.1101/(20~~[1-9]~~%d%d)%.(~~[01]~~%d%d)%.(~~[0-3]~~%d%d)%.%d%d%d%d%d%dv%d+$',	-- y.m.d. date + 6-digit identifier + version (after 2019-12-11)
		'^10%.1101/(20~~[1-9]~~%d%d)%.(~~[01]~~%d%d)%.(~~[0-3]~~%d%d)%.%d%d%d%d%d%d$',	-- y.m.d. date + 6-digit identifier (after 2019-12-11)
		}

Line 472 ⟶ 496:
			if m then	-- m is nil when id is the six-digit form
				~~if not is_valid_biorxiv_date (y .. '-' .. m .. '-' .. d) then	-- validate the encoded date; TODO: don't ignore leap-year and actual month lengths ({{#time:}} is a poor date validator)~~
				if not is_valid_rxiv_date (y, m, d, 'b') then	-- validate the encoded date; 'b' for biorxiv limit
					break;	-- date fail; break out early so we don't unset the error message
				end

Line 479 ⟶ 503:
			break;	-- and done
		end
	end	-- err_cat remains set here when no match

	if err_msg then

Line 497 ⟶ 521:
The description of the structure of this identifier can be found at Help_talk:Citation_Style_1/Archive_26#CiteSeerX_id_structure

]]

Line 532 ⟶ 557:
and terminal punctuation may not be technically correct but it appears that in practice these characters are rarely if ever used in DOI names.
https://www.doi.org/doi_handbook/2_Numbering.html	-- 2.2 Syntax of a DOI name
https://www.doi.org/doi_handbook/2_Numbering.html#2.2.2	-- 2.2.2 DOI prefix

]]

Line 542 ⟶ 570:
	local handler = options.handler;
	local err_flag;

	local function is_extended_free (registrant, suffix)	-- local function to check those few registrants that are mixed; identifiable by the doi suffix <incipit>
		if cfg.extended_registrants_t[registrant] then	-- if this registrant has known free-to-read extensions
			for _, incipit in ipairs (cfg.extended_registrants_t[registrant]) do	-- loop through the registrant's incipits
				if mw.ustring.find (suffix, '^' .. incipit) then	-- if found
					return true;
				end
			end
		end
	end

	local text;
	if is_set (inactive) then
		local inactive_year = inactive:match("%d%d%d%d") ~~or ''~~;	-- try to get the year portion from the inactive date
		local inactive_month, good;

Line 556 ⟶ 594:
				end
			end
		end	-- otherwise, |doi-broken-date= has something but it isn't a date
		~~else~~
		~~inactive_year = nil;	-- |doi-broken-date= has something but it isn't a date~~
		~~end~~

		if is_set (inactive_year) and is_set (inactive_month) then

Line 570 ⟶ 606:
	end

	local suffix;
	~~local registrant = id:match ('^10%.([^/]+)/[^%s–]-[^%.,]$');	-- registrant set when DOI has the proper basic form~~
	local registrant, suffix = mw.ustring.match (id, '^10%.([^/]+)/([^%s–]-[^%.,])$');	-- registrant and suffix set when DOI has the proper basic form

	local registrant_err_patterns = {	-- these patterns are for code ranges that are not supported
		'^[^1-3]%d%d%d%d%.%d%d+$',	-- 5 digits with subcode (0xxxx, 40000+); accepts: 10000–39999
		'^[^1-57]%d%d%d%d$',	-- 5 digits without subcode (0xxxx, 60000+); accepts: ~~10000–59999~~10000–69999
		'^[^1-9]%d%d%d%.%d%d+$',	-- 4 digits with subcode (0xxx); accepts: 1000–9999
		'^[^1-9]%d%d%d$',	-- 4 digits without subcode (0xxx); accepts: 1000–9999
		'^%d%d%d%d%d%d+',	-- 6 or more digits
		'^%d%d?%d?$',	-- less than 4 digits without subcode (3 digits with subcode is legitimate)
		'^%d%d?%.[%d%.]+',	-- 1 or 2 digits with subcode
		'^5555$',	-- test registrant will never resolve
		'[^%d%.]',	-- any character that isn't a digit or a dot

Line 600 ⟶ 638:
	if err_flag then
		options.coins_list_t['DOI'] = nil;	-- when error, unset so not included in COinS
	else
		if not access and (cfg.known_free_doi_registrants_t[registrant] or is_extended_free (registrant, suffix)) then	-- |doi-access=free not set and <registrant> is known to be free
			set_message ('maint_doi_unflagged_free');	-- set a maint cat
		end
	end

Line 626 ⟶ 668:
if ever used in HDLs.

Query string parameters are named here: ~~http~~https://www.handle.net/proxy_servlet.html. query strings are not displayed but since '?' is an allowed character in an HDL, '?' followed by one of the query parameters is the only way we have to detect the query string so that it isn't URL-encoded with the rest of the identifier.

Line 636 ⟶ 678:
	local access = options.access;
	local handler = options.handler;
	local query_params = {	-- list of known query parameters from ~~http~~https://www.handle.net/proxy_servlet.html
		'noredirect',
		'ignore_aliases',

Line 684 ⟶ 726:
]]

local function isbn (~~options~~options_t)
	local isbn_str = ~~options~~options_t.id;
	local ignore_invalid = ~~options~~options_t.accept;
	local handler = ~~options~~options_t.handler;
	local year = options_t.Year;	-- when set, valid anchor_year; may have a disambiguator which must be removed

	local function return_result (check, err_type)	-- local function to handle the various returns

Line 696 ⟶ 739:
		else	-- here when not ignoring
			if not check then	-- and there is an error
				~~options~~options_t.coins_list_t['ISBN'] = nil;	-- when error, unset so not included in COinS
				set_message ('err_bad_isbn', err_type);	-- set an error message
				return ISBN;	-- return id text

Line 702 ⟶ 745:
		end
		return ISBN;	-- return id text
	end

	if year and not ignore_invalid then	--
		year = year:match ('%d%d%d%d?');	-- strip disambiguator if present
		if 1965 > tonumber(year) then
			set_message ('err_invalid_isbn_date');	-- set an error message
			return internal_link_id ({link = handler.link, label = handler.label, redirect = handler.redirect, prefix = handler.prefix, id = isbn_str, separator = handler.separator});
		end
	end

Line 805 ⟶ 857:
Determines whether an ISMN string is valid. Similar to ISBN-13, ISMN is 13 digits beginning 979-0-... and uses the same check digit calculations. See ~~http~~https://www.ismn-international.org/download/Web_ISMN_Users_Manual_2008-6.pdf section 2, pages 9–12.

Line 834 ⟶ 886:
	text = table.concat (	-- because no place to link to yet
		{
		make_wikilink (~~link_label_make~~label_link_make (handler), handler.label),
		handler.separator,
		id_copy

Line 854 ⟶ 906:
like this:
	|issn=0819 4327 gives: [~~http~~https://www.worldcat.org/issn/0819 4327 0819 4327] -- can't have spaces in an external link

This code now prevents that by inserting a hyphen at the ISSN midpoint. It also validates the ISSN for length

Line 958 ⟶ 1,010:
Format LCCN link and do simple error checking. LCCN is a character string 8-12 characters long. The length of the LCCN dictates the character type of the first 1-3 characters; the rightmost eight are always digits.
https://oclc-research.github.io/infoURI-Frozen/info-uri.info/info:lccn/reg.html
~~http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:lccn/~~

length = 8 then all digits

Line 1,013 ⟶ 1,065:
	return external_link_id ({link = handler.link, label = handler.label, q = handler.q, redirect = handler.redirect, prefix = handler.prefix, id = lccn, separator = handler.separator, encode = handler.encode});
end

--[[--------------------------< M E D R X I V >-----------------------------------------------------------------

Format medRxiv ID and do simple error checking. Similar to later bioRxiv IDs, medRxiv IDs are prefixed with a yyyy.mm.dd. date and suffixed with an optional version identifier.
Earliest date accepted is 2020.01.01

The medRxiv ID is a date followed by an eight-digit number followed by an optional version indicator 'v' and one or more digits:
	https://www.medrxiv.org/content/10.1101/2020.11.16.20232009v2 -> 10.1101/2020.11.16.20232009v2

]]

local function medrxiv (options)
	local id = options.id;
	local handler = options.handler;
	local err_msg_flag = true;	-- flag; assume that there will be an error

	local patterns = {
		'^%d%d%d%d%d%d%d%d$',	-- simple 8-digit identifier; these should be relatively rare
		'^10%.1101/(20%d%d)%.(%d%d)%.(%d%d)%.%d%d%d%d%d%d%d%dv%d+$',	-- y.m.d. date + 8-digit identifier + version (2020-01-01 and later)
		'^10%.1101/(20%d%d)%.(%d%d)%.(%d%d)%.%d%d%d%d%d%d%d%d$',	-- y.m.d. date + 8-digit identifier (2020-01-01 and later)
		}

	for _, pattern in ipairs (patterns) do	-- spin through the patterns looking for a match
		if id:match (pattern) then
			local y, m, d = id:match (pattern);	-- found a match, attempt to get year, month and date from the identifier

			if m then	-- m is nil when id is the 8-digit form
				if not is_valid_rxiv_date (y, m, d, 'm') then	-- validate the encoded date; 'm' for medrxiv limit
					break;	-- date fail; break out early so we don't unset the error message
				end
			end
			err_msg_flag = nil;	-- we found a match so unset the error message
			break;	-- and done
		end
	end	-- <err_msg_flag> remains set here when no match

	if err_msg_flag then
		options.coins_list_t['MEDRXIV'] = nil;	-- when error, unset so not included in COinS
		set_message ('err_bad_medrxiv');	-- and set the error message
	end

	return external_link_id ({link = handler.link, label = handler.label, q = handler.q, redirect = handler.redirect, prefix = handler.prefix, id = id, separator = handler.separator, encode = handler.encode, access = handler.access});
end

Line 1,074 ⟶ 1,172:
	elseif id:match('^%d+$') then	-- no prefix
		number = id;	-- get the number
		~~if 10 < number:len() then~~
		if tonumber (id) > handler.id_limit then
			number = nil;	-- ~~constrain to 1 to 10 digits; change this when OCLC issues 11-digit numbers~~unset when id value exceeds the limit
		end
	end

Line 1,207 ⟶ 1,305:
	text = table.concat (	-- still embargoed so no external link
		{
		make_wikilink (~~link_label_make~~label_link_make (handler), handler.label),
		handler.separator,
		id,

Line 1,536 ⟶ 1,634:
	['JSTOR'] = jstor,
	['LCCN'] = lccn,
	['MEDRXIV'] = medrxiv,
	['MR'] = mr,
	['OCLC'] = oclc,

Line 1,559 ⟶ 1,658:
	options_t.handler = cfg.id_handlers[hkey];
	options_t.coins_list_t = ID_list_coins_t;	-- pointer to ID_list_coins_t; for |asin= and |ol=; also to keep erroneous values out of the citation's metadata
	options_t.coins_list_t[hkey] = v;	-- id value without accept-as-written markup for metadata

	if options_t.handler.access and not in_array (options_t.handler.access, cfg.keywords_lists['id-access']) then

Line 1,617 ⟶ 1,717:
]]

local function identifier_lists_get (~~args~~args_t, options_t, ID_support_t)
	local ID_list_coins_t = extract_ids (~~args~~args_t);	-- get a table of identifiers and their values for use locally and for use in COinS
	options_check (ID_list_coins_t, ID_support_t);	-- ID support parameters must have matching identifier parameters
	local ID_access_levels_t = extract_id_access_levels (~~args~~args_t, ID_list_coins_t);	-- get a table of identifier access levels
	local ID_list_t = build_id_list (ID_list_coins_t, options_t, ID_access_levels_t);	-- get a sequence table of rendered identifier strings

Line 1,654 ⟶ 1,754:
	auto_link_urls = auto_link_urls,	-- table of identifier URLs to be used when auto-linking |title=

	identifier_lists_get = identifier_lists_get,	-- experiment to replace individual calls to build_id_list(), extract_ids, extract_id_access_levels
	is_embargoed = is_embargoed;
	set_selected_modules = set_selected_modules;
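
The date-window rule documented for is_valid_rxiv_date (limit date <= identifier date < today + 2 days) and the three medRxiv identifier patterns can be tried outside MediaWiki with the sketch below. It is a minimal standalone illustration, not the module's code: `days_from_civil`, `is_valid_rxiv_date_sketch`, `medrxiv_id_ok`, and the explicit `now_ts` argument are hypothetical names introduced here, and plain calendar arithmetic stands in for `lang_object:formatDate()`. It assumes the medRxiv limit is the 2020-01-01 timestamp quoted in the module comment. Like the module, it does not check month lengths or leap years.

```lua
-- Limit timestamps quoted in the module comment (UTC midnight)
local LIMITS = {b = 1576022400, m = 1577836800}		-- 2019-12-11 and 2020-01-01

-- days since 1970-01-01 for a proleptic Gregorian date (hypothetical helper)
local function days_from_civil (y, m, d)
	if m <= 2 then y = y - 1 end						-- treat Jan/Feb as months of the previous year
	local era = math.floor (y / 400)
	local yoe = y - era * 400							-- year of era, 0..399
	local mp = (m + 9) % 12								-- March = 0 ... February = 11
	local doy = math.floor ((153 * mp + 2) / 5) + d - 1	-- day of (March-based) year
	local doe = yoe * 365 + math.floor (yoe / 4) - math.floor (yoe / 100) + doy
	return era * 146097 + doe - 719468
end

-- true when limit <= date < (today at 00:00 UTC) + 2 days; <which> is 'b' or 'm', default 'b'
local function is_valid_rxiv_date_sketch (y, m, d, which, now_ts)
	y, m, d = tonumber (y), tonumber (m), tonumber (d)
	if not (y and m and d) then return false end
	if m < 1 or m > 12 or d < 1 or d > 31 then return false end
	local rxiv_ts = days_from_civil (y, m, d) * 86400
	local today_ts = now_ts - (now_ts % 86400)			-- today at 00:00:00 UTC
	local limit_ts = LIMITS[which] or LIMITS.b
	return limit_ts <= rxiv_ts and rxiv_ts < today_ts + 2 * 86400
end

-- the three medRxiv patterns from the diff
local MEDRXIV_PATTERNS = {
	'^%d%d%d%d%d%d%d%d$',
	'^10%.1101/(20%d%d)%.(%d%d)%.(%d%d)%.%d%d%d%d%d%d%d%dv%d+$',
	'^10%.1101/(20%d%d)%.(%d%d)%.(%d%d)%.%d%d%d%d%d%d%d%d$',
}

-- true when <id> matches a pattern and, for dated forms, the date is inside the window
local function medrxiv_id_ok (id, now_ts)
	for _, pattern in ipairs (MEDRXIV_PATTERNS) do
		local y, m, d = id:match (pattern)				-- no captures: y is the whole match, m is nil
		if y then
			if not m then return true end				-- bare 8-digit form carries no date
			return is_valid_rxiv_date_sketch (y, m, d, 'm', now_ts)
		end
	end
	return false
end

local now = 1754006400 + 3600							-- 2025-08-01T01:00:00Z, fixed for repeatability
print (medrxiv_id_ok ('10.1101/2020.11.16.20232009v2', now))	-- true
print (medrxiv_id_ok ('10.1101/2019.12.15.19012345', now))		-- false: before 2020-01-01
```

Passing the current time in as `now_ts` (for example `os.time()`) keeps the check deterministic under test; the module itself asks MediaWiki for 'today + 2 days' instead.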