Workdocumentation 2021-03-30
Red Links and Data Fixes
Broken redirects
- Ticket 106: Red links, Natural Language IDs, ambiguous, wrong and N/A values for country/region/city (and other fields)
- The list of all broken redirects on OpenResearch can be found here: https://www.openresearch.org/mediawiki/index.php?title=Special:BrokenRedirects&limit=500&offset=0
Broken file links
- Ticket 106: Red links, Natural Language IDs, ambiguous, wrong and N/A values for country/region/city (and other fields)
- The list of all links that point to a file that does not exist can be found here: https://www.openresearch.org/wiki/Category:Pages_with_broken_file_links
Broken properties and errors
- The property Property:Has improper value for stores all invalid values; it can also be queried programmatically, as sketched below.
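A minimal sketch of such a query, assuming the wiki's Semantic MediaWiki ask API is enabled (the endpoint path and the limit are assumptions, not confirmed by this page):

import requests

# Hedged sketch: list pages carrying "Has improper value for" via the
# SMW ask API; the endpoint path and the limit are assumptions.
API = 'https://www.openresearch.org/mediawiki/api.php'
params = {
    'action': 'ask',
    'query': '[[Has improper value for::+]]|limit=500',
    'format': 'json',
}
results = requests.get(API, params=params).json()
for page in results.get('query', {}).get('results', {}):
    print(page)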
Ordinals
- Ticket 119: Text Values in Ordinal Field
- To fix the ordinals, the following approaches can be used:
- This approach finds all events with improper ordinals and fixes them in place:
wikiedit -t wikiId -q "[[Has improper value for::Ordinal]]" --search "(\|Ordinal=[0-9]+)(?:st|nd|rd|th)\b" --replace "\1"
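The same substitution can be tried locally against a wikibackup dump before editing the live wiki; a minimal Python sketch, with the backup path as a placeholder:

import glob
import re

# Hedged sketch: strip st/nd/rd/th suffixes from numeric Ordinal fields
# in local backup files; the backup path is a placeholder.
pattern = re.compile(r'(\|Ordinal=[0-9]+)(?:st|nd|rd|th)\b')
for path in glob.glob('/path/to/backup/*.wiki'):
    with open(path) as f:
        text = f.read()
    fixed = pattern.sub(r'\1', text)
    if fixed != text:
        with open(path, 'w') as f:
            f.write(fixed)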
- A code snippet, coupled with wikibackup and bash tools, can be used for targeted editing of specific pages: Code Snippet
- Pipeline usage:
grep Ordinal /path/to/backup -l -r | python ordinal_to_cardinal.py -stdin -d '../dictionary.yaml' -ro -f | wikirestore -t ormk -stdinp -ui
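The linked snippet is not reproduced on this page; as a rough sketch of the conversion step, assuming dictionary.yaml maps ordinal words to numbers (e.g. First: 1, Second: 2):

import re
import sys

import yaml  # PyYAML

# Hedged sketch: dictionary.yaml is assumed to map ordinal words to
# cardinals, e.g. {First: 1, Second: 2}; the real script is linked above.
with open('dictionary.yaml') as f:
    mapping = yaml.safe_load(f)

def ordinal_to_cardinal(line):
    # turn e.g. "|Ordinal=Fifth" into "|Ordinal=5" when the word is known
    m = re.match(r'^\|Ordinal=(.+?)\s*$', line)
    if m and m.group(1) in mapping:
        return '|Ordinal=%s\n' % mapping[m.group(1)]
    return line

for name in sys.stdin:  # file names piped in, e.g. from grep -l
    with open(name.strip()) as f:
        sys.stdout.write(''.join(ordinal_to_cardinal(l) for l in f))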
Improper Null values for Has person
- 150 Improper Null Value handling
- Has person was using "some person" as a null value. There was incorrect usage where the free text of an event would say "some person" while the Wikison format info already contained the actual person name.
- The first way of fixing this is to remove the free text altogether. A code snippet was used, coupled with the bash utility grep.
Code Snippet:
def remove_FreeText(path, restoreOut=False):
    '''
    Remove free text from a given wiki page, keeping only the
    {{...}} template (Wikison) blocks.
    Args:
        path(str): path to the wiki file
    '''
    # ntpath is imported at the top of scripts/Data_Fixes.py
    with open(path, 'r') as f:
        eventlist = f.readlines()
    i = 0
    newwiki = ''
    while i < len(eventlist):
        if "{{" in eventlist[i]:
            # copy the template block up to and including its closing braces
            while i < len(eventlist):
                newwiki += eventlist[i]
                if "}}" in eventlist[i]:
                    break
                i += 1
        i += 1
    if restoreOut:
        if newwiki == '':
            return
        # getHomePath and ensureDirectoryExists are helpers defined
        # elsewhere in scripts/Data_Fixes.py
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        print(fullpath)  # file name consumed by wikirestore via stdin
        ensureDirectoryExists(dirname)
        with open(fullpath, 'w') as f:
            f.write(newwiki)
Usage:
grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rmf | wikirestore -t ormk -stdinp -ui
Output result: ICFHR 2020
- The second way is to remove only the 'some person' entries from the wiki free text. A Python snippet is used with the bash utility grep.
Code Snippet:
def remove_DuplicateFields(path, restoreOut=False):
    '''
    Remove 'some person' null-value entries whose property also occurs
    elsewhere on the page with a real value.
    Args:
        path(str): path to the wiki file
    '''
    # re and ntpath are imported at the top of scripts/Data_Fixes.py
    with open(path, 'r') as f:
        event = f.read()
    flag = False
    for line in event.split('\n'):
        if 'some person' in line:
            # extract the property name from e.g. [[Has person::some person]]
            chk = re.findall(r'\[\[(.*?)::', line)
            if chk and event.count(chk[0]) > 1:
                flag = True
                event = event.replace(line, '')
    if restoreOut and flag:
        # getHomePath and ensureDirectoryExists are helpers defined
        # elsewhere in scripts/Data_Fixes.py
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        print(fullpath)  # file name consumed by wikirestore via stdin
        ensureDirectoryExists(dirname)
        with open(fullpath, 'w') as f:
            f.write(event)
Usage:
grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rdf | wikirestore -t ormk -stdinp -ui
Output result: ACCV_2020
Main Function for usage:
if __name__ == "__main__":
    # argparse and sys are imported at the top of scripts/Data_Fixes.py
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', dest="path", help='path to the file')
    parser.add_argument('-stdin', dest="pipein", action='store_true', help='use the input from STDIN via pipes')
    parser.add_argument('-ro', dest="ro", action='store_true', help='output to STDOUT for wikirestore by pipe')
    parser.add_argument('-rmf', dest="rmf", action='store_true', help='flag to remove free text from wiki file')
    parser.add_argument('-rdf', dest="rdf", action='store_true', help='flag to remove duplicate null value fields from wiki file')
    args = parser.parse_args()
    if args.pipein:
        # stdin can only be consumed once, so read it before branching
        names = sys.stdin.readlines()
        if args.rmf:
            for name in names:
                remove_FreeText(name.strip(), restoreOut=args.ro)
        if args.rdf:
            for name in names:
                remove_DuplicateFields(name.strip(), restoreOut=args.ro)
    elif args.path is not None:
        if args.rmf:
            remove_FreeText(args.path, restoreOut=args.ro)
- Usage Statistics:
- The null field 'some person' has been used 223 times; the count can be reproduced as sketched below.
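A minimal sketch of that count over a local backup (the path is a placeholder):

import glob

# Hedged sketch: count occurrences of the 'some person' null value
# across a local wikibackup; the backup path is a placeholder.
total = 0
for name in glob.glob('/path/to/backup/*.wiki'):
    with open(name) as f:
        total += f.read().count('some person')
print(total)  # 223 at the time of writing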
Dates
- Ticket 71: Fix Property End_date and invalid date entries
- 153 improper start date entries.
- 6 improper end date entries.
- Reference page
- End dates (and dates in general) are sometimes filled with arbitrary strings instead of proper dates.
- Open decision: remove the field altogether or fix the values?
- Fixing requires manual intervention.
- Removing the field altogether can be done with a small code snippet, sketched below.
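The snippet itself is not part of this page; a minimal sketch of such a removal, assuming the invalid entries sit in "|End date=..." lines and with the backup path as a placeholder:

import glob
import re

# Hedged sketch: drop "|End date=..." lines whose value does not start
# with a year; the path and the date heuristic are placeholders.
field = re.compile(r'^\|End date=', re.IGNORECASE)
date_like = re.compile(r'^\|End date=\s*\d{4}', re.IGNORECASE)
for name in glob.glob('/path/to/backup/*.wiki'):
    with open(name) as f:
        lines = f.readlines()
    kept = [l for l in lines if not field.match(l) or date_like.match(l)]
    if kept != lines:
        with open(name, 'w') as f:
            f.writelines(kept)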
Acceptance Rate Issue
- Ticket 152: Acceptance rate not calculated
- Statistics for the missing values of Submitted papers:
- With the bash utility grep the following counts can be found:
- Number of pages that have the field "Submitted papers" : 1716
- Number of pages that have the field "Accepted papers" : 1965
- With a small Python code snippet the following can be found:
Code Snippet:
import glob

# count pages that have one of the two fields but not the other;
# forward slashes avoid backslash-escape issues in the original path
allx = glob.glob('/path/to/backup/*.wiki')
nosub = 0  # has "Accepted papers" but no "Submitted papers"
noacc = 0  # has "Submitted papers" but no "Accepted papers"
for name in allx:
    with open(name, 'r') as f:
        event = f.read().lower()
    if 'submitted papers' not in event and 'accepted papers' in event:
        nosub += 1
    elif 'submitted papers' in event and 'accepted papers' not in event:
        noacc += 1
print(nosub)
print(noacc)
- Number of pages that have the field "Submitted papers" but no field of "Accepted papers" : Approximately 63
- Number of pages that have the field "Accepted papers" but no field of "Submitted papers" : Approximately 302
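For reference, once both fields are present the acceptance rate itself is a simple division; a minimal sketch over the same backup (path and field syntax as assumed above):

import glob
import re

# Hedged sketch: acceptance rate = accepted / submitted for pages that
# carry both fields; the backup path is a placeholder.
sub = re.compile(r'\|Submitted papers=\s*(\d+)', re.IGNORECASE)
acc = re.compile(r'\|Accepted papers=\s*(\d+)', re.IGNORECASE)
for name in glob.glob('/path/to/backup/*.wiki'):
    with open(name) as f:
        event = f.read()
    s, a = sub.search(event), acc.search(event)
    if s and a and int(s.group(1)) > 0:
        print('%s: %.1f%%' % (name, 100.0 * int(a.group(1)) / int(s.group(1))))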