Workdocumentation 2021-03-30

From OPENRESEARCH mk copy Wiki

Red Links and Data Fixations

Broken redirects

Broken file links

Broken properties and errors

  • All invalid values are stored in the property Property:Has improper value for.

Ordinals

  • 119 Text Values in Ordinal Field
  • To fix the ordinals, the following approaches can be used:
    • This approach finds all events with improper ordinals and edits them in place:
wikiedit -t wikiId -q "[[Has improper value for::Ordinal]]" --search "(\|Ordinal=[0-9]+)(?:st|nd|rd|th)\b" --replace "\1"
  • Alternatively, a code snippet can be combined with wikibackup and bash tools for targeted editing of pages: Code Snippet
  • Pipeline usage:
grep Ordinal /path/to/backup -l -r | python ordinal_to_cardinal.py -stdin -d '../dictionary.yaml' -ro -f | wikirestore -t ormk -stdinp -ui
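The substitution used by the wikiedit call above can be dry-run locally with Python's re module before touching the wiki; the sample values below are invented for illustration:

```python
import re

# same pattern as the wikiedit call: keep the digits, drop the ordinal suffix
pattern = r"(\|Ordinal=[0-9]+)(?:st|nd|rd|th)\b"

sample = "|Ordinal=23rd"
fixed = re.sub(pattern, r"\1", sample)
print(fixed)  # |Ordinal=23
```

Already-cardinal values such as |Ordinal=5 are left untouched, since the suffix group is required for a match.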

Improper Null values for Has person

  • 150 Improper Null Value handling
  • Has person was using "some person" as a null value. The usage was inconsistent: the free text of an event would say "some person" while the Wikison format info already contained the person's name.
  1. The first way to handle this is to remove the free text altogether. A code snippet was used together with the bash utility grep.

Code Snippet:

import ntpath

def remove_FreeText(path, restoreOut=False):
    '''
    Remove free text from a given wiki page, keeping only the
    {{...}} template blocks.
    Args:
        path(str): path to the wiki file
        restoreOut(bool): if True, write the cleaned page to the
            restore directory for wikirestore
    '''
    with open(path, 'r') as f:
        eventlist = f.readlines()
    i = 0
    newwiki = ''
    while i < len(eventlist):
        if "{{" in eventlist[i]:
            # copy lines until the closing }} of the template block
            while i < len(eventlist):
                newwiki += eventlist[i]
                if "}}" in eventlist[i]:
                    break
                i += 1
        i += 1
    if restoreOut:
        if newwiki == '':
            return
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        print(fullpath)
        ensureDirectoryExists(dirname)
        with open(fullpath, 'w') as f:
            f.write(newwiki)
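The template-keeping idea can be exercised standalone on an in-memory page; the sample content below is invented for illustration:

```python
# minimal re-implementation of the template-keeping loop for a quick check:
# only lines between {{ and }} survive, everything else is dropped
def keep_templates(lines):
    i = 0
    kept = ''
    while i < len(lines):
        if "{{" in lines[i]:
            while i < len(lines):
                kept += lines[i]
                if "}}" in lines[i]:
                    break
                i += 1
        i += 1
    return kept

page = [
    "Some free text about the event\n",
    "{{Event\n",
    "|Acronym=DEMO 2021\n",
    "}}\n",
    "More free text\n",
]
print(keep_templates(page))  # only the {{Event ... }} block remains
```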

Usage:

grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rmf | wikirestore -t ormk -stdinp -ui

Output result: ICFHR 2020

  2. The second way is to remove only the 'some person' entry from the wiki free text. A Python snippet is used with the bash utility grep.

Code Snippet:

import re
import ntpath

def remove_DuplicateFields(path, restoreOut=False):
    '''
    Remove 'some person' null-value lines from a wiki page when the
    same property also occurs with a real value elsewhere on the page.
    '''
    with open(path, 'r') as f:
        event = f.read()
    flag = False
    eventlist = event.split('\n')
    for line in eventlist:
        if 'some person' in line:
            # extract the property name, e.g. 'Has person' from [[Has person::some person]]
            chk = re.findall(r'\[\[(.*?)::', line)
            if chk and event.count(chk[0]) > 1:
                flag = True
                event = event.replace(line, '')
    if restoreOut and flag:
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        print(fullpath)
        ensureDirectoryExists(dirname)
        with open(fullpath, 'w') as f:
            f.write(event)
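The duplicate-null-value check can be traced on a two-line example; the page content below is invented for illustration:

```python
import re

# drop a [[Has person::some person]] line when the same property
# also appears with a real value elsewhere on the page
event = "[[Has person::Jane Doe]]\n[[Has person::some person]]\n"
for line in event.split('\n'):
    if 'some person' in line:
        prop = re.findall(r'\[\[(.*?)::', line)
        if prop and event.count(prop[0]) > 1:
            event = event.replace(line, '')
print(event)
```

Since "Has person" occurs twice, the null-value line is removed and only the real entry survives.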

Usage:

grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rdf | wikirestore -t ormk -stdinp -ui

Output result: ACCV_2020


Main Function for usage:

import sys
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', dest="path", help='path to the file')
    parser.add_argument('-stdin', dest="pipein", action='store_true', help='use the input from STDIN via pipes')
    parser.add_argument('-ro', dest="ro", action='store_true', help='output to STDOUT for wikirestore by pipe')
    parser.add_argument('-rmf', dest="rmf", action='store_true', help='flag to remove free text from the wiki file')
    parser.add_argument('-rdf', dest="rdf", action='store_true', help='flag to remove duplicate null value fields from the wiki file')
    args = parser.parse_args()
    if args.pipein:
        # read STDIN once; a second readlines() would return nothing
        allx = sys.stdin.readlines()
        if args.rmf:
            for i in allx:
                remove_FreeText(i.strip(), restoreOut=args.ro)
        if args.rdf:
            for i in allx:
                remove_DuplicateFields(i.strip(), restoreOut=args.ro)
    elif args.path is not None:
        if args.rmf:
            remove_FreeText(args.path, restoreOut=args.ro)
        if args.rdf:
            remove_DuplicateFields(args.path, restoreOut=args.ro)
  • Usage Statistics:
    • The null field 'some person' has been used 223 times

Dates

  • 71 Fix Property End_date and invalid date entries
  • 153 improper start date entries.
  • 6 improper end date entries.
  • Reference page
  • End dates, and dates in general, are sometimes filled with free-form strings instead of valid dates.
  • Decision: remove the field altogether, or fix the entries?
    • Fixing requires manual intervention.
    • Removing a field altogether can be done with a small code snippet.
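A minimal sketch of such a field removal, assuming the date is stored as a |End date=... template parameter; the helper name and sample page are illustrative:

```python
import re

def drop_field(wikitext, field):
    # remove every '|<field>=...' parameter line from the page
    return re.sub(r'^\|%s=.*\n?' % re.escape(field), '', wikitext, flags=re.MULTILINE)

page = "{{Event\n|Start date=2021/03/30\n|End date=sometime in spring\n}}\n"
print(drop_field(page, 'End date'))  # the invalid End date line is gone
```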

Acceptance Rate Issue

  • 152 Acceptance Rate Not calculated
  • Statistics for the missing values of "Submitted papers" and "Accepted papers":
  • These can be found with bash utilities:
    • Number of pages that have the field "Submitted papers": 1716
    • Number of pages that have the field "Accepted papers": 1965
  • With a small Python code snippet the following can be found:

Code Snippet:

import glob

# forward slashes: in a non-raw string, backslashes like '\t' are escape characters
allx = glob.glob('path/to/backup/*.wiki')
nosub = 0  # pages with "Accepted papers" but no "Submitted papers"
noacc = 0  # pages with "Submitted papers" but no "Accepted papers"
for i in allx:
    with open(i, 'r') as f:
        event = f.read().lower()
    if 'submitted papers' not in event and 'accepted papers' in event:
        nosub += 1
    elif 'submitted papers' in event and 'accepted papers' not in event:
        noacc += 1
print(nosub)
print(noacc)
    • Number of pages that have the field "Submitted papers" but no "Accepted papers" field: approximately 63
    • Number of pages that have the field "Accepted papers" but no "Submitted papers" field: approximately 302
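Where both fields are present, the missing acceptance rate can be derived directly from them; a minimal sketch, with the helper name and sample values invented for illustration:

```python
import re

def acceptance_rate(wikitext):
    # parse '|Submitted papers=' and '|Accepted papers=' and return the rate in percent
    sub = re.search(r'\|Submitted papers=(\d+)', wikitext)
    acc = re.search(r'\|Accepted papers=(\d+)', wikitext)
    if sub and acc and int(sub.group(1)) > 0:
        return round(100 * int(acc.group(1)) / int(sub.group(1)), 1)
    return None  # not computable when either field is missing

page = "{{Event\n|Submitted papers=200\n|Accepted papers=50\n}}"
print(acceptance_rate(page))  # 25.0
```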