我经常要从xls文件中提取信息,然后根据这些信息加工和输出文件(方便其他程序读取)
本文以python2.5和xlrd 0.6.1的版本为工具来编写python代码。
xlrd 0.6.1新增了 XF Class,增加了对表格边框厚度、背景颜色以及文字大小、样式、颜色的支持,这些功能在实际使用中很有用,因为有时候需要通过读取这些信息,来判断表格内容的取舍。
下面是我实现的代码:
1 #!/usr/bin/env python
2
3 ###file:xlrdT3a.py
4
5 class readexcel(object):
6 """ Simple OS Independent Class for Extracting Data from Excel Files
7 the using xlrd module found at http://www.lexicon.net/sjmachin/xlrd.htm
8 Author:snowzjy Email:farsoftwind@gmail.com
9
10 Versions of Excel supported: 2004, 2002, XP, 2000, 97, 95, 5, 4, 3
11 xlrd version tested: 0.6.1
12
13 Data is extracted by creating a iterator object which can be used to
14 return data one row at a time. The default extraction method assumes
15 that the worksheet is in tabular format with the first nonblank row
16 containing variable names and all subsequent rows containing values.
17 This method returns a dictionary which uses the variables names as keys
18 for each piece of data in the row. Data can also be extracted with
19 each row represented by a list.
20
21 Extracted data is represented fairly logically. By default dates are
22 returned as strings in "yyyy/mm/dd" format or "yyyy/mm/dd hh:mm:ss",
23 as appropriate. However, dates can be return as a tuple containing
24 (Year, Month, Day, Hour, Min, Second) which is appropriate for usage
25 with mxDateTime or DateTime. Numbers are returned as either INT or
26 FLOAT, whichever is needed to support the data. Text, booleans, and
27 error codes are also returned as appropriate representations.
28
29 Quick Example:
30 xl = readexcel('testdata.xls')
31 sheetnames = xl.worksheets()
32 for sheet in sheetnames:
33 print sheet
34 for row in xl.getiter(sheet):
35 # Do Something here
36 """
37 def __init__(self, filename):
38 """ Returns a readexcel object of the specified filename - this may
39 take a little while because the file must be parsed into memory """
40 import xlrd
41 import os.path
42 if not os.path.isfile(filename):
43 raise NameError, "%s is not a valid filename" % filename
44 self.__filename__ = filename
45 self.__book__ = xlrd.open_workbook(filename,formatting_info=True)
46 self.__sheets__ = {}
47 self.__sheetnames__ = []
48 for i in self.__book__.sheet_names():
49 uniquevars = []
50 firstrow = 0
51 sheet = self.__book__.sheet_by_name(i)
52 for row in range(firstrow,sheet.nrows):
53 types,values = sheet.row_types(row),sheet.row_values(row)
54 nonblank = False
55 for j in values:
56 if j != '':
57 nonblank=True
58 break
59 if nonblank:
60 # Generate a listing of Unique Variable Names for Use as
61 # Dictionary Keys In Extraction. Duplicate Names will
62 # be replaced with "F#"
63 variables = self.__formatrow__(types,values,False)
64 unknown = 1
65 while variables:
66 var = variables.pop(0)
67 if var in uniquevars or var == '':
68 var = 'F' + str(unknown)
69 unknown += 1
70 uniquevars.append(str(var))
71 firstrow = row + 1
72 break
73 self.__sheetnames__.append(i)
74 self.__sheets__.setdefault(i,{}).__setitem__('rows',sheet.nrows)
75 self.__sheets__.setdefault(i,{}).__setitem__('cols',sheet.ncols)
76 self.__sheets__.setdefault(i,{}).__setitem__('firstrow',firstrow)
77 self.__sheets__.setdefault(i,{}).__setitem__('variables',uniquevars[:])
78 def getiter(self, sheetname, returnlist=False, returntupledate=False):
79 """ Return an generator object which yields the lines of a worksheet;
80 Default returns a dictionary, specifing returnlist=True causes lists
81 to be returned. Calling returntupledate=True causes dates to returned
82 as tuples of (Year, Month, Day, Hour, Min, Second) instead of as a
83 string """
84 if sheetname not in self.__sheets__.keys():
85 raise NameError, "%s is not present in %s" % (sheetname,\
86 self.__filename__)
87 if returnlist:
88 return __iterlist__(self, sheetname, returntupledate)
89 else:
90 return __iterdict__(self, sheetname, returntupledate)
91
92 def getxfiter(self, sheetname, returnlist=False, returntupledate=False):
93 """ Return an generator object which yields the lines of a worksheet;
94 Default returns a dictionary, specifing returnlist=True causes lists
95 to be returned. Calling returntupledate=True causes dates to returned
96 as tuples of (Year, Month, Day, Hour, Min, Second) instead of as a
97 string """
98 if sheetname not in self.__sheets__.keys():
99 raise NameError, "%s is not present in %s" % (sheetname,\
100 self.__filename__)
101 if returnlist:
102 return __XF_iterlist__(self, sheetname, returntupledate)
103 else:
104 return __XF_iterdict__(self, sheetname, returntupledate)
105
106 def worksheets(self):
107 """ Returns a list of the Worksheets in the Excel File """
108 return self.__sheetnames__
109 def nrows(self, worksheet):
110 """ Return the number of rows in a worksheet """
111 return self.__sheets__[worksheet]['rows']
112 def ncols(self, worksheet):
113 """ Return the number of columns in a worksheet """
114 return self.__sheets__[worksheet]['cols']
115 def variables(self,worksheet):
116 """ Returns a list of Column Names in the file,
117 assuming a tabular format of course. """
118 return self.__sheets__[worksheet]['variables']
119 def __formatrow__(self, types, values, wanttupledate):
120 """ Internal function used to clean up the incoming excel data """
121 ## Data Type Codes:
122 ## EMPTY 0
123 ## TEXT 1 a Unicode string
124 ## NUMBER 2 float
125 ## DATE 3 float
126 ## BOOLEAN 4 int; 1 means TRUE, 0 means FALSE
127 ## ERROR 5
128 import xlrd
129 returnrow = []
130 for i in range(len(types)):
131 type,value = types[i],values[i]
132 if type == 2:
133 if value == int(value):
134 value = int(value)
135 elif type == 3:
136 datetuple = xlrd.xldate_as_tuple(value, self.__book__.datemode)
137 if wanttupledate:
138 value = datetuple
139 else:
140 # time only no date component
141 if datetuple[0] == 0 and datetuple[1] == 0 and \
142 datetuple[2] == 0:
143 value = "%02d:%02d:%02d" % datetuple[3:]
144 # date only, no time
145 elif datetuple[3] == 0 and datetuple[4] == 0 and \
146 datetuple[5] == 0:
147 value = "%04d/%02d/%02d" % datetuple[:3]
148 else: # full date
149 value = "%04d/%02d/%02d %02d:%02d:%02d" % datetuple
150 elif type == 5:
151 value = xlrd.error_text_from_code[value]
152 returnrow.append(value)
153 return returnrow
154
155
156 def __iterlist__(excel, sheetname, tupledate):
157 """ Function Used To Create the List Iterator """
158 sheet = excel.__book__.sheet_by_name(sheetname)
159 for row in range(excel.__sheets__[sheetname]['rows']):
160 types,values = sheet.row_types(row),sheet.row_values(row)
161 yield excel.__formatrow__(types, values, tupledate)
162
163 def __iterdict__(excel, sheetname, tupledate):
164 """ Function Used To Create the Dictionary Iterator """
165 sheet = excel.__book__.sheet_by_name(sheetname)
166 for row in range(excel.__sheets__[sheetname]['firstrow'],\
167 excel.__sheets__[sheetname]['rows']):
168 types,values = sheet.row_types(row),sheet.row_values(row)
169 formattedrow = excel.__formatrow__(types, values, tupledate)
170 # Pad a Short Row With Blanks if Needed
171 for i in range(len(formattedrow),\
172 len(excel.__sheets__[sheetname]['variables'])):
173 formattedrow.append('')
174 yield dict(zip(excel.__sheets__[sheetname]['variables'],formattedrow))
175
176 #-----------------------------------------------------------------------------
177
178 def get_color(book, color, styleName) :
179 #Trying default value
180 if color == None:
181 color = book.colour_map[0x7FFF]
182 colorStyle=''
183 if not color == None:
184 colorStyle='%s(%d,%d,%d)' % (styleName,color[0],color[1],color[2])
185 return colorStyle
186
187 def __XF_iterlist__(excel, sheetname, tupledate):
188 import xlrd
189 """ Function Used To Create the Dictionary Iterator """
190 sheet = excel.__book__.sheet_by_name(sheetname)
191 ## if not sheet.nrows:
192 ## return None # skip empty sheets
193 merged ={}
194 skipped ={}
195 for crange in sheet.merged_cells:
196 rlo, rhi, clo, chi = crange
197 merged[(rlo,clo)]=(rhi-rlo,chi-clo) #colspan,rowspan
198 for rx in xrange(rlo, rhi):
199 for cx in xrange(clo, chi):
200 skipped[(rx,cx)] = True
201 skipped[(rlo,clo)] = False
202 for ridx in range(excel.__sheets__[sheetname]['rows']):
203 types,values = sheet.row_types(ridx),sheet.row_values(ridx)
204 rowvalue = excel.__formatrow__(types, values, tupledate)
205
206 bufxf = []
207 row = sheet.row(ridx)
208 if not filter(lambda cell: unicode(cell.value), row):
209 continue
210 print(u'\nrow%d' % (ridx + 1))
211 for cidx, cell in enumerate(row):
212
213 key = (ridx,cidx)
214 if key in skipped and skipped[key]:
215 continue #skipping - merged columns
216 formatIndex = sheet.cell_xf_index(ridx, cidx)
217 format = excel.__book__.computed_xf_list[formatIndex]
218 font = excel.__book__.font_list[format.font_index]
219 color = excel.__book__.colour_map[font.colour_index]
220 bgColor = excel.__book__.colour_map[format.background.pattern_colour_index]
221 style = 'normal'
222 if font.italic:
223 style='italic'
224 decor = 'none'
225 if font.underlined:
226 decor='underline'
227 colspan=''
228 rowspan=''
229 if key in merged:
230 spans = merged[key]
231 rowspan='rowspan="%d"' % spans[0]
232 colspan='colspan="%d"' % spans[1]
233
234 if not color == None and bgColor == None:
235 print(u'col%d' % (cidx + 1)),
236 #buf.append(u'style="font-weight:%(weight)d; font-style:%(style)s; %(colorStyle)s;
237
238 #\%(bgColorStyle)s;\
239 # text-decoration:%(decor)s " %(rowspan)s %(colspan)s>%(value)s' %
240 print(u'%(colorStyle)s;%(bgColorStyle)s' %
241 {
242 #"weight":font.weight,
243 #"style":style,
244 #"value":sheet.cell_value(ridx, cidx),
245 "colorStyle":get_color(excel.__book__,color,'fc'),
246 "bgColorStyle":get_color(excel.__book__,bgColor,'bc'),
247 #"decor":decor,
248 #"colspan":colspan,
249 #"rowspan":rowspan
250 }),
251
252 xfc = (rowvalue[cidx],get_color(excel.__book__,color,'fc'),get_color(excel.__book__,bgColor,'bc'))
253
254
255 bufxf.append(xfc)
256
257 yield bufxf
258
259 def __XF_iterdict__(excel, sheetname, tupledate):
260 import xlrd
261 """ Function Used To Create the Dictionary Iterator """
262 sheet = excel.__book__.sheet_by_name(sheetname)
263 ## if not sheet.nrows:
264 ## return None # skip empty sheets
265 merged ={}
266 skipped ={}
267 for crange in sheet.merged_cells:
268 rlo, rhi, clo, chi = crange
269 merged[(rlo,clo)]=(rhi-rlo,chi-clo) #colspan,rowspan
270 for rx in xrange(rlo, rhi):
271 for cx in xrange(clo, chi):
272 skipped[(rx,cx)] = True
273 skipped[(rlo,clo)] = False
274 for ridx in range(excel.__sheets__[sheetname]['firstrow'],\
275 excel.__sheets__[sheetname]['rows']):
276 types,values = sheet.row_types(ridx),sheet.row_values(ridx)
277 rowvalue = excel.__formatrow__(types, values, tupledate)
278
279 bufxf = []
280 row = sheet.row(ridx)
281 if not filter(lambda cell: unicode(cell.value), row):
282 continue
283 print(u'\nrow%d' % (ridx + 1))
284 for cidx, cell in enumerate(row):
285 key = (ridx,cidx)
286 if key in skipped and skipped[key]:
287 continue #skipping - merged columns
288 formatIndex = sheet.cell_xf_index(ridx, cidx)
289 format = excel.__book__.computed_xf_list[formatIndex]
290 font = excel.__book__.font_list[format.font_index]
291 color = excel.__book__.colour_map[font.colour_index]
292 bgColor = excel.__book__.colour_map[format.background.pattern_colour_index]
293 style = 'normal'
294 if font.italic:
295 style='italic'
296 decor = 'none'
297 if font.underlined:
298 decor='underline'
299 colspan=''
300 rowspan=''
301 if key in merged:
302 spans = merged[key]
303 rowspan='rowspan="%d"' % spans[0]
304 colspan='colspan="%d"' % spans[1]
305
306 if color != None or bgColor != None:
307 print(u'col%d' % (cidx + 1)),
308 #buf.append(u'style="font-weight:%(weight)d; font-style:%(style)s; %(colorStyle)s;
#\(bgColorStyle)s;\
309 # text-decoration:%(decor)s " %(rowspan)s %(colspan)s>%(value)s' %
310 print(u'%(colorStyle)s %(bgColorStyle)s;' %
311 {
312 #"weight":font.weight,
313 #"style":style,
314 #"value":sheet.cell_value(ridx, cidx),
315 "colorStyle":get_color(excel.__book__,color,'fc'),
316 "bgColorStyle":get_color(excel.__book__,bgColor,'bc'),
317 #"decor":decor,
318 #"colspan":colspan,
319 #"rowspan":rowspan
320 }),
321
322 xfc = (rowvalue[cidx],u'%(colorStyle)s'%{"colorStyle":get_color(excel.__book__,color,'fc')},\
323 u'%(bgColorStyle)s'%{"bgColorStyle":get_color(excel.__book__,bgColor,'bc')})
324 bufxf.append(xfc)
325
326 yield dict(zip(excel.__sheets__[sheetname]['variables'],bufxf))
__iterlist__ 和__iterdict__分别使用list和字典方式获取表格单元格的内容信息.__XF_iterlist__和__XF_iterdict__分别使用list和字典方式获取表格单元格的格式信息.
由于在我这里的程序中只用到字体颜色和单元格背景色,为了简化期间,将312-319的其他信息注了,如果要使用他们,取消注释,然后再将其加在xfc列表里就行了。
当我们要使用它从xls文件中提取内容信息时,可以如下面的范例使用:
1 #!/usr/bin/env python
2 from xlrdT3a import *
3
4
5
6 xl = readexcel('test.xls')
7 sheetnames = xl.worksheets()
8 for sheet in sheetnames:
9 stringfile = open(sheet, 'w')
10 print '\n'
11 print sheet
12 for row in xl.getiter(sheet):
13 #print row
14 str = u"%s \t\t\"%s\"\n" %(row['ID'],row['something'])
15 stringfile.write(str.encode('shift_jis','backslashreplace'))#'replace'
16
17 ## for row in xl.getiter(sheet,True):
18 ## str = u"%s \"%s\"\n" %(row[0],row[10])
19 ## stringfile.write(str.encode('shift_jis','backslashreplace'))#'replace'
20
21 stringfile.close()
第9行使用'w'写的方式打开文件,如果不存在,就自动创建.
其中12行的那个for循环是使用字典方式,如果使用列表方式,请使用17,18,19行注释中的代码替换就行了。
第15行编码使用日文,在这里,xls文件中是unicode,输出文件可以根据自己的需要修改就行了,也可以直接输出为unicode。
在xlrdT3a.py中第50行firstrow = 0,意思是讲将test.xls中的第一行的每个单元格的内容作为列的索引,上面程序的第14行就是这样使用的.如果要将表格的第二行内容作为列的索引,可以令
firstrow = 1 .当我们要使用它从xls文件中提取格式信息时,可以如下面的范例使用:
1 xl = readexcel('test.xls')
2 sheetnames = xl.worksheets()
3 for sheet in sheetnames:
4 buf = ''
5 stringfile = open(sheet,'r')
6 buf1 = stringfile.read()
7 buf = buf1.decode('shift_jis').encode('utf-8')
8 stringfile.close()
9 stringfile2 = open(sheet + '_fix','w')
10 print '\n'
11 print sheet
12
13
14 for row2 in xl.getxfiter(sheet,True):
15 if row2[10][2] == 'bc(255,255,0)':
16 str = u"%s \"%s\"\n" %(row2['ID'][0],row2['something'][0])
17 stringfile2.write(str.encode('shift_jis','backslashreplace'))#'replace'
18
19
20 ## for row2 in xl.getxfiter(sheet,True):
21 ## if row2[10][2] == 'bc(255,255,0)':
22 ## str = u"%s \"%s\"\n" %(row2[0][0],row2[10][0])
23 ## stringfile2.write(str.encode('shift_jis','backslashreplace'))#'replace'
24
25
26 stringfile2.close()
第17行和第23行
'backslashreplace'意思是将不能转换的字符转换为'\uxxxx'的unicode4位编码字符串,如果是'replace',则简单的将其替换为'?'
这个程序的意思是如果某个单元格的背景色为黄色,则将这个单元格所在行对应的'ID'和'something'列对应打印输出.